These pages set out the process for loading, cleaning, reshaping and recoding provided datasets for calculating indicator values as a proof of concept, and creating a health profile from those indicators.
To facilitate reproducibility, the analyses are written in R as a notebook, together with R scripts which do much of the pre-processing. The full code is available as a GitHub repository at https://github.com/julianflowers/poc. The files can be cloned (downloaded) as follows:
In RStudio, go to “File > New Project”
Click on “Version Control: Checkout a project from a version control repository”
Click on “Git: Clone a project from a repository”
Fill in the details. Repository URL: use the HTTPS address of the repository
Create as a subdirectory of: Browse to where you would like to create this folder
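The same clone can also be made from the R console rather than through the RStudio menus. The sketch below is one possible approach, assuming `git` is installed and on the system path. The helper name `clone_poc()` and the destination folder in the example call are illustrative and are not part of the repository; only the repository URL is taken from the text above.

```r
# Hypothetical helper: clone the PoC repository with the system git client.
# Only the repository URL comes from the text; everything else is assumed.
repo_url <- "https://github.com/julianflowers/poc"

clone_poc <- function(dest_dir, url = repo_url) {
  if (!nzchar(Sys.which("git"))) {
    stop("git was not found on the system path")
  }
  if (dir.exists(dest_dir)) {
    stop("destination folder already exists: ", dest_dir)
  }
  # Equivalent to running `git clone <url> <dest_dir>` in a terminal;
  # returns git's exit status (0 on success)
  system2("git", c("clone", url, shQuote(dest_dir)))
}

# Example (not run): clone_poc(file.path(path.expand("~"), "poc"))
```

The function refuses to overwrite an existing folder, so it is safe to call more than once.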
Four sets of indicators are used for this PoC:
Methicillin-resistant Staph. aureus and 3rd generation cephalosporin-resistant E. coli - proportion of samples tested which are resistant
Flu vaccination coverage rates
Smoking rates in women
Injury admission rates
Fully specifying the indicators
The proportion of blood samples which grow either Staph. aureus or E. coli, which are tested for antibiotic sensitivity and which are found to be resistant to oxacillin or 3rd generation cephalosporins respectively, for the time period X to Y, stratified by [area] / [time period] / [age group] / [gender]
The proportion of the population which has been vaccinated against flu for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
The rate of smoking in women aged 15+ / 18-44, per 100,000 female population for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
The rate of hospitalisation for injury for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
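The specifications above reduce to two calculations: a proportion (numerator divided by the number tested or eligible) and a rate per 100,000 population. A minimal sketch of both follows. The function names, column names and counts are all made up for illustration, and expressing the proportion as a percentage is an assumption rather than something the specifications state.

```r
# Proportion indicator (e.g. resistant samples / samples tested), as a percentage
indicator_proportion <- function(numerator, denominator) {
  ifelse(denominator > 0, 100 * numerator / denominator, NA_real_)
}

# Rate indicator per 100,000 population (e.g. injury admissions / population)
indicator_rate <- function(events, population, per = 1e5) {
  ifelse(population > 0, per * events / population, NA_real_)
}

# Made-up counts for two hypothetical strata, purely for illustration
toy <- data.frame(
  stratum    = c("stratum 1", "stratum 2"),
  resistant  = c(12, 30),
  tested     = c(60, 120),
  admissions = c(45, 0),
  population = c(90000, 0)
)
toy$pct_resistant <- indicator_proportion(toy$resistant, toy$tested)
toy$injury_rate   <- indicator_rate(toy$admissions, toy$population)
toy
```

Strata with a zero denominator return `NA` rather than a division error, which keeps empty strata visible in the output.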
As a first step we will run a script which does a number of things.
Imports the datasets for each indicator
Loads KSA 2022 census data by age, gender and region¹
Makes variable names consistent
Recodes region names so that they match the names used in the Census 2022 data
Maps directorate names (used in smoking data) to region²
Ensures numeric variables (e.g. age) are converted to numbers
Creates a set of intermediate data tables
Saves a file of reshaped population data
The intermediate tables can be reused for further analysis and generating profiles.
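Three of the steps listed above (consistent variable names, recoded region names and numeric conversion) can be sketched in base R as follows. This is an illustration under stated assumptions, not the contents of the pre-processing script: the toy data, the column names and the `region_lookup` vector are all hypothetical.

```r
# Toy input standing in for one provided dataset; all names and values are made up
raw <- data.frame(
  `Region Name` = c("Region A (old spelling)", "Region B"),
  `AGE`         = c("34", "7"),
  `Gender`      = c("F", "M"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Make variable names consistent: lower case, underscores instead of spaces
names(raw) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(raw)))

# 2. Recode region names to the spelling used in the census table
#    (hypothetical lookup; unmatched names are left unchanged)
region_lookup <- c("Region A (old spelling)" = "Region A")
matched <- raw$region_name %in% names(region_lookup)
raw$region_name[matched] <- unname(region_lookup[raw$region_name[matched]])

# 3. Ensure numeric variables stored as text are converted to numbers
raw$age <- as.numeric(raw$age)

str(raw)
```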
The script can be run by typing
`source("~/proof-of-concept/scripts/pre-process.R")`
at the prompt in the console.
This generates a set of objects in the R environment:
objects()
[1] "RETICULATE_PYTHON"
Objects called `dfs` hold the pre-processed versions of the original datasets. Objects with `names` in their name hold the different variable names used for age, gender and area across datasets. Those called `sc` are used to map the locations of smoking clinics as part of recoding directorates to regions for the smoking data (see below).
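Objects that follow a naming convention like this can be listed by pattern. The sketch below uses placeholder objects in a scratch environment; the object names are assumptions based on the description above and may not match what the script creates.

```r
# Illustrative only: create a few placeholder objects in a scratch environment.
# The object names are assumptions, not the names used by the script.
env <- new.env()
assign("dfs_flu", data.frame(), envir = env)
assign("age_names", character(), envir = env)
assign("sc_locations", data.frame(), envir = env)

# List objects by name pattern
ls(env, pattern = "^dfs")   # pre-processed datasets
ls(env, pattern = "names")  # variable-name collections
ls(env, pattern = "^sc")    # smoking clinic objects
```

In an interactive session the same idea applies to the workspace itself, for example `objects(pattern = "^dfs")`.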
The object `regional_counts_complete` is a data frame of counts for each indicator by region, age band and gender, and forms the basis of indicator generation. Note that it includes region, age-band and gender combinations for which there is no data, because these combinations are not present in the data provided (although they may be present in the full datasets).
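library(readr)  # provides read_csv()
# Load the regional counts saved by the pre-processing script
regional_counts_complete <- read_csv("data/regional_counts.csv")
# Show the first six rows as a formatted table
regional_counts_complete |>
  head() |>
  flextable::flextable()

| Region | age_band | sum_f_Smoking | sum_f_flu | sum_f_injury | sum_m_Smoking | sum_m_flu | sum_m_injury |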
|---|---|---|---|---|---|---|---|
`Asir | [0,5) | 140 | 160 | 26 | 0 | ||
`Asir | [5,10) | 0 | 7 | 1 | 1 | ||
`Asir | [10,15) | 5 | 4 | 355 | 2 | ||
`Asir | [15,20) | 39 | 13 | 2,744 | 4 | ||
`Asir | [20,25) | 79 | 25 | ||||
`Asir | [25,30) | 1 | 76 | 21 | 22 |
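The `[0,5)` style labels in the table are consistent with left-closed five-year age bands, and the empty cells reflect combinations with no data. The sketch below shows one way such a completed table could be built in base R. It uses made-up records and hypothetical column names, and whether the pre-processing script works this way is an assumption.

```r
# Made-up record-level data for illustration; column names are assumptions
records <- data.frame(
  region = c("Region A", "Region A", "Region B"),
  gender = c("f", "m", "f"),
  age    = c(3, 17, 22)
)

# Left-closed five-year bands, which gives labels such as "[0,5)" and "[15,20)"
records$age_band <- cut(records$age, breaks = seq(0, 25, by = 5), right = FALSE)

# Count records for the combinations that are present
counts <- aggregate(n ~ region + gender + age_band,
                    data = transform(records, n = 1), FUN = sum)

# Build every region x gender x age-band combination, then left-join the counts
# so that combinations with no data appear with NA rather than being dropped
full_grid <- expand.grid(
  region   = unique(records$region),
  gender   = unique(records$gender),
  age_band = levels(records$age_band),
  stringsAsFactors = FALSE
)
counts$age_band <- as.character(counts$age_band)
complete_counts <- merge(full_grid, counts, all.x = TRUE)

nrow(complete_counts)
```

Keeping the empty combinations explicit means every region has the same set of rows, which simplifies joining to population denominators later.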
The smoking data provided is clinic-based data disaggregated at the level of health directorate. There are 20 directorates and 13 regions in KSA.
Available population data is at regional level, so to generate population denominators for the smoking data I have mapped directorates to regions as follows:
I used a dataset which contained spatial locations of healthcare facilities (smoking cessation clinics) at health directorate level³
For each location I extracted spatial coordinates (longitude and latitude)
I obtained a regional boundary file (shape file) from https://data.humdata.org/dataset/41ce9023-1d21-4549-a485-94316200aba0/resource/99834c81-ad34-415e-91c5-af053d8e55b4/download/sau_capp_adm1_1m_ocha.zip
I spatially joined the clinic location and boundary files (see
This created a lookup table of directorates and regions, which enabled the smoking data to be recoded to regions and rates to be calculated using the census regional population estimates
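The point-in-polygon join described in these steps can be sketched with the `sf` package. The example below uses two made-up square regions and three made-up clinic locations in place of the real boundary and clinic files; the column names `directorate` and `region` are assumptions, and the actual script may differ.

```r
library(sf)

# Two made-up square "regions" standing in for the regional boundary file
square <- function(xmin, ymin, size = 5) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin), c(xmin + size, ymin + size),
    c(xmin, ymin + size), c(xmin, ymin)
  )))
}
regions <- st_sf(
  region   = c("Region A", "Region B"),
  geometry = st_sfc(square(40, 20), square(45, 20), crs = 4326)
)

# Made-up clinic locations with the directorate each clinic belongs to
clinics <- st_as_sf(
  data.frame(
    directorate = c("Directorate 1", "Directorate 2", "Directorate 3"),
    lon = c(42, 47, 43),
    lat = c(22, 23, 24)
  ),
  coords = c("lon", "lat"), crs = 4326
)

# Spatial join: attach to each clinic the region whose polygon contains it
joined <- st_join(clinics, regions, join = st_within)

# Drop the geometry and de-duplicate to get a directorate-to-region lookup
lookup <- unique(st_drop_geometry(joined)[, c("directorate", "region")])
lookup
```

If clinics from one directorate fall in more than one region, the lookup will contain more than one row for that directorate, and a rule for resolving such cases would be needed.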
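library(sf)       # read_sf() and spatial classes
library(ggplot2)  # ggplot(), geom_sf()
# Regional boundaries and smoking clinic locations
sa_bound <- read_sf("data/ksa_bound.gml")
sc_ll_sf <- read_sf("data/smok_loc.gml")
# Plot clinics over the regional boundaries, coloured by the `name` column
sc_ll_sf |>
  ggplot() +
  geom_sf(data = sa_bound) +
  geom_sf(aes(colour = name))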
¹ Note. The census data was downloaded from https://tableau.saudicensus.sa/#/views/TA3-PopulationbydetailedAgebyRegionGovernorateNationalityandGenderAR_16850208449070/PopulationbydetailedAgebyRegionGovernorateNationalityandGenderARCSV.csv and the variable and region names were translated to English using ChatGPT-4o.
² This uses the spatial locations of smoking clinics, which include directorate names, to map directorates to KSA regional boundaries.
³ https://www.moh.gov.sa/en/Ministry/Projects/TCP/Pages/default.aspx