6 Data pre-processing

6.1 Getting started

These pages set out the process for loading, cleaning, reshaping and recoding provided datasets for calculating indicator values as a proof of concept, and creating a health profile from those indicators.

To facilitate reproducibility, the analyses are written in R as a notebook and a set of R scripts which do much of the pre-processing. The full code is available as a GitHub repository at https://github.com/julianflowers/poc. The files can be cloned (downloaded) as follows:

In RStudio, go to “File > New Project”
Click on “Version Control: Checkout a project from a version control repository”
Click on “Git: Clone a project from a repository”
Fill in the info: URL: use the repository's HTTPS address
Create as a subdirectory of: Browse to where you would like to create this folder
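
The same clone can also be made from the R console rather than the RStudio dialog. The following is a minimal sketch, assuming the `git` command-line client is installed and on the system path. The helper name `clone_poc()` is hypothetical, and the default destination is an assumption chosen to match the path used in the `source()` call later in this chapter.

```r
# Repository address given in the text above
repo_url <- "https://github.com/julianflowers/poc"

# Hypothetical helper: clone the repository into `dest` using the git
# command-line client (assumed to be installed and on the PATH)
clone_poc <- function(dest = "~/proof-of-concept") {
  dest <- path.expand(dest)
  if (dir.exists(dest)) {
    stop("Destination already exists: ", dest)
  }
  # system2() returns the exit status of git; 0 means success
  status <- system2("git", c("clone", repo_url, shQuote(dest)))
  invisible(status == 0)
}

# Usage (not run here because it needs network access):
# clone_poc("~/proof-of-concept")
```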

6.1.1 The indicators

Four sets of indicators are used for this proof of concept (PoC):

Methicillin-resistant Staph. aureus and 3rd-generation cephalosporin-resistant E. coli: proportion of samples tested which are resistant
Flu vaccination coverage rates
Smoking rates in women
Injury admission rates

Fully specified, the indicators are:

The proportion of blood samples which grow either Staph. aureus or E. coli, which are tested for antibiotic sensitivity and which are found to be resistant to oxacillin or 3rd-generation cephalosporins respectively for time period X to Y, stratified by [area] / [time period] / [age group] / [gender]
The proportion of the population which has been vaccinated against flu for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
The rate of smoking in women aged 15+ / 18-44, per 100,000 female population for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
The rate of hospitalisation for injury for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
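
The specifications above reduce to two calculations: a proportion (numerator over the number tested or eligible) and a rate per 100,000 population. The following is a minimal sketch of both using made-up counts. The function names, column names and figures are illustrative assumptions, not values from the provided datasets.

```r
# Proportion as a percentage, e.g. resistant isolates / isolates tested
proportion_pct <- function(numerator, denominator) {
  100 * numerator / denominator
}

# Crude rate per `per` population, e.g. admissions per 100,000 residents
crude_rate <- function(numerator, population, per = 100000) {
  per * numerator / population
}

# Made-up resistance counts for two hypothetical regions
amr <- data.frame(
  region    = c("Region A", "Region B"),
  resistant = c(12, 5),
  tested    = c(48, 50)
)
amr$pct_resistant <- proportion_pct(amr$resistant, amr$tested)

# Made-up injury admissions and populations for the same regions
injury <- data.frame(
  region     = c("Region A", "Region B"),
  admissions = c(30, 45),
  population = c(120000, 90000)
)
injury$rate_per_100k <- crude_rate(injury$admissions, injury$population)

amr
injury
```

In practice each calculation would be applied within every stratum (area, time period, age group, gender) rather than to regional totals.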

6.1.2 Pre-processing script

As a first step we will run a script which does the following:

Imports the datasets for each indicator
Loads KSA 2022 census data by age, gender and region¹
Makes variable names consistent
Recodes region names so that they match the names used in the Census 2022 data
Maps directorate names (used in smoking data) to region²
Ensures numeric variables (e.g. age) are converted to numbers
Creates a set of intermediate data tables
Saves a file of reshaped population data
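
Three of the steps above (consistent variable names, recoding region names, and converting text to numbers) can be sketched in base R as follows. This is an illustration on made-up data, not the contents of `pre-process.R`. The column names, the helper `clean_names()` and the region spellings in `region_lookup` are all hypothetical.

```r
# Made-up raw extract with inconsistent names and types (not real data)
raw <- data.frame(
  `Region Name` = c("Aseer", "Riyadh ", "Aseer"),
  `AGE`         = c("34", "7", "61"),
  `Gender`      = c("F", "M", "F"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Consistent variable names: lower case, non-alphanumerics become "_"
clean_names <- function(x) {
  x <- tolower(trimws(x))
  gsub("[^a-z0-9]+", "_", x)
}
names(raw) <- clean_names(names(raw))
names(raw)[names(raw) == "region_name"] <- "region"

# Recode region spellings to the spelling used in the population data,
# using a named lookup vector (hypothetical spellings)
region_lookup <- c("Aseer" = "Asir", "Riyadh" = "Riyadh")
raw$region <- unname(region_lookup[trimws(raw$region)])

# Convert numeric variables stored as text to numbers
raw$age <- as.numeric(raw$age)

str(raw)
```

A named lookup vector keeps the recoding in one place, so any spelling that is missing from the lookup shows up as `NA` and can be checked rather than passing through silently.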

The intermediate tables can be reused for further analysis and generating profiles.

The script can be run by typing

`source("~/proof-of-concept/scripts/pre-process.R")`

at the prompt in the console.

This generates a set of objects in the R environment:

Code

objects()

[1] "RETICULATE_PYTHON"

Objects called `dfs` hold the results of pre-processing the original datasets. Objects whose names contain `names` hold the different variable names used for age, gender and area across the datasets. Those called `sc` are used to map the locations of smoking clinics as part of recoding directorates to regions for the smoking data (see below).
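
One way to find these groups of objects is to filter the environment listing by name. The following is a minimal sketch, assuming that `dfs`, `names` and `sc` are fragments of the object names. The object names created here are hypothetical stand-ins, since the real ones come from `pre-process.R`.

```r
# Stand-in environment with hypothetical object names
env <- new.env()
assign("dfs_flu", data.frame(), envir = env)
assign("dfs_injury", data.frame(), envir = env)
assign("age_names", character(), envir = env)
assign("sc_locations", data.frame(), envir = env)

# ls() accepts a regular expression to filter object names
dfs_objects   <- ls(envir = env, pattern = "dfs")
names_objects <- ls(envir = env, pattern = "names")
sc_objects    <- ls(envir = env, pattern = "^sc")

dfs_objects
names_objects
sc_objects
```

In an interactive session after sourcing the script, calling `ls(pattern = "dfs")` without the `envir` argument would search the global environment instead.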

The object `regional_counts_complete` is a data frame of region-, age-band- and gender-specific counts for each indicator, and forms the basis of indicator generation. Note that it includes region, age-band and gender combinations for which there is no data, because these combinations are not present in the data provided (although they may be present in the full datasets).
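
A "complete" table of this kind can be built by generating every region, age-band and gender combination and left-joining the observed counts onto it, so that missing combinations appear as `NA`. The following is a sketch on made-up data. It assumes the age bands were produced with `cut(..., right = FALSE)`, which gives labels of the form `[0,5)`. The actual script may do this differently, and all names and values here are hypothetical.

```r
# Made-up observed counts (not real data); ages 2, 7 and 3 are banded
# into five-year, left-closed intervals such as "[0,5)"
observed <- data.frame(
  region   = c("Region A", "Region A", "Region B"),
  age_band = as.character(cut(c(2, 7, 3), breaks = seq(0, 10, by = 5),
                              right = FALSE)),
  gender   = c("f", "m", "f"),
  count    = c(140, 1, 7),
  stringsAsFactors = FALSE
)

# Every possible region x age band x gender combination
all_combos <- expand.grid(
  region   = unique(observed$region),
  age_band = unique(observed$age_band),
  gender   = c("f", "m"),
  stringsAsFactors = FALSE
)

# Left join: combinations with no observed data get count = NA
complete_counts <- merge(all_combos, observed, all.x = TRUE)

complete_counts
```

Keeping the missing combinations as `NA` rather than zero distinguishes "no data provided" from "no events", which matters when rates are calculated later.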

Code

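library(readr)  # provides read_csv()

# Load the region, age-band and gender specific counts
regional_counts_complete <- read_csv("data/regional_counts.csv")

# Show the first six rows as a formatted table
regional_counts_complete |>
    head() |>
    flextable::flextable()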

Region	age_band	sum_f_Smoking	sum_f_flu	sum_f_injury	sum_m_Smoking	sum_m_flu	sum_m_injury
`Asir	[0,5)		140	160		26	0
`Asir	[5,10)	0	7		1	1
`Asir	[10,15)	5	4		355	2
`Asir	[15,20)	39	13		2,744	4
`Asir	[20,25)		79			25
`Asir	[25,30)	1	76		21	22

Table 6.1: Region age-band specific counts by gender and indicator

6.1.3 Mapping health directorates to regions for smoking data

The smoking data provided is clinic-based data disaggregated at the level of health directorate. There are 20 directorates in KSA, and 13 regions.

Available population data is at regional level, so to generate population denominators for the smoking data I have mapped directorates to regions as follows:

I used a dataset which contained spatial locations of healthcare facilities (smoking cessation clinics) at health directorate level³
For each location I extracted spatial coordinates (longitude and latitude)
I obtained a regional boundary file (shapefile) from https://data.humdata.org/dataset/41ce9023-1d21-4549-a485-94316200aba0/resource/99834c81-ad34-415e-91c5-af053d8e55b4/download/sau_capp_adm1_1m_ocha.zip
I spatially joined the clinic location and boundary files (see Figure 6.1)
This created a lookup table for directorates and regions, which enabled the smoking data to be recoded to regions and rates to be calculated using the census regional population estimates
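
The spatial join described above can be sketched with the sf package. This is a minimal illustration on made-up geometry (two square "regions" and three clinic points), not the script used for the analysis. The region names, directorate names and coordinates are hypothetical.

```r
library(sf)

# Helper to build a square polygon (stand-in for a regional boundary)
make_square <- function(xmin, ymin, size = 1) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin), c(xmin + size, ymin + size),
    c(xmin, ymin + size), c(xmin, ymin)
  )))
}

# Two hypothetical regions in longitude/latitude (EPSG:4326)
regions <- st_sf(
  region   = c("Region A", "Region B"),
  geometry = st_sfc(make_square(0, 0), make_square(1, 0), crs = 4326)
)

# Hypothetical clinic locations with their directorate names
clinics <- data.frame(
  directorate = c("Directorate 1", "Directorate 2", "Directorate 3"),
  lon = c(0.5, 1.5, 1.2),
  lat = c(0.5, 0.5, 0.8)
)
clinics_sf <- st_as_sf(clinics, coords = c("lon", "lat"), crs = 4326)

# Spatial join: attach the region whose polygon contains each clinic
joined <- st_join(clinics_sf, regions, join = st_within)

# Directorate-to-region lookup table, with the geometry dropped
lookup <- unique(st_drop_geometry(joined)[, c("directorate", "region")])

lookup
```

If clinics from one directorate fall in more than one region, the lookup will contain more than one row for that directorate, so a rule for choosing a single region (for example, the region containing most of its clinics) would be needed before recoding.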

Code

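library(sf)       # provides read_sf()
library(ggplot2)  # provides ggplot(), geom_sf() and aes()

# Read the regional boundaries and the smoking clinic locations
sa_bound <- read_sf("data/ksa_bound.gml")
sc_ll_sf <- read_sf("data/smok_loc.gml")

# Plot the clinics, coloured by the `name` variable, over the boundaries
sc_ll_sf |>
    ggplot() +
    geom_sf(data = sa_bound) +
    geom_sf(aes(colour = name))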

Figure 6.1: Location map of smoking clinics mapped to KSA regional boundaries

¹ The census data was downloaded from https://tableau.saudicensus.sa/#/views/TA3-PopulationbydetailedAgebyRegionGovernorateNationalityandGenderAR_16850208449070/PopulationbydetailedAgebyRegionGovernorateNationalityandGenderARCSV.csv and variable and region names were translated to English using ChatGPT4o
² This uses the spatial locations of smoking clinics, which include directorate names, to map to KSA regional boundaries
³ https://www.moh.gov.sa/en/Ministry/Projects/TCP/Pages/default.aspx
