6  Data pre-processing

6.1 Getting started

These pages set out the process for loading, cleaning, reshaping and recoding provided datasets for calculating indicator values as a proof of concept, and creating a health profile from those indicators.

To facilitate reproducibility, the analyses are written in R as a notebook and a set of R scripts which do much of the pre-processing. The full code is available as a GitHub repository at https://github.com/julianflowers/poc. The files can be cloned (downloaded) as follows:

  • In RStudio, go to “File > New Project”

  • Click on “Version Control: Checkout a project from a version control repository”

  • Click on “Git: Clone a project from a repository”

  • Fill in the info: URL: use HTTPS address

  • Create as a subdirectory of: Browse to where you would like to create this folder
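For readers who prefer the console, the same clone can be sketched in R. This is a minimal sketch rather than part of the original workflow: it assumes git is installed and on the system path, and the destination folder `dest_dir` is a hypothetical choice.

```r
# Repository URL given in the text above
repo_url <- "https://github.com/julianflowers/poc"

# Hypothetical destination folder; change this to wherever you keep projects
dest_dir <- file.path(path.expand("~"), "poc")

# Arguments for `git clone <url> <folder>`
clone_args <- c("clone", repo_url, shQuote(dest_dir))

# Only clone in an interactive session, and only if the folder does not exist yet
if (interactive() && !dir.exists(dest_dir)) {
  status <- system2("git", clone_args)
  if (status != 0) warning("git clone did not complete successfully")
}
```

After cloning, the project can be opened in RStudio from the new folder.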

6.1.1 The indicators

Four sets of indicators are used for this PoC:

  • Methicillin-resistant Staph. aureus and third-generation cephalosporin-resistant E. coli: the proportion of samples tested which are resistant

  • Flu vaccination coverage rates

  • Smoking rates in women

  • Injury admission rates

Fully specifying the indicators

  1. The proportion of blood samples which grow either Staph. aureus or E. coli, which are tested for antibiotic sensitivity and which are found to be resistant to oxacillin or third-generation cephalosporins respectively, for time period X to Y, stratified by [area] / [time period] / [age group] / [gender]

  2. The proportion of the population which has been vaccinated against flu for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]

  3. The rate of smoking in women aged 15+ / 18-44, per 100,000 female population for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]

  4. The rate of hospitalisation for injury for the time period X to Y stratified by [area] / [time period] / [age group] / [gender]
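The two kinds of calculation behind these definitions, a proportion and a rate per 100,000 population, can be sketched as follows. The counts are invented for illustration and the column names are assumptions, not those of the provided datasets.

```r
# Illustrative counts only (not taken from the provided datasets)
toy <- data.frame(
  region     = c("Region A", "Region B"),
  tested     = c(200, 50),        # isolates tested for sensitivity
  resistant  = c(30, 10),         # isolates found to be resistant
  events     = c(120, 45),        # e.g. injury admissions
  population = c(400000, 90000)   # population denominator
)

# Proportion-type indicator: resistant isolates among those tested
toy$prop_resistant <- toy$resistant / toy$tested

# Rate-type indicator: events per 100,000 population
toy$rate_per_100k <- toy$events / toy$population * 1e5

print(toy)
```

In practice the same calculation would be repeated within each stratum (area, time period, age group, gender).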

6.1.2 Pre-processing script

As a first step we will run a script which does a number of things.

  1. Imports the datasets for each indicator

  2. Loads KSA 2022 census data by age, gender and region [1]

  3. Makes variable names consistent

  4. Recodes region names so that they match the names used in the Census 2022 data

  5. Maps directorate names (used in smoking data) to region [2]

  6. Ensures numeric variables (e.g. age) are converted to numbers

  7. Creates a set of intermediate data tables

  8. Saves a file of reshaped population data

The intermediate tables can be reused for further analysis and generating profiles.
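Steps 3, 4 and 6 can be illustrated with a small base R sketch. It is not the code in the pre-processing script: the raw column names, the variant spelling "Aseer" and the lookup vector are hypothetical, chosen only to show the pattern.

```r
# Hypothetical raw extract with inconsistent names and age stored as text
raw <- data.frame(
  `Region Name` = c("Aseer", "Riyadh", "Aseer"),
  AGE           = c("23", "41", "7"),
  Gender        = c("F", "M", "F"),
  check.names   = FALSE,
  stringsAsFactors = FALSE
)

# Step 3: make variable names consistent (lower case, underscores)
names(raw) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(raw)))

# Step 4: recode region names to the census spelling using a lookup vector
# (the mapping shown is an illustrative assumption)
region_lookup <- c("Aseer" = "`Asir")
needs_recode <- raw$region_name %in% names(region_lookup)
raw$region_name[needs_recode] <- unname(region_lookup[raw$region_name[needs_recode]])

# Step 6: convert age from text to numeric
raw$age <- as.numeric(raw$age)

str(raw)
```

A named lookup vector keeps the recoding rules in one place, which makes them easy to review and extend.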

The script can be run by typing

`source("~/proof-of-concept/scripts/pre-process.R")`

at the prompt in the console.

This generates a set of objects in the R environment:

Code
objects() 
[1] "RETICULATE_PYTHON"

Objects with dfs in their name hold the pre-processed versions of the original datasets. Objects with names in their name hold the different variable names used for age, gender and area across datasets. Those with sc in their name are used to map the locations of smoking clinics as part of recoding directorates to regions for the smoking data (see below).
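Groups of objects like these can be listed by pattern with `ls()`. The sketch below uses a toy environment, and the object names in it are hypothetical stand-ins for whatever the script actually creates.

```r
# Toy environment standing in for the session after the script has run;
# the object names are hypothetical examples of the naming pattern
env <- new.env()
assign("flu_dfs", data.frame(x = 1), envir = env)
assign("age_names", c("age", "Age"), envir = env)
assign("sc_locations", data.frame(y = 2), envir = env)

# List objects whose names contain (or start with) a given string
dfs_objects   <- ls(env, pattern = "dfs")
names_objects <- ls(env, pattern = "names")
sc_objects    <- ls(env, pattern = "^sc")

print(list(dfs = dfs_objects, names = names_objects, sc = sc_objects))
```

In a live session the `env` argument can be dropped, for example `ls(pattern = "dfs")`, to search the global environment.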

The object regional_counts_complete is a data frame of region-, age-band- and gender-specific counts for each indicator and forms the basis of indicator generation. Note that this includes region, age-band and gender combinations for which there is no data, because these combinations are not present in the original data (although they may be in the full datasets).

Code
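library(readr)  # provides read_csv()

# Load the regional counts saved by the pre-processing script
regional_counts_complete <- read_csv("data/regional_counts.csv")

# Show the first six rows as a formatted table
regional_counts_complete |>
    head() |>
    flextable::flextable()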

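Columns: Region | age_band | sum_f_Smoking | sum_f_flu | sum_f_injury | sum_m_Smoking | sum_m_flu | sum_m_injury

Rows list only non-empty cells, in column order:

`Asir | [0,5) | 140 | 160 | 26 | 0

`Asir | [5,10) | 0 | 7 | 1 | 1

`Asir | [10,15) | 5 | 4 | 355 | 2

`Asir | [15,20) | 39 | 13 | 2,744 | 4

`Asir | [20,25) | 79 | 25

`Asir | [25,30) | 1 | 76 | 21 | 22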

Table 6.1: Region age-band specific counts by gender and indicator
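One way a table like this, with half-open age bands such as [0,5) and rows for combinations that have no data, could be built is sketched below in base R. This is an assumption about the approach, not the script's actual code (which may, for example, use `tidyr::complete()`), and the records are invented.

```r
# Toy individual-level records (illustrative values only)
records <- data.frame(
  region = c("Region A", "Region A", "Region B"),
  gender = c("f", "m", "f"),
  age    = c(3, 27, 12)
)

# Half-open five-year age bands, labelled like "[0,5)"
records$age_band <- cut(records$age, breaks = seq(0, 30, by = 5), right = FALSE)
band_levels <- levels(records$age_band)

# Count records in each observed combination
records$n <- 1
observed <- aggregate(n ~ region + gender + age_band, data = records, FUN = sum)
observed$age_band <- as.character(observed$age_band)

# Every possible region x gender x age-band combination
full_grid <- expand.grid(
  region   = unique(records$region),
  gender   = c("f", "m"),
  age_band = band_levels,
  stringsAsFactors = FALSE
)

# Left join: combinations with no records are kept, with NA counts
complete_counts <- merge(full_grid, observed, all.x = TRUE)

head(complete_counts)
```

Keeping the empty combinations explicit makes it clear where a missing value reflects absent data rather than a true zero.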

6.1.3 Mapping health directorates to regions for smoking data

The smoking data provided is clinic-based data disaggregated at the level of health directorate. There are 20 directorates and 13 regions in KSA.

Available population data is at regional level, so to generate population denominators for the smoking data I have mapped directorates to regions as follows:

Code
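library(sf)       # provides read_sf()
library(ggplot2)  # provides ggplot(), aes() and geom_sf()

# Regional boundaries and smoking clinic locations
sa_bound <- read_sf("data/ksa_bound.gml")
sc_ll_sf <- read_sf("data/smok_loc.gml")

# Plot clinic locations, coloured by the name variable, over the regional boundaries
sc_ll_sf |>
    ggplot() +
    geom_sf(data = sa_bound) +
    geom_sf(aes(colour = name))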
Figure 6.1: Location map of smoking clinics mapped to KSA regional boundaries
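A point-in-polygon spatial join is one way clinic locations could be assigned to regions and a directorate-to-region lookup derived. The sketch below uses `sf::st_join()` on invented geometries. The region and directorate names are hypothetical and the coordinates are arbitrary planar values with no coordinate reference system, so it shows the technique only and is not the mapping actually used.

```r
library(sf)

# Two hypothetical square "regions" sharing an edge (arbitrary planar coordinates)
region_a <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
region_b <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
regions <- st_sf(
  region   = c("Region A", "Region B"),
  geometry = st_sfc(region_a, region_b)
)

# Hypothetical clinic points, each labelled with its directorate
clinics <- st_sf(
  directorate = c("Directorate 1", "Directorate 2", "Directorate 3"),
  geometry    = st_sfc(
    st_point(c(0.2, 0.5)),
    st_point(c(0.8, 0.3)),
    st_point(c(1.5, 0.5))
  )
)

# Attach to each clinic the region whose polygon contains it
clinic_regions <- st_join(clinics, regions, join = st_within)

# Drop the geometry to leave a plain directorate-to-region lookup table
lookup <- unique(st_drop_geometry(clinic_regions)[, c("directorate", "region")])
print(lookup)
```

With real data, both layers would need the same coordinate reference system before joining, and any directorate whose clinics fall in more than one region would need a manual decision.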

  1. Note: the census data was downloaded from https://tableau.saudicensus.sa/#/views/TA3-PopulationbydetailedAgebyRegionGovernorateNationalityandGenderAR_16850208449070/PopulationbydetailedAgebyRegionGovernorateNationalityandGenderARCSV.csv and the variable and region names were translated to English using ChatGPT4o.

  2. This uses the spatial locations of smoking clinics, which include directorate names, to map to KSA regional boundaries.

  3. https://www.moh.gov.sa/en/Ministry/Projects/TCP/Pages/default.aspx