Cleaning Education 5W data from Venezuela

Updated 7 May 2024

Author

Sean Ng

Published

January 12, 2024

A. Cleaning the 5W data


Introduction

This page documents the typical cleaning process I employ when programmatically cleaning and reporting 5W data in R. In this first section, we will go through a sample cleaning script; the next section will deal with report generation.

We will be using data from the Education Cluster in Venezuela with partner names recoded. For another example of a cleaning script for 5Ws from Kenya, please see here.

Click here to jump to the next section on 5Ws reporting.




1. Reading in the data

As is typical, the data arrives in separate files, one from each partner and one from UNICEF itself, the cluster lead:

 [1] "SOCIO 1.xlsx"                                               
 [2] "SOCIO 10.xlsx"                                              
 [3] "SOCIO 2.xlsx"                                               
 [4] "SOCIO 3.xlsx"                                               
 [5] "SOCIO 4.xlsx"                                               
 [6] "SOCIO 5.xlsx"                                               
 [7] "SOCIO 6.xlsx"                                               
 [8] "SOCIO 7.xlsx"                                               
 [9] "SOCIO 8.xlsx"                                               
[10] "SOCIO 9.xlsx"                                               
[11] "UNICEF - Fondo de las Naciones Unidas para la Infancia.xlsx"

We start with a bit of organisation by setting the column types and names for the data that we are reading in. Even if we were working in Spanish, we would still need to clean up the column names as they can be overly long or can contain instructions.

# specifying column types 
df_col_types <- c("date", "date", "text", "text","text","text",
                   "text","text","text","text","text","text","text",
                   "numeric", "numeric", "text","text","text",
                   "text","text","text", "numeric", "text", 
                   "date", "date", "text",
                   "numeric", "numeric", "numeric", "numeric",
                   "numeric", "numeric", "numeric", "numeric",
                   "numeric", "numeric", "numeric", "numeric",
                   "numeric", "numeric", "numeric")

df_names <- c("date_reported",
               "month_reported",
               "project_code",
               "project_name",
               "lead_organisation",
               "implementing_partner",
               "state",
               "pcode1",
               "municipality",
               "pcode2",
               "parrish",
               "pcode3",
               "location",
               "latitude",
               "longitude",
               "activity_full",
               "activity_code",
               "activity_description",
               "activity_code_descrption",
               "activity_covid",
               "unit",
               "quantity_mensual",
               "recurrent_beneficiaries",
               "start_date",
               "end_date",
               "activity_status",
               "total_beneficiaries",
               "percent_indigenous",
               "percent_pwd",
               "m_0_3",
               "m_3_6",
               "m_7_12",
               "m_12_17",
               "m_18_19",
               "f_0_3",
               "f_3_6",
               "f_7_12",
               "f_12_17",
               "f_18_19",
               "f_19_and_older",
               "m_19_and_older")

Now we will write a function to read in the data and then consolidate all the files. The loop at the end combines all the cleaned files together into one dataframe.

# Function for reading in 5Ws files

read_5ws <- function(file){
  
  read_excel(file, 
             sheet = 1, 
             col_types = df_col_types)|>
    janitor::clean_names()|>
    setNames(df_names) |> 
    # Cleaning the date
    mutate(month_reported = ceiling_date(month_reported, "month") - days(1))
  
  }
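
The final mutate() snaps each month_reported value to the last day of its month. A quick illustration of the idiom, with a caveat worth knowing:

# mid-month dates snap to the month's end
ceiling_date(as.Date("2020-11-15"), "month") - days(1)
#> [1] "2020-11-30"

# caveat: instants already on the boundary are not rounded up by default,
# so an exact first-of-month lands at the end of the previous month
ceiling_date(as.Date("2020-11-01"), "month") - days(1)
#> [1] "2020-10-31"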

## Path to 5Ws
# This is the folder where the 5Ws from the current round are stored. 
# Because we are working inside an RStudio project (with the `here` package), 
# we can use project-relative paths, which greatly aids reproducibility

path_5ws <- "./data"

# Change the folder to the correct one
# The files will be read from here 
file_list <- list.files(path = path_5ws, 
                        recursive = TRUE, 
                        pattern = "\\.xlsx$", 
                        full.names = TRUE) 

# Filtering out Excel lock files (the ~$ copies created while a workbook is open)
file_list <- file_list[!str_detect(file_list, "\\~")]

# For the loop to work, let us initialise an empty data frame
df_raw <- data.frame()

# Loop over the files, appending each one to df_raw
for (i in seq_along(file_list)) {
  df_raw_i <- read_5ws(file_list[i])
  df_raw <- rbind(df_raw, df_raw_i)
}
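
As an aside, the same consolidation can be done without the explicit loop. A one-line sketch with purrr (loaded with the tidyverse):

# read and row-bind all files in one call
df_raw <- map_dfr(file_list, read_5ws)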


Let us take a look at the consolidated file. From the output below, we can see that there are 12,054 rows and 41 columns in the dataset. The output below shows column names, column data types (in the angle brackets) and a preview of the data in each column.

df_raw |> 
  glimpse()
Rows: 12,054
Columns: 41
$ date_reported            <dttm> 2020-12-01, 2020-12-01, 2020-12-01, 2020-12-…
$ month_reported           <dttm> 2020-11-30, 2020-11-30, 2020-11-30, 2020-11-…
$ project_code             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ project_name             <chr> "Educación para todas y todos: continuidad ed…
$ lead_organisation        <chr> "UNICEF - Fondo de las Naciones Unidas para l…
$ implementing_partner     <chr> "SOCIO 1", "SOCIO 1", "SOCIO 1", "SOCIO 1", "…
$ state                    <chr> "BOLÍVAR", "BOLÍVAR", "BOLÍVAR", "BOLÍVAR", "…
$ pcode1                   <chr> "VE07", "VE07", "VE07", "VE07", "VE07", "VE07…
$ municipality             <chr> "CARONÍ", "CARONÍ", "CARONÍ", "CARONÍ", "CARO…
$ pcode2                   <chr> "VE0701", "VE0701", "VE0701", "VE0701", "VE07…
$ parrish                  <chr> "ONCE DE ABRIL", "ONCE DE ABRIL", "VISTA AL S…
$ pcode3                   <chr> "VE070104", "VE070104", "VE070108", "VE070108…
$ location                 <chr> "U.E.C. María Teresa Del Toro", "E.B.C FE Y A…
$ latitude                 <dbl> 8.383985, 8.377784, 8.355650, 8.343790, 8.331…
$ longitude                <dbl> -62.61942, -62.62054, -62.61467, -62.60385, -…
$ activity_full            <chr> "CLEDU/CA2.08: Educación a distancia", "CLEDU…
$ activity_code            <chr> "CA2.08", "CA2.08", "CA2.08", "CA2.08", "CA2.…
$ activity_description     <chr> "Educación a distancia", "Educación a distanc…
$ activity_code_descrption <chr> "CA2.08: Educación a distancia", "CA2.08: Edu…
$ activity_covid           <chr> "Sí", "Sí", "Sí", "Sí", "Sí", "Sí", "Sí", "Sí…
$ unit                     <chr> "#Personas", "#Personas", "#Personas", "#Pers…
$ quantity_mensual         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ recurrent_beneficiaries  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ start_date               <dttm> 2020-11-02, 2020-11-02, 2020-11-02, 2020-11-…
$ end_date                 <dttm> 2020-11-30, 2020-11-30, 2020-11-30, 2020-11-…
$ activity_status          <chr> "En ejecución", "En ejecución", "En ejecución…
$ total_beneficiaries      <dbl> 203, 998, 962, 921, 180, 1178, 806, 913, 688,…
$ percent_indigenous       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ percent_pwd              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ m_0_3                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ m_3_6                    <dbl> 17, NA, 49, 44, 82, 113, 37, 43, 125, NA, NA,…
$ m_7_12                   <dbl> 83, 354, 289, 175, NA, 330, 198, 258, 265, 17…
$ m_12_17                  <dbl> NA, 169, 169, 231, NA, 150, 154, 174, NA, NA,…
$ m_18_19                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ f_0_3                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ f_3_6                    <dbl> 26, NA, 48, 47, 98, 99, 26, 38, 99, NA, NA, 7…
$ f_7_12                   <dbl> 77, 324, 267, 175, NA, 311, 211, 244, 199, 17…
$ f_12_17                  <dbl> NA, 151, 140, 249, NA, 175, 180, 156, NA, NA,…
$ f_18_19                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ f_19_and_older           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ m_19_and_older           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Setting the column types seems to have paid off: the date-time columns have been parsed correctly and need no further cleaning.



2. String Manipulation and Recoding

2a. Recoding erroneous data

Errors are part of the 5W process, as they are of any data entry process. However, R makes it easy to recode these errors programmatically. Below is the range of responses found in the recurrent_beneficiaries column, which indicates whether beneficiaries have been reached before.

Recurrent beneficiaries responses

     no      No      Si       …       …
      6   4,937       8       1   6,408

We will recode these as yes/no. We will do something similar for other columns, such as activity_status:

Activity status responses

  En ejecucion  en ejecución  en Ejecución  En ejecución  En Ejecución  finalizada  Finalizada
             2             1             1         6,453           107           1       3,257
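
Tallies like the two above are quick to produce; for instance:

table(df_raw$recurrent_beneficiaries)
table(df_raw$activity_status)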

To resolve this, we will simply recode the erroneous values. We have also cleaned out some of the NAs from the recurrent_beneficiaries and activity_covid columns:

df_raw |> 
  mutate(recurrent_beneficiaries = (recode(recurrent_beneficiaries, 
            "no" = "no", "No" = "no", 
            "Si" = "yes", "si" = "yes", "sí" = "yes", "Sí" = "yes")),  
           activity_covid = recode(activity_covid, 
            "no" = "no", "No" = "no", 
            "Si" = "yes", "si" = "yes", "sí" = "yes", "Sí" = "yes")) |>
  # recoding NAs
  replace_na(list(recurrent_beneficiaries = "no", activity_covid = "no")) |>
  mutate(activity_status = str_replace_all(activity_status, 
                                           c("En ejecucion" = "ejecucion", 
                                             "en ejecución" = "ejecucion",
                                             "en Ejecución" = "ejecucion",
                                             "En ejecución" = "ejecucion",
                                             "En Ejecución" = "ejecucion",
                                             "finalizada" = "finalizada",
                                             "Finalizada" = "finalizada")))



2b. Removing accents and standardising case

This dataset is from Venezuela and contains numerous tildes and diacritical marks, but their use is not consistent across partners. Although proper Spanish spelling uses diacritics, it is easiest to remove them all before working in code. Implementing partners pay little attention to them either way, and submissions contain a wide variety of spellings, especially for place names; we will clean up the activity descriptions while we are at it.

Let us first write a simple function to remove diacritical marks; we will pair it with str_to_lower(), since R is case-sensitive.

# Function to remove accents
# stri_trans_general is from `stringi` and is typically used in 
# script-to-script conversions and transliterations 
# https://www.rdocumentation.org/packages/stringi/versions/1.8.3/topics/stri_trans_general
rm_accent <- function(x){
  stri_trans_general(x, "Latin-ASCII")
}
# removing accents
# and str_to_lower() since R is case-sensitive
df_raw %>% 
  mutate(state    = rm_accent(str_to_lower(state)), 
         municipality = rm_accent(str_to_lower(municipality)),
         parrish = rm_accent(str_to_lower(parrish)),
         location = rm_accent(str_to_lower(location)),
         activity_description = rm_accent(str_to_lower(activity_description)))
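
A quick sanity check of the helper:

rm_accent(str_to_lower("BOLÍVAR"))
#> [1] "bolivar"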



2c. One function for reading and cleaning

We will now combine all this recoding into one function, organising our code better and improving readability. In dirtier datasets, partner names sometimes need to be cleaned too (that was done here when the dataset was anonymised). To prevent such data entry errors at the source, we should simply have been stricter about using dropdown lists.

In fact, let us fold all these steps into the reading function, ahead of the consolidation loop, so that data is cleaned as it is read into the consolidated dataset. Let us first amend the reading function:

# If you haven't run it yet
rm_accent <- function(x){
  stri_trans_general(x, "Latin-ASCII")
}

read_and_clean_5ws <- function(file){
  
  read_excel(file, 
             sheet = 1, 
             col_types = df_col_types)|>
    janitor::clean_names()|>
    setNames(df_names) |>
    mutate(recurrent_beneficiaries = (recode(recurrent_beneficiaries, 
            "no" = "no", "No" = "no", 
            "Si" = "yes", "si" = "yes", "sí" = "yes", "Sí" = "yes")),  
           activity_covid = recode(activity_covid, 
            "no" = "no", "No" = "no", 
            "Si" = "yes", "si" = "yes", "sí" = "yes", "Sí" = "yes")) |>
    # recoding NAs
    replace_na(list(recurrent_beneficiaries = "no", activity_covid = "no")) |>
    mutate(activity_status = str_replace_all(activity_status, 
                                             c("En ejecucion" = "ejecucion", 
                                               "en ejecución" = "ejecucion",
                                               "en Ejecución" = "ejecucion",
                                               "En ejecución" = "ejecucion",
                                               "En Ejecución" = "ejecucion",
                                               "finalizada" = "finalizada",
                                               "Finalizada" = "finalizada"))) |> 
    mutate(state    = rm_accent(str_to_lower(state)), 
           municipality = rm_accent(str_to_lower(municipality)),
           parrish = rm_accent(str_to_lower(parrish)),
           location = rm_accent(str_to_lower(location)),
           activity_description = rm_accent(str_to_lower(activity_description))) |> 
    # Cleaning the date
    mutate(month_reported = ceiling_date(month_reported, "month") - days(1))
  
  }

path_5ws <- "./data"

# Change the folder to the correct one
# The files will be read from here 
file_list <- list.files(path = path_5ws, 
                        recursive = TRUE, 
                        pattern = "\\.xlsx$", 
                        full.names = TRUE) 

# Filtering out Excel lock files (the ~$ copies created while a workbook is open)
file_list <- file_list[!str_detect(file_list, "\\~")]

# Re-initialising the empty dataframe
df_raw <- data.frame()

# Let us repeat the loop, with the new reading and cleaning function
for (i in seq_along(file_list)) {
  df_raw_i <- read_and_clean_5ws(file_list[i])
  df_raw <- rbind(df_raw, df_raw_i)
}



3. Cleaning locations

Now that we have a consolidated, semi-clean dataset, let us now clean geographic and location data.

3a. Reading in the locations reference

We will first read in locations.csv. This file comes from the official gazette of administrative areas and is typically available on HDX. Since I was using it for the Education Cluster, the locations.csv dataset also contains school names and school codes, in addition to the official state, municipality and parish names and pcodes.

locations <- read_csv("locations.csv", show_col_types = FALSE) %>%
  # Using the rm_accent() function from earlier
  mutate(state  = rm_accent(str_to_lower(estado)), # just to make sure 
         municipality = rm_accent(str_to_lower(municipio)),
         parrish = rm_accent(str_to_lower(parroquia)),
         location = rm_accent(str_to_lower(ubicacion))) |> 
  select(-estado, -municipio, -parroquia, -ubicacion)

We will also use an anti-join to identify new project sites that should be added to the location reference. We won't do this in this exercise, but updating the project site database is important for verification, monitoring and accountability to affected populations.

# see if you need to update locations 
# new locations not in locations.csv will be caught by anti_join()
locations_add <- df_raw %>% 
  select(state, pcode1, municipality, pcode2, parrish, pcode3, location,
         latitude, longitude) %>% 
  distinct() %>% 
  anti_join(locations, by = "location")



3b. Splitting the dataset into clean and dirty

Let us identify the incomplete locations. We can see from the output that there are 557 rows with missing geographic data.

df_raw |> 
  filter(is.na(state) | is.na(pcode1) |
         is.na(municipality) | is.na(pcode2) |
         is.na(parrish) | is.na(pcode3)) |> 
  nrow()
[1] 557
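
For reference, the same filter can be written more compactly with dplyr's if_any():

df_raw |> 
  filter(if_any(c(state, pcode1, municipality, pcode2, parrish, pcode3), is.na)) |> 
  nrow()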

We will split off these rows to be cleaned and combine them back in later.

# splitting into two datasets, one clean, one dirty
locations_dirty <- df_raw %>% 
  filter(is.na(state) | is.na(pcode1) |
         is.na(municipality) | is.na(pcode2) |
         is.na(parrish) | is.na(pcode3))

locations_clean <- df_raw %>% 
  filter(!is.na(state) & !is.na(pcode1) &
         !is.na(municipality) & !is.na(pcode2) &
         !is.na(parrish) & !is.na(pcode3)) 


3c. Filling in missing values with coalesce()

The locations.csv reference dataset is used to fill in and rewrite locations_dirty. The distinct() call at the end ensures that the join has not introduced duplicates. After that, we combine locations_clean and locations_dirty into df_raw_locations.


# filling in missing values
locations_dirty <- locations_dirty %>% 
  left_join(locations, by = "location") %>% 
  mutate(state    = coalesce(state.x, state.y),
         pcode1    = coalesce(pcode1.x, pcode1.y),
         municipality = coalesce(municipality.x, municipality.y),
         pcode2    = coalesce(pcode2.x, pcode2.y),
         parrish = coalesce(parrish.x, parrish.y),
         pcode3    = coalesce(pcode3.x, pcode3.y)) %>% 
  select(-state.x, -state.y, -pcode1.x, -pcode1.y,
         -municipality.x, -municipality.y, -pcode2.x, -pcode2.y,
         -parrish.x, -parrish.y, -pcode3.x, -pcode3.y) %>% 
  distinct() # removing duplicate rows from the join 

# bind_rows() does not care about column sequence 
# but put locations_clean first so that the original order is preserved
df_raw_locations <- bind_rows(locations_clean, locations_dirty)
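
One extra check worth adding here (my suggestion, not part of the original script): confirm the join has not left stray reference columns behind before binding, since bind_rows() silently fills any mismatch with NAs.

# any names printed here are columns the locations join added
setdiff(names(locations_dirty), names(locations_clean))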



3d. Error checking the join

Let us print the sum of total_beneficiaries and the number of rows in df_raw (the original dataset) and df_raw_locations (updated, with the locations cleaned).

# error checking
rbind(sum(df_raw$total_beneficiaries, na.rm = TRUE),
      sum(df_raw_locations$total_beneficiaries, na.rm = TRUE),
      nrow(df_raw),
      nrow(df_raw_locations))
         [,1]
[1,] 28965878
[2,] 28965878
[3,]    12054
[4,]    12054

Everything seems in order, and we have not duplicated or dropped any rows from the original dataset.




4. Beneficiary disaggregation

Let us next check the beneficiary disaggregations by comparing the sum of all disaggregated beneficiaries against the reported total. We will mutate two new columns: total_disagg_ben, the sum of all sex- and age-disaggregated beneficiaries, and no_disagg_ben, for beneficiaries reported without sex and age information. Depending on each cluster's reporting standards, the disaggregated and non-disaggregated numbers will have different uses.

# mutating total_disagg_ben and no_disagg_ben columns 
df_raw_locations <- df_raw_locations %>% 
  mutate(total_disagg_ben = select(., m_0_3:m_19_and_older) %>%  rowSums(na.rm = TRUE), 
         no_disagg_ben = round(total_beneficiaries) - round(total_disagg_ben)) 
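
The select(., ...) inside mutate() is an older magrittr idiom; with current dplyr, the same row sum is usually written with across(). An equivalent sketch:

# equivalent to the chunk above
df_raw_locations %>% 
  mutate(total_disagg_ben = rowSums(across(m_0_3:m_19_and_older), na.rm = TRUE),
         no_disagg_ben = round(total_beneficiaries) - round(total_disagg_ben))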

Let us print the rows where the sum of disaggregated beneficiaries (total_disagg_ben) does not match the totals reported in the total_beneficiaries column, so that we can return to the relevant parties for follow up.

Undisaggregated beneficiaries

implementing_partner | activity_description | location | total_disagg_ben | total_beneficiaries | no_disagg_ben
SOCIO 2 | promocion mensajes claves para la comunidad escolar | u.e.c. divina pastora (catedral) | 0 | 244 | 244
SOCIO 4 | iniciativas para reinsercion educativa de nna fuera de la escuela | u.e. san jose de cotiza | 18 | 26 | 8
SOCIO 6 | formacion docente y otro personal educativo | ministerio del poder popular para la educacion | 0 | 398 | 398
SOCIO 6 | formacion docente y otro personal educativo | cluster de educacion | 0 | 146 | 146


In this scenario, we would have returned to these three partners and asked for corrections.




5. Outputs

5a. Cleaned wide version

We can now share a clean version of the wide file with OCHA, who will generally prefer a wide table in the same shape as the 5W form:

# turn on eval if you want the file
write_csv(df_raw_locations, paste0(format(Sys.time(), "%y%m%d_%H%M_"), "5W_wide.csv"))
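
The format(Sys.time(), ...) prefix stamps the filename so that successive rounds do not overwrite each other:

# e.g. a file written at 09:30 on the publication date:
format(Sys.time(), "%y%m%d_%H%M_")
#> [1] "240112_0930_"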



5b. Unique beneficiaries

This section pivots the dataset longer into a tidy format; normal 5W templates tend to be wide, with the totals and beneficiary disaggregations all in one row. We use a long format to make plotting and analysis easier.

We will also create the dataset unique_ben to aid in our estimations of the actual reach of the programme. This is distinct from the concept of a beneficiary frequency, which is one person reached by one activity and does not account for double counting.

Unique beneficiaries are the individuals who have been reached by at least one programme activity. Frequencies are important for budgetary reasons and to show the full scale of the programme. Unique beneficiaries show a more accurate picture of the programme’s footprint and are very useful for coordinating activities with other clusters.

The gold standard for avoiding double counting is the deployment of a beneficiary database. However, most agencies are not able to do this. In lieu of that, to avoid double counting and estimate the number of unique beneficiaries, we will keep only the highest number of beneficiaries reached at each project site (location).

At this point, we will also filter out activity CA1.05, the promotion of key messages for the education community; specifically, we will drop all rows where the location is “radio” or “fm station”. Mass messaging activities do not typically count towards beneficiary totals, as they are quite light-touch, especially in comparison to education kits, school feeding or work with out-of-school children. If need be, we can always analyse the radio messaging activity on its own.

Ideally, we would codify all this by developing indicator guidance for the cluster, though it took the office a few years to systematise this.

# pivoting longer and creating new dataframe
unique_ben <- df_raw_locations |> 
  filter(!str_detect(location, "radio|fm")) |> 
  select(-total_disagg_ben) |> 
  pivot_longer(m_0_3:no_disagg_ben, 
               names_to = "disaggregation", values_to = "beneficiaries") |>  
  filter(recurrent_beneficiaries == "no") |> 
  filter(beneficiaries != 0) %>% # empty cells are 0 in the 5W table
  group_by(location, disaggregation)|> 
  slice(which.max(beneficiaries)) |>  
  ungroup() |>  
  select(month_reported, state, pcode1, municipality, pcode2,
       parrish, pcode3, location, latitude, longitude, 
       disaggregation, beneficiaries)
# writing csv of unique_ben with datestamp
# turn on eval if you want the file 
write_csv(unique_ben, paste0(format(Sys.time(), "%y%m%d_%H%M_"), "unique_ben.csv"))
# Without date stamp
write_csv(unique_ben, "unique_ben.csv")
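
Incidentally, slice(which.max(beneficiaries)) keeps exactly one row per location-disaggregation group; in current dplyr the same step is usually written with slice_max(). A toy demonstration:

# slice_max() keeps the single largest row per group
tibble(location = c("a", "a", "b"), beneficiaries = c(10, 25, 5)) |> 
  group_by(location) |> 
  slice_max(beneficiaries, n = 1, with_ties = FALSE)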



5c. Beneficiaries by activity

Let us also make the dataset activity_ben for activity-wise analysis. As with unique_ben, recurrent beneficiaries have been filtered out and the dataset has been pivoted longer into a tidy format.

Please note that there is double counting in this dataset as beneficiaries might have been reached by more than one activity. For geographic analysis, please use unique_ben.

# This is the dataset for beneficiaries
activity_ben <- df_raw_locations |> 
  filter(!str_detect(location, "radio|fm")) |> 
  select(-total_disagg_ben) |> 
  pivot_longer(m_0_3:no_disagg_ben, 
               names_to = "disaggregation", values_to = "beneficiaries") |>  
  filter(recurrent_beneficiaries == "no") |> 
  filter(beneficiaries != 0) %>% # empty cells are 0 in the 5W table
  select(month_reported, project_name, implementing_partner, 
         state, pcode1, municipality, pcode2, 
         parrish, pcode3, location, latitude, longitude, 
         activity_code, activity_description, activity_covid,
         disaggregation, beneficiaries)
# writing csv activity_ben with datestamp
# turn on eval if you want the file 

write_csv(activity_ben, paste0(format(Sys.time(), "%y%m%d_%H%M_"), "activity_ben.csv"))
# Without date stamp
write_csv(activity_ben, "activity_ben.csv")



5d. Comparing unique_ben and activity_ben

Printed below are the number of unique beneficiaries, the number of beneficiary frequencies, and the difference between the two figures.
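
The chunk that produced these figures was not echoed; it would look something like this, assuming the unique_ben and activity_ben datasets from the previous two sections:

rbind(sum(unique_ben$beneficiaries, na.rm = TRUE),
      sum(activity_ben$beneficiaries, na.rm = TRUE),
      sum(unique_ben$beneficiaries, na.rm = TRUE) -
        sum(activity_ben$beneficiaries, na.rm = TRUE))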


          [,1]
[1,]  728408.4
[2,] 1071144.0
[3,] -342735.6


Click through to the next part, where we will go through 5W reporting and make use of the cleaned datasets.