```{r}
library(dplyr)
library(skimr)
```
# 18 :orange_book: Encode categorical data
## 19 Introduction
### 19.1 Overview
Most ML algorithms only work with numeric values. In this notebook, we will present typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.
### 19.2 Learning objectives
By the end of this notebook, you will have:

- seen two common strategies for encoding categorical features: ordinal encoding and one-hot encoding;
- used a pipeline to apply a one-hot encoder before fitting a logistic regression.
### 19.3 Identify categorical data
Let's first load the full dataset, which contains both numerical and categorical data.
```{r}
df <- openxlsx::read.xlsx("./data/dataset2.xlsx")
```
Examine the structure of the data, including variable names and labels.
```{r}
# Write your code here
```
```{r}
df %>%
  skimr::skim()
```
| Name | Piped data |
|---|---|
| Number of rows | 10308 |
| Number of columns | 39 |
| Column type frequency: | |
| character | 5 |
| numeric | 34 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| CTX_district | 0 | 1.0 | 5 | 9 | 0 | 3 | 0 |
| CTX_facility_ID | 0 | 1.0 | 5 | 5 | 0 | 18 | 0 |
| CTX_area | 0 | 1.0 | 5 | 5 | 0 | 2 | 0 |
| CTX_facility_type | 0 | 1.0 | 10 | 13 | 0 | 2 | 0 |
| RX_free_text | 9279 | 0.1 | 3 | 220 | 0 | 729 | 0 |
Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| child_ID | 0 | 1.00 | 5154.50 | 2975.81 | 1.0 | 2577.75 | 5154.5 | 7731.25 | 10308.0 |
| CTX_month | 0 | 1.00 | 9.81 | 1.28 | 7.0 | 9.00 | 10.0 | 11.00 | 12.0 |
| CTX_day_of_week | 0 | 1.00 | 1.88 | 1.47 | 0.0 | 0.00 | 2.0 | 3.00 | 6.0 |
| SDC_sex | 4 | 1.00 | 1.49 | 0.50 | 1.0 | 1.00 | 1.0 | 2.00 | 2.0 |
| SDC_age_in_months | 0 | 1.00 | 18.75 | 14.90 | 0.0 | 7.00 | 15.0 | 27.00 | 59.0 |
| CLIN_fever | 0 | 1.00 | 0.84 | 3.74 | 0.0 | 0.00 | 1.0 | 1.00 | 98.0 |
| CLIN_fever_onset | 3083 | 0.70 | 2.50 | 1.93 | 0.0 | 1.00 | 2.0 | 3.00 | 14.0 |
| CLIN_cough | 0 | 1.00 | 0.69 | 3.75 | 0.0 | 0.00 | 1.0 | 1.00 | 98.0 |
| CLIN_diarrhoea | 0 | 1.00 | 0.41 | 4.32 | 0.0 | 0.00 | 0.0 | 0.00 | 98.0 |
| MEAS_temperature | 9271 | 0.10 | 37.08 | 0.98 | 34.5 | 36.50 | 37.0 | 37.50 | 42.5 |
| TEST_malaria_done | 0 | 1.00 | 0.56 | 0.50 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| TEST_malaria_type | 4550 | 0.56 | 1.85 | 8.72 | 1.0 | 1.00 | 1.0 | 1.00 | 98.0 |
| TEST_malaria_result | 4550 | 0.56 | 1.20 | 9.93 | 0.0 | 0.00 | 0.0 | 0.00 | 98.0 |
| DX_malaria | 0 | 1.00 | 0.17 | 0.38 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| DX_malaria_severe | 0 | 1.00 | 0.01 | 0.12 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| DX_count_non_malaria | 0 | 1.00 | 0.84 | 0.71 | 0.0 | 0.00 | 1.0 | 1.00 | 5.0 |
| DX_severe | 0 | 1.00 | 0.02 | 0.16 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_preconsult_antibiotics | 0 | 1.00 | 0.17 | 0.37 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_preconsult_antimalarials | 0 | 1.00 | 0.04 | 0.20 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_antimalarials | 0 | 1.00 | 0.13 | 0.33 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_antimalarial_parenteral | 0 | 1.00 | 0.02 | 0.15 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_ACT | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_AL | 0 | 1.00 | 0.12 | 0.32 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_artemether | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_artesunate | 0 | 1.00 | 0.02 | 0.15 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_primaquine | 0 | 1.00 | 0.00 | 0.05 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_quinine | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_antibiotics | 0 | 1.00 | 0.53 | 0.50 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| RX_antibiotics_src_text | 0 | 1.00 | 0.01 | 0.10 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_count_other | 0 | 1.00 | 1.49 | 1.30 | 0.0 | 1.00 | 1.0 | 2.00 | 16.0 |
| MGMT_referral_src_caregiver | 0 | 1.00 | 0.22 | 4.50 | 0.0 | 0.00 | 0.0 | 0.00 | 98.0 |
| MGMT_referral_src_registry | 0 | 1.00 | 0.01 | 0.10 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CTX_SPA_obs | 0 | 1.00 | 0.06 | 0.23 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CTX_SPA_obs_after_lab | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
### 19.4 Ordinal encoding
The most intuitive strategy is to encode each category with a different number. This implies an order among the resulting categories (e.g. 0 < 1 < 2). If a categorical variable does not carry any meaningful order information, this encoding can mislead downstream statistical models, and you might consider using one-hot encoding instead (see below).
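As a minimal sketch in base R (using a small made-up vector rather than the dataset above), an ordinal encoding can be obtained by declaring the level order of a factor and then taking its integer codes:

```{r}
# Hypothetical ordered categories
sizes <- c("small", "large", "medium", "small")

# Declare the order explicitly, then map each category to its integer rank
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
as.integer(sizes_fct)  # small = 1, medium = 2, large = 3
```

Note that the numeric codes only carry meaning because we stated the level order ourselves; the data alone does not tell R that "small" comes before "large".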
### 19.5 Encoding nominal categories (without assuming any order)
One-hot encoding is an alternative encoding that prevents downstream models from making a false assumption about the ordering of categories. For a given feature, it creates as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category is set to 1, while the columns of all other categories are set to 0.
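A minimal sketch with base R, on a toy data frame rather than the dataset above: `model.matrix()` with the intercept removed creates one 0/1 indicator column per category:

```{r}
# Hypothetical nominal variable
toy <- data.frame(area = c("urban", "rural", "urban"))

# `- 1` drops the intercept so every category gets its own 0/1 column
model.matrix(~ area - 1, data = toy)
```

Each row contains a single 1 (in the column matching its category) and 0 elsewhere, which is exactly the one-hot pattern described above.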
We see that encoding a single feature gives a table full of zeros and ones. Sparse matrices are efficient data structures when most of your matrix elements are zero. If you want more details about them, you can look at this introduction to sparse matrices (https://scipy-lectures.org/advanced/scipy_sparse/introduction.html#why-sparse-matrices).
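In R, a sparse one-hot matrix can be produced with the Matrix package (a sketch on made-up data; `sparse.model.matrix()` stores only the non-zero entries instead of the full table of zeros):

```{r}
library(Matrix)

set.seed(42)
# Hypothetical high-cardinality categorical variable
toy <- data.frame(id = factor(sample(letters, 100, replace = TRUE)))

# Sparse dummy matrix: one column per observed category, non-zeros only
X <- sparse.model.matrix(~ id - 1, data = toy)
dim(X)  # 100 rows, one column per distinct category
```

For a variable with many categories, the sparse representation avoids materialising the large block of zeros that a dense one-hot matrix would contain.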
### 19.6 Choosing an encoding strategy
Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).
In general, one-hot encoding is the strategy used when the downstream model is a linear model, while ordinal encoding is often a good strategy with tree-based models. Using an ordinal encoder outputs ordinal categories, i.e. it imposes an order on them. The impact of violating this ordering assumption depends on the downstream model: linear models will be affected by misordered categories, while tree-based models will not. One-hot encoding categorical variables with high cardinality can also cause computational inefficiency in tree-based models; because of this, one-hot encoding is not recommended in such cases, even if the original categories do not have a given order.
You can still use an ordinal encoding with linear models, but you need to be sure that:

- the original categories (before encoding) have an ordering;
- the encoded categories follow the same ordering as the original categories.
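A small base R illustration of why the second point matters: by default, `factor()` sorts levels alphabetically, so the integer codes may not follow the real order unless you state the levels explicitly (made-up values for illustration):

```{r}
x <- c("low", "high", "medium")

# Default: levels are sorted alphabetically (high, low, medium),
# so the integer codes do NOT follow the real low < medium < high order
as.integer(factor(x))

# Fix: state the ordering explicitly before encoding
as.integer(factor(x, levels = c("low", "medium", "high")))
```

With the default alphabetical levels, a linear model would treat "high" as the smallest category; stating the levels explicitly restores the intended ordering.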