```{r}
library(dplyr)
library(skimr)
```
# 18 :orange_book: Encode categorical data
## 19 Introduction
### 19.1 Overview
Most ML algorithms only work with numeric values. In this notebook, we will present typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.
### 19.2 Learning objectives
By the end of this notebook, you will have:

- seen two common strategies for encoding categorical features: ordinal encoding and one-hot encoding;
- used a pipeline to apply a one-hot encoder before fitting a logistic regression.
### 19.3 Identify categorical data
Let's first load the full dataset, which contains both numerical and categorical data.
```{r}
df <- openxlsx::read.xlsx("./data/dataset2.xlsx")
```
Examine the structure of the data, including variable names and labels.
```{r}
# Write your code here
```
```{r}
df %>%
  skimr::skim()
```
| Name | Piped data |
|---|---|
| Number of rows | 10308 |
| Number of columns | 39 |
| Column type frequency: | |
| character | 5 |
| numeric | 34 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| CTX_district | 0 | 1.0 | 5 | 9 | 0 | 3 | 0 |
| CTX_facility_ID | 0 | 1.0 | 5 | 5 | 0 | 18 | 0 |
| CTX_area | 0 | 1.0 | 5 | 5 | 0 | 2 | 0 |
| CTX_facility_type | 0 | 1.0 | 10 | 13 | 0 | 2 | 0 |
| RX_free_text | 9279 | 0.1 | 3 | 220 | 0 | 729 | 0 |
Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| child_ID | 0 | 1.00 | 5154.50 | 2975.81 | 1.0 | 2577.75 | 5154.5 | 7731.25 | 10308.0 |
| CTX_month | 0 | 1.00 | 9.81 | 1.28 | 7.0 | 9.00 | 10.0 | 11.00 | 12.0 |
| CTX_day_of_week | 0 | 1.00 | 1.88 | 1.47 | 0.0 | 0.00 | 2.0 | 3.00 | 6.0 |
| SDC_sex | 4 | 1.00 | 1.49 | 0.50 | 1.0 | 1.00 | 1.0 | 2.00 | 2.0 |
| SDC_age_in_months | 0 | 1.00 | 18.75 | 14.90 | 0.0 | 7.00 | 15.0 | 27.00 | 59.0 |
| CLIN_fever | 0 | 1.00 | 0.84 | 3.74 | 0.0 | 0.00 | 1.0 | 1.00 | 98.0 |
| CLIN_fever_onset | 3083 | 0.70 | 2.50 | 1.93 | 0.0 | 1.00 | 2.0 | 3.00 | 14.0 |
| CLIN_cough | 0 | 1.00 | 0.69 | 3.75 | 0.0 | 0.00 | 1.0 | 1.00 | 98.0 |
| CLIN_diarrhoea | 0 | 1.00 | 0.41 | 4.32 | 0.0 | 0.00 | 0.0 | 0.00 | 98.0 |
| MEAS_temperature | 9271 | 0.10 | 37.08 | 0.98 | 34.5 | 36.50 | 37.0 | 37.50 | 42.5 |
| TEST_malaria_done | 0 | 1.00 | 0.56 | 0.50 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| TEST_malaria_type | 4550 | 0.56 | 1.85 | 8.72 | 1.0 | 1.00 | 1.0 | 1.00 | 98.0 |
| TEST_malaria_result | 4550 | 0.56 | 1.20 | 9.93 | 0.0 | 0.00 | 0.0 | 0.00 | 98.0 |
| DX_malaria | 0 | 1.00 | 0.17 | 0.38 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| DX_malaria_severe | 0 | 1.00 | 0.01 | 0.12 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| DX_count_non_malaria | 0 | 1.00 | 0.84 | 0.71 | 0.0 | 0.00 | 1.0 | 1.00 | 5.0 |
| DX_severe | 0 | 1.00 | 0.02 | 0.16 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_preconsult_antibiotics | 0 | 1.00 | 0.17 | 0.37 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_preconsult_antimalarials | 0 | 1.00 | 0.04 | 0.20 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_antimalarials | 0 | 1.00 | 0.13 | 0.33 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_antimalarial_parenteral | 0 | 1.00 | 0.02 | 0.15 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_ACT | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_AL | 0 | 1.00 | 0.12 | 0.32 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_artemether | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_artesunate | 0 | 1.00 | 0.02 | 0.15 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_primaquine | 0 | 1.00 | 0.00 | 0.05 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_quinine | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_antibiotics | 0 | 1.00 | 0.53 | 0.50 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| RX_antibiotics_src_text | 0 | 1.00 | 0.01 | 0.10 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| RX_count_other | 0 | 1.00 | 1.49 | 1.30 | 0.0 | 1.00 | 1.0 | 2.00 | 16.0 |
| MGMT_referral_src_caregiver | 0 | 1.00 | 0.22 | 4.50 | 0.0 | 0.00 | 0.0 | 0.00 | 98.0 |
| MGMT_referral_src_registry | 0 | 1.00 | 0.01 | 0.10 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CTX_SPA_obs | 0 | 1.00 | 0.06 | 0.23 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CTX_SPA_obs_after_lab | 0 | 1.00 | 0.01 | 0.08 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
### 19.4 Ordinal encoding
The most intuitive strategy is to encode each category with a different number. This implies an order among the resulting categories (e.g. 0 < 1 < 2). If a categorical variable does not carry any meaningful order information, this encoding can mislead downstream statistical models, and you might consider using one-hot encoding instead (see below).
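As a minimal sketch in base R (using a small made-up vector rather than the dataset above), an ordinal encoding can be obtained by declaring the level order of a factor and then taking its integer codes:

```{r}
# Hypothetical ordered categories
sizes <- c("small", "large", "medium", "small")

# Declare the order explicitly, then map each category to its integer rank
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
as.integer(sizes_fct)  # small = 1, medium = 2, large = 3
```

Note that the numeric codes only carry meaning because we stated the level order ourselves; the data alone does not tell R that "small" comes before "large".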
### 19.5 Encoding nominal categories (without assuming any order)
One-hot encoding is an alternative encoding that prevents downstream models from making a false assumption about the ordering of categories. For a given feature, it creates as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category is set to 1, while the columns of all other categories are set to 0.
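A minimal sketch with base R, on a toy data frame rather than the dataset above: `model.matrix()` with the intercept removed creates one 0/1 indicator column per category:

```{r}
# Hypothetical nominal variable
toy <- data.frame(area = c("urban", "rural", "urban"))

# `- 1` drops the intercept so every category gets its own 0/1 column
model.matrix(~ area - 1, data = toy)
```

Each row contains a single 1 (in the column matching its category) and 0 elsewhere, which is exactly the one-hot pattern described above.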
We see that encoding a single feature gives a table full of zeros and ones. Sparse matrices are efficient data structures when most of your matrix elements are zero. If you want more details about them, you can look at this introduction to sparse matrices (https://scipy-lectures.org/advanced/scipy_sparse/introduction.html#why-sparse-matrices).
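In R, a sparse one-hot matrix can be produced with the Matrix package (a sketch on made-up data; `sparse.model.matrix()` stores only the non-zero entries instead of the full table of zeros):

```{r}
library(Matrix)

set.seed(42)
# Hypothetical high-cardinality categorical variable
toy <- data.frame(id = factor(sample(letters, 100, replace = TRUE)))

# Sparse dummy matrix: one column per observed category, non-zeros only
X <- sparse.model.matrix(~ id - 1, data = toy)
dim(X)  # 100 rows, one column per distinct category
```

For a variable with many categories, the sparse representation avoids materialising the large block of zeros that a dense one-hot matrix would contain.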
### 19.6 Choosing an encoding strategy
Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).
In general, one-hot encoding is the strategy used when the downstream model is a linear model, while ordinal encoding is often a good strategy with tree-based models. Using an ordinal encoder outputs ordinal categories, i.e. it imposes an order on them. The impact of violating this ordering assumption depends on the downstream model: linear models will be affected by misordered categories, while tree-based models will not. One-hot encoding categorical variables with high cardinality can also cause computational inefficiency in tree-based models; because of this, one-hot encoding is not recommended in such cases, even if the original categories do not have a given order.
You can still use an ordinal encoding with linear models, but you need to be sure that:

- the original categories (before encoding) have an ordering;
- the encoded categories follow the same ordering as the original categories.
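A small base R illustration of why the second point matters: by default, `factor()` sorts levels alphabetically, so the integer codes may not follow the real order unless you state the levels explicitly (made-up values for illustration):

```{r}
x <- c("low", "high", "medium")

# Default: levels are sorted alphabetically (high, low, medium),
# so the integer codes do NOT follow the real low < medium < high order
as.integer(factor(x))

# Fix: state the ordering explicitly before encoding
as.integer(factor(x, levels = c("low", "medium", "high")))
```

With the default alphabetical levels, a linear model would treat "high" as the smallest category; stating the levels explicitly restores the intended ordering.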