Reproducible Data Science for Public Health - 14 📘 Import external data

14.1 Introduction

14.2 Import data from files

14.3 Import data from URLs

14.2.1 Import Excel data

14.2.2 Import CSV data

14.2.3 Import Stata data

Most of the time you will want to generate Quarto documents using your own data. To this aim, you will have to import data from external sources such as files, URLs, or server data (e.g., ODK Central data). There is a dedicated importing function in R and Python for almost every data format. In this section we show you how to import Stata (.dta), Excel (.xlsx) and comma-separated values (CSV, .csv) data formats from files and from URLs, as well as how to import ODK data directly from an ODK Central server.

Important

As you import data to further process / analyse them, you have to store the imported data in a data frame.

All Python and R functions only require as input the path where the file you want to import is stored. This path has to be passed as a sequence of characters (character) within double (““) or single (’’) quotes.

The path can be either:

relative to your Quarto document

```{r}
relative_path <- "./data/mydata.csv"
```

absolute

```{r}
absolute_path <- "C:/Users/myuser/Documents/mydata.csv"
```

Functions in general have additional optional arguments.

Import the Excel data set dataset1.xlsx and store it into a data frame called df1.

Tip

Stata: use the import excel Stata command with the stata function from the RStata package.
R: use the read.xlsx function from the openxlsx package.
Python: use the read_excel function from the pandas package.

```{r}
# Write your code here
```

Use the import excel Stata command with the stata function from the RStata package.

```{r}
#| df-print: kable
library(RStata)

df1_stata <- RStata::stata("import excel ./data/dataset1a.xlsx",
                     data.out = TRUE)
```

Warning in readLines(rdl): Zeile 1 scheint ein nul Zeichen zu enthalten

```{r}
#| df-print: kable
df1_stata_sub <- head(df1_stata, 5)
knitr::kable(df1_stata_sub)
```

. import excel ./data/dataset1a.xlsx
(7 vars, 10,309 obs)

Use the read.xlsx function from the openxlsx package.

```{r}
library(openxlsx)

df1_r <- openxlsx::read.xlsx("./data/dataset1a.xlsx")
df1_r_sub <- head(df1_r, 5)
knitr::kable(df1_r_sub)
```

child_ID	SDC_sex	SDC_age_in_months	CLIN_fever	CLIN_fever_onset	CLIN_cough	CLIN_diarrhoea
1	2	10	0	NA	1	1
2	2	6	0	NA	0	1
3	1	6	0	NA	0	0
4	1	11	1	3	1	1
5	2	21	1	2	0	0

Use the read_excel function from the pandas package.

```{python}
import pandas

df1_python = pandas.read_excel('./data/dataset1a.xlsx')
df1_python_sub = df1_python.head(5)
```

C:\Users\langhe\AppData\Local\R\win-library\4.3\reticulate\python\rpytools\loader.py:117: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
  return _find_and_load(name, import_)

library(reticulate)

Warning: Paket 'reticulate' wurde unter R Version 4.3.3 erstellt

knitr::kable(py$df1_python_sub)

child_ID	SDC_sex	SDC_age_in_months	CLIN_fever	CLIN_fever_onset	CLIN_cough	CLIN_diarrhoea
1	2	10	0	NaN	1	1
2	2	6	0	NaN	0	1
3	1	6	0	NaN	0	0
4	1	11	1	3	1	1
5	2	21	1	2	0	0

Read the CSV data set dataset1.csv and store it into a data frame called df2.

Tip

Stata: use the import delimited Stata command with the stata function from the RStata package.
R: use the read.csv function from the haven package.
Python: use the read_csv function from the pandas package.

```{r}
# Write your code here
```

```{r}
library(RStata)

df2 <- RStata::stata("import delimited ./data/dataset1b.csv",
                     data.out = TRUE)
```

. import delimited ./data/dataset1b.csv
(encoding automatically selected: ISO-8859-1)
(7 vars, 10,308 obs)

```{r}
df2_r <- read.csv("./data/dataset1b.csv")
df2_r_sub <- head(df2_r, 5)
knitr::kable(df2_r_sub)
```

child_ID	SDC_sex	SDC_age_in_months	CLIN_fever	CLIN_fever_onset	CLIN_cough	CLIN_diarrhoea
1	2	10	0	NA	1	1
2	2	6	0	NA	0	1
3	1	6	0	NA	0	0
4	1	11	1	3	1	1
5	2	21	1	2	0	0

Use the read_csv function from the pandas package.

```{python}
import pandas

df2 = pandas.read_csv('./data/dataset1b.csv')
```

Read the Stata data set dataset1.dta and store it into a data frame called df3.

Tip

Stata: use the use Stata command with the stata function from the RStata package.
R: use the read_dta function from the haven package. This package supports SAS, STATA and SPSS software.
Python: use the read_stata function from the pandas package.

```{r}
# Write your code here
```

```{r}
library(RStata)

df3 <- RStata::stata("use ./data/dataset1c.dta",
                     data.out = TRUE)
```

. use ./data/dataset1c.dta

```{r}
library(haven)

df3 <- haven::read_dta("./data/dataset1c.dta")
```

Use the read_stata function from the pandas package.

```{python}
import pandas

df3 = pandas.read_stata('./data/dataset1c.dta')
```

SDC_age_in_months

CLIN_fever

CLIN_fever_onset

CLIN_cough

CLIN_diarrhoea

All functions can accept URLs as well instead of the path to a specific file.

Import the CSV data set that contains a comprehensive spatial inventory of 98,745 public health facilities in Sub Saharan Africa directly from the following url and store it into a data frame called df4.

To learn more about how this data set was assembled, please refer to (1)

Tip

Stata: use the import excel Stata command with the stata function from the RStata package.
R: use the read.xlsx function from the openxlsx package.
Python: use the read_excel function from the pandas package. In the latest version of pandas (0.19.2) you can directly pass the url

```{r}
# Write your code here
```

Here, because quotes are already used for the Stata command, you need to use the other type of quotes for indicating the URL.

```{r}
library(RStata)

#df2 <- RStata::stata('import delimited "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"',
                     # data.out = TRUE)
```

Use the read.xlsx function from the openxlsx package.

```{r}
#csv_url <- "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"
#df4 <- read.csv(csv_url)
```

Use the read_excel function from the pandas package.

```{python}
import pandas

#csv_url = "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"
#df1 = pandas.read_csv(csv_url)
```