14  :blue_book: Import external data

14.1 Introduction

Most of the time you will want to generate Quarto documents using your own data. To this aim, you will have to import data from external sources such as files, URLs, or server data (e.g., ODK Central data). There is a dedicated importing function in R and Python for almost every data format. In this section we show you how to import Stata (.dta), Excel (.xlsx) and comma-separated values (CSV, .csv) data formats from files and from URLs, as well as how to import ODK data directly from an ODK Central server.

Important

As you import data to further process / analyse them, you have to store the imported data in a data frame.

14.2 Import data from files

All Python and R functions only require as input the path where the file you want to import is stored. This path has to be passed as a sequence of characters (character) within double (““) or single (’’) quotes.

The path can be either:

  • relative to your Quarto document
```{r}
relative_path <- "./data/mydata.csv"
```
  • absolute
```{r}
absolute_path <- "C:/Users/myuser/Documents/mydata.csv"
```

Functions in general have additional optional arguments.

14.2.1 Import Excel data

Import the Excel data set dataset1.xlsx and store it into a data frame called df1.

Tip
  • Stata: use the import excel Stata command with the stata function from the RStata package.
  • R: use the read.xlsx function from the openxlsx package.
  • Python: use the read_excel function from the pandas package.
```{r}
# Write your code here
```

Use the import excel Stata command with the stata function from the RStata package.

```{r}
#| df-print: kable
library(RStata)

df1_stata <- RStata::stata("import excel ./data/dataset1a.xlsx",
                     data.out = TRUE)
```
Warning in readLines(rdl): Zeile 1 scheint ein nul Zeichen zu enthalten
```{r}
#| df-print: kable
df1_stata_sub <- head(df1_stata, 5)
knitr::kable(df1_stata_sub)
```
. import excel ./data/dataset1a.xlsx
(7 vars, 10,309 obs)
A B C D E F G
child_ID SDC_sex SDC_age_in_months CLIN_fever CLIN_fever_onset CLIN_cough CLIN_diarrhoea
1 2 10 0 1 1
2 2 6 0 0 1
3 1 6 0 0 0
4 1 11 1 3 1 1

Use the read.xlsx function from the openxlsx package.

```{r}
library(openxlsx)

df1_r <- openxlsx::read.xlsx("./data/dataset1a.xlsx")
df1_r_sub <- head(df1_r, 5)
knitr::kable(df1_r_sub)
```
child_ID SDC_sex SDC_age_in_months CLIN_fever CLIN_fever_onset CLIN_cough CLIN_diarrhoea
1 2 10 0 NA 1 1
2 2 6 0 NA 0 1
3 1 6 0 NA 0 0
4 1 11 1 3 1 1
5 2 21 1 2 0 0

Use the read_excel function from the pandas package.

```{python}
import pandas

df1_python = pandas.read_excel('./data/dataset1a.xlsx')
df1_python_sub = df1_python.head(5)
```
C:\Users\langhe\AppData\Local\R\win-library\4.3\reticulate\python\rpytools\loader.py:117: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
  return _find_and_load(name, import_)
library(reticulate)
Warning: Paket 'reticulate' wurde unter R Version 4.3.3 erstellt
knitr::kable(py$df1_python_sub)
child_ID SDC_sex SDC_age_in_months CLIN_fever CLIN_fever_onset CLIN_cough CLIN_diarrhoea
1 2 10 0 NaN 1 1
2 2 6 0 NaN 0 1
3 1 6 0 NaN 0 0
4 1 11 1 3 1 1
5 2 21 1 2 0 0

14.2.2 Import CSV data

Read the CSV data set dataset1.csv and store it into a data frame called df2.

Tip
  • Stata: use the import delimited Stata command with the stata function from the RStata package.
  • R: use the read.csv function from the haven package.
  • Python: use the read_csv function from the pandas package.
```{r}
# Write your code here
```
```{r}
library(RStata)

df2 <- RStata::stata("import delimited ./data/dataset1b.csv",
                     data.out = TRUE)
```
. import delimited ./data/dataset1b.csv
(encoding automatically selected: ISO-8859-1)
(7 vars, 10,308 obs)
```{r}
df2_r <- read.csv("./data/dataset1b.csv")
df2_r_sub <- head(df2_r, 5)
knitr::kable(df2_r_sub)
```
child_ID SDC_sex SDC_age_in_months CLIN_fever CLIN_fever_onset CLIN_cough CLIN_diarrhoea
1 2 10 0 NA 1 1
2 2 6 0 NA 0 1
3 1 6 0 NA 0 0
4 1 11 1 3 1 1
5 2 21 1 2 0 0

Use the read_csv function from the pandas package.

```{python}
import pandas

df2 = pandas.read_csv('./data/dataset1b.csv')
```

14.2.3 Import Stata data

Read the Stata data set dataset1.dta and store it into a data frame called df3.

Tip
  • Stata: use the use Stata command with the stata function from the RStata package.
  • R: use the read_dta function from the haven package. This package supports SAS, STATA and SPSS software.
  • Python: use the read_stata function from the pandas package.
```{r}
# Write your code here
```
```{r}
library(RStata)

df3 <- RStata::stata("use ./data/dataset1c.dta",
                     data.out = TRUE)
```
. use ./data/dataset1c.dta
```{r}
library(haven)

df3 <- haven::read_dta("./data/dataset1c.dta")
```

Use the read_stata function from the pandas package.

```{python}
import pandas

df3 = pandas.read_stata('./data/dataset1c.dta')
```

14.3 Import data from URLs

All functions can accept URLs as well instead of the path to a specific file.

Import the CSV data set that contains a comprehensive spatial inventory of 98,745 public health facilities in Sub Saharan Africa directly from the following url and store it into a data frame called df4.

To learn more about how this data set was assembled, please refer to (1)

Tip
  • Stata: use the import excel Stata command with the stata function from the RStata package.
  • R: use the read.xlsx function from the openxlsx package.
  • Python: use the read_excel function from the pandas package. In the latest version of pandas (0.19.2) you can directly pass the url
```{r}
# Write your code here
```

Here, because quotes are already used for the Stata command, you need to use the other type of quotes for indicating the URL.

```{r}
library(RStata)

#df2 <- RStata::stata('import delimited "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"',
                     # data.out = TRUE)
```

Use the read.xlsx function from the openxlsx package.

```{r}
#csv_url <- "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"
#df4 <- read.csv(csv_url)
```

Use the read_excel function from the pandas package.

```{python}
import pandas

#csv_url = "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"
#df1 = pandas.read_csv(csv_url)
```