```{r}
<- "./data/mydata.csv"
relative_path ```
14 :blue_book: Import external data
14.1 Introduction
Most of the time you will want to generate Quarto documents using your own data. To this aim, you will have to import data from external sources such as files, URLs, or server data (e.g., ODK Central data). There is a dedicated importing function in R and Python for almost every data format. In this section we show you how to import Stata (.dta
), Excel (.xlsx
) and comma-separated values (CSV, .csv
) data formats from files and from URLs, as well as how to import ODK data directly from an ODK Central server.
As you import data to further process / analyse them, you have to store the imported data in a data frame.
14.2 Import data from files
All Python and R functions only require as input the path where the file you want to import is stored. This path has to be passed as a sequence of characters (character) within double (““) or single (’’) quotes.
The path can be either:
- relative to your Quarto document
- absolute
```{r}
<- "C:/Users/myuser/Documents/mydata.csv"
absolute_path ```
Functions in general have additional optional arguments.
14.2.1 Import Excel data
Import the Excel data set dataset1.xlsx and store it into a data frame called df1.
- Stata: use the import excel Stata command with the stata function from the
RStata
package. - R: use the read.xlsx function from the
openxlsx
package. - Python: use the read_excel function from the
pandas
package.
```{r}
# Write your code here
```
Use the import excel Stata command with the stata function from the RStata
package.
```{r}
#| df-print: kable
library(RStata)
<- RStata::stata("import excel ./data/dataset1a.xlsx",
df1_stata data.out = TRUE)
```
Warning in readLines(rdl): Zeile 1 scheint ein nul Zeichen zu enthalten
```{r}
#| df-print: kable
<- head(df1_stata, 5)
df1_stata_sub ::kable(df1_stata_sub)
knitr```
. import excel ./data/dataset1a.xlsx
(7 vars, 10,309 obs)
A | B | C | D | E | F | G |
---|---|---|---|---|---|---|
child_ID | SDC_sex | SDC_age_in_months | CLIN_fever | CLIN_fever_onset | CLIN_cough | CLIN_diarrhoea |
1 | 2 | 10 | 0 | 1 | 1 | |
2 | 2 | 6 | 0 | 0 | 1 | |
3 | 1 | 6 | 0 | 0 | 0 | |
4 | 1 | 11 | 1 | 3 | 1 | 1 |
Use the read.xlsx function from the openxlsx
package.
```{r}
library(openxlsx)
<- openxlsx::read.xlsx("./data/dataset1a.xlsx")
df1_r <- head(df1_r, 5)
df1_r_sub ::kable(df1_r_sub)
knitr```
child_ID | SDC_sex | SDC_age_in_months | CLIN_fever | CLIN_fever_onset | CLIN_cough | CLIN_diarrhoea |
---|---|---|---|---|---|---|
1 | 2 | 10 | 0 | NA | 1 | 1 |
2 | 2 | 6 | 0 | NA | 0 | 1 |
3 | 1 | 6 | 0 | NA | 0 | 0 |
4 | 1 | 11 | 1 | 3 | 1 | 1 |
5 | 2 | 21 | 1 | 2 | 0 | 0 |
Use the read_excel function from the pandas
package.
```{python}
import pandas
= pandas.read_excel('./data/dataset1a.xlsx')
df1_python = df1_python.head(5)
df1_python_sub ```
C:\Users\langhe\AppData\Local\R\win-library\4.3\reticulate\python\rpytools\loader.py:117: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
return _find_and_load(name, import_)
library(reticulate)
Warning: Paket 'reticulate' wurde unter R Version 4.3.3 erstellt
::kable(py$df1_python_sub) knitr
child_ID | SDC_sex | SDC_age_in_months | CLIN_fever | CLIN_fever_onset | CLIN_cough | CLIN_diarrhoea |
---|---|---|---|---|---|---|
1 | 2 | 10 | 0 | NaN | 1 | 1 |
2 | 2 | 6 | 0 | NaN | 0 | 1 |
3 | 1 | 6 | 0 | NaN | 0 | 0 |
4 | 1 | 11 | 1 | 3 | 1 | 1 |
5 | 2 | 21 | 1 | 2 | 0 | 0 |
14.2.2 Import CSV data
Read the CSV data set dataset1.csv and store it into a data frame called df2.
```{r}
# Write your code here
```
```{r}
library(RStata)
<- RStata::stata("import delimited ./data/dataset1b.csv",
df2 data.out = TRUE)
```
. import delimited ./data/dataset1b.csv
(encoding automatically selected: ISO-8859-1)
(7 vars, 10,308 obs)
```{r}
<- read.csv("./data/dataset1b.csv")
df2_r <- head(df2_r, 5)
df2_r_sub ::kable(df2_r_sub)
knitr```
child_ID | SDC_sex | SDC_age_in_months | CLIN_fever | CLIN_fever_onset | CLIN_cough | CLIN_diarrhoea |
---|---|---|---|---|---|---|
1 | 2 | 10 | 0 | NA | 1 | 1 |
2 | 2 | 6 | 0 | NA | 0 | 1 |
3 | 1 | 6 | 0 | NA | 0 | 0 |
4 | 1 | 11 | 1 | 3 | 1 | 1 |
5 | 2 | 21 | 1 | 2 | 0 | 0 |
Use the read_csv function from the pandas
package.
```{python}
import pandas
= pandas.read_csv('./data/dataset1b.csv')
df2 ```
14.2.3 Import Stata data
Read the Stata data set dataset1.dta and store it into a data frame called df3.
- Stata: use the
use
Stata command with the stata function from theRStata
package. - R: use the read_dta function from the
haven
package. This package supports SAS, STATA and SPSS software. - Python: use the read_stata function from the
pandas
package.
```{r}
# Write your code here
```
```{r}
library(RStata)
<- RStata::stata("use ./data/dataset1c.dta",
df3 data.out = TRUE)
```
. use ./data/dataset1c.dta
```{r}
library(haven)
<- haven::read_dta("./data/dataset1c.dta")
df3 ```
Use the read_stata function from the pandas
package.
```{python}
import pandas
= pandas.read_stata('./data/dataset1c.dta')
df3 ```
14.3 Import data from URLs
All functions can accept URLs as well instead of the path to a specific file.
Import the CSV data set that contains a comprehensive spatial inventory of 98,745 public health facilities in Sub Saharan Africa directly from the following url and store it into a data frame called df4.
To learn more about how this data set was assembled, please refer to (1)
- Stata: use the import excel Stata command with the
stata
function from theRStata
package. - R: use the read.xlsx function from the
openxlsx
package. - Python: use the read_excel function from the
pandas
package. In the latest version of pandas (0.19.2) you can directly pass the url
```{r}
# Write your code here
```
Here, because quotes are already used for the Stata command, you need to use the other type of quotes for indicating the URL.
```{r}
library(RStata)
#df2 <- RStata::stata('import delimited "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"',
# data.out = TRUE)
```
Use the read.xlsx function from the openxlsx
package.
```{r}
#csv_url <- "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"
#df4 <- read.csv(csv_url)
```
Use the read_excel function from the pandas
package.
```{python}
import pandas
#csv_url = "https://open.africa/dataset/d7335980-29d5-476c-bf7a-feb4e22cf631/resource/e2432e8a-cf15-4a8b-b8c3-567f443c1459/download/cfa-data-hospitals-in-africa-00-ssa-mfl-130219.xlsx-ssa-mfl.csv"
#df1 = pandas.read_csv(csv_url)
```