24  2022 Tanzania

24.1 Learning objectives

Two complementary aspects of moving into data science are:

  1. the mindset: how scientists think about data and collaborate around it, and
  2. the skillset, which is composed of an ecosystem of tools (mostly open-source) and practices.

Upon completing the workshop, participants will have gained:

  • exposure to the data science approach, tools, and collaborative practices
  • hands-on experience interfacing between Stata and R, the basics of working with data in R/RStudio, and ways to incrementally incorporate R into existing Stata data analysis workflows. The idea is not to replace everything you do in Stata with R, but to enable you to continue learning at your own pace after the workshop.

24.2 Is this workshop for me?

This workshop is relevant for individuals who answer yes to the following questions:

  • Do you want to develop data science projects in public health?
  • Do you want to learn more about how open and reproducible science approaches can be used in your daily practice?
  • Are you a user of Stata (or another data analysis language) who would like to expand your data analysis skillset with R?
  • Do you want to bridge analyses between data analysis tools (Stata, R or Python) and collaborate more easily with researchers who use a different one of these tools?

24.3 Schedule

🗓️ September 26-28, 2022
🕘 09:00 - 17:00
🌇 Dar es Salaam, Tanzania (Protea Hotel by Marriott Dar es Salaam Courtyard)

24.3.1 Before the workshop

  1. Fill out the online pre-workshop questionnaire
  2. Install the (free) data science software that will be used during the workshop on your laptop. If you have any difficulties with the installation, support will be available on the first day of the workshop, before the first session or during breaks.

24.3.2 Day 1

Table 24.1: Schedule Day 1
Time Session
08.30 - 09.00 Welcome
Support for software installation
09.00 - 09.15 Introduction to data science tools
Overview of objectives for Day 1
09.15 - 10.30 Version control with Git
10.30 - 11.00 🍵 Break
11.00 - 12.00 Introduction to dynamic documents and Quarto
12.00 - 13.00 Use Quarto with Stata
13.00 - 14.00 🍴 Lunch break
14.00 - 15.00 Import and manipulate external data (1)
15.00 - 15.30 Import and manipulate external data (2)
15.30 - 16.00 🍵 Break
16.00 - 17.00 Share code and collaborate with Git

24.3.3 Day 2

Table 24.2: Schedule Day 2
Time Session (all)
08.30 - 09.00 Welcome
09.00 - 09.15 Introduction to Data Science for Public Health
Overview of objectives for Day 2
09.15 - 10.30 Discussion on concepts related to health data for decision-making
10.30 - 11.00 🍵 Break
11.00 - 11.15 Malaria use case - Presentation of the data
11.15 - 11.45 Malaria use case - Interdisciplinary discussion
11.45 - 12.30 Malaria use case - Data practicals by interdisciplinary groups
12.30 - 13.00 Malaria use case - Feedback on findings from practicals
13.00 - 14.00 🍴 Lunch break
14.00 - 14.30 Malaria use case - Interdisciplinary discussion
14.00 - 15.30 Malaria use case
Analysis: data practicals
Interpretation: discussion on data sources and interpretation
15.30 - 16.00 🍵 Break
16.00 - 17.00 Malaria use case - Feedback on practicals

24.3.4 Day 3

Table 24.3: Schedule Day 3
Time Session (all)
08.30 - 09.00 Welcome
09.00 - 09.15 Interdisciplinary introduction to big data and machine learning
Overview of objectives for Day 3
09.15 - 10.00 Discussion on secondary data sources (public datasets, e.g. DHS, Facebook, facilities, etc.)
Benefits and drawbacks of primary versus secondary data sources
10.00 - 10.30 🍵 Break
11.00 - 13.00 Analysis: Introduction to machine learning
Interpretation: Critically discuss data surveys/reports
13.00 - 14.00 🍴 Lunch break
14.00 - 14.30 Speed talks - research presentations
14.30 - 15.30 Feedback on findings from practicals
15.30 - 16.30 Feedback on workshop - Wrap-up

24.3.5 After the workshop

  1. Fill out the online post-workshop questionnaire

24.4 Scope

This workshop aims to support researchers in progressing along the following development axes:

24.4.1 Data science mindset

  • Use of reproducible research practices in public health
  • Data provenance
    • Use of distinct data sources for the development of public health indicators
    • Research data vs. real-world evidence data
  • Ethical data science
  • Data papers

24.4.2 Data science skillset

  • Programming tools
    • Move from Stata to R (prerequisite: Stata)
    • R programming
      • dplyr
    • Python programming
      • pandas
      • scikit-learn (prerequisite: independent Python user)
  • Coding with best practices (R/RStudio/tidyverse)
    • Versioning using GitHub (all)
    • Using targets (prerequisite: independent R user)
  • Reporting and publishing: Dynamic report generation
  • Reproducible data
    • Use APIs (prerequisite: IT programming basics)
    • Open access data (all)
  • Statistical methods for reproducible research (advanced)

24.4.3 What is not covered

  • Reproducible workflows (targets)
  • Reproducible environments (Binder, Docker, renv, etc)

24.5 Conventions

Discussion activity 💬

Reflection activity 💭

Coding activity 💻