Introducing the principles of Tidy Data

Najko Jahn

What is Tidy Data?

According to Hadley Wickham tidy data is a standard way to organize data values within a dataset

From R for Data Science

Why is it important?

  • Organizing datasets as tidy data makes data cleaning efforts easier
  • Broad range of analytical tools are built upon the assumption to consume tidy data
  • Sharing tidy data increases re-use

Tools to tidy data

http://tidyverse.org/

install.packages("tidyverse")

Gathering with tidyr

Problem: One variable might be spread across multiple columns.

library(tidyverse)
table4b
# A tibble: 3 × 3
      country     `1999`     `2000`
*       <chr>      <int>      <int>
1 Afghanistan   19987071   20595360
2      Brazil  172006362  174504898
3       China 1272915272 1280428583

Gathering with tidyr

table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
# A tibble: 6 × 3
      country  year population
        <chr> <chr>      <int>
1 Afghanistan  1999   19987071
2      Brazil  1999  172006362
3       China  1999 1272915272
4 Afghanistan  2000   20595360
5      Brazil  2000  174504898
6       China  2000 1280428583

Gathering with tidyr

table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population") %>%
  group_by(country) %>%
  summarize(mean(population))
# A tibble: 3 × 2
      country `mean(population)`
        <chr>              <dbl>
1 Afghanistan           20291216
2      Brazil          173255630
3       China         1276671928

Spreading with tidyr

Problem: One observation might be scattered across multiple rows.

head(table2)
# A tibble: 6 × 4
      country  year       type     count
        <chr> <int>      <chr>     <int>
1 Afghanistan  1999      cases       745
2 Afghanistan  1999 population  19987071
3 Afghanistan  2000      cases      2666
4 Afghanistan  2000 population  20595360
5      Brazil  1999      cases     37737
6      Brazil  1999 population 172006362

Spreading with tidyr

spread(table2, key = type, value = count)
# A tibble: 6 × 4
      country  year  cases population
*       <chr> <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

Separate with tidyr

Problem: One column contains more than one variable

table3
# A tibble: 6 × 3
      country  year              rate
*       <chr> <int>             <chr>
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583

Separate with tidyr

Problem: One column contains more than one variable

table3 %>%
  separate(rate, into = c("cases", "population"))
# A tibble: 6 × 4
      country  year  cases population
*       <chr> <int>  <chr>      <chr>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

And now it's your turn!

Examples taken from:

http://r4ds.had.co.nz/tidy-data.html#non-tidy-data