Managing data projects

Andrés Cruz

2022-09-30

Roadmap (full)

1 Follow the one-project-one-folder rule

  • Structure your projects purposefully.
  • Use RStudio projects.

Structure your projects purposefully

  • Always keep raw datasets! Save temporary “clean” datasets if needed.

  • Common project structure:

my_project/
├─ raw/    # raw datasets
├─ temp/   # "clean" datasets
├─ output/ # figures, tables, model results
├─ 01_script_first.R
├─ 02_script_second.R # ...
├─ README.txt
  • Our example today:
project_example/
├─ raw/
│  ├─ qog_bas_cs_jan22.csv
├─ temp/
│  ├─ 01_qog_clean.csv
├─ output/
│  ├─ 02_model_coefplot.pdf
│  ├─ 02_model_results.rds
├─ 01_clean_data.R
├─ 02_run_model.R
├─ 99_paper_example.qmd
├─ 99_paper_example.pdf
├─ project_example.Rproj
├─ README_project_example.txt
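
  • The README is a plain-text note for whoever opens the project (including future you). A hypothetical sketch of its contents:

Replication materials for the project example.
Run the scripts in order: 01_clean_data.R, then 02_run_model.R.
raw/ is never modified by code; temp/ and output/ are regenerated by the scripts.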

Use RStudio projects

  • A project can be anything: a paper, a homework assignment, your CV…

  • RStudio Projects are just folders, like the ones we just saw, with an .Rproj file indicating that the folder is an RStudio Project.

  • Never worry about working directories again. Simply reference your files within the folder. For example: read_csv("raw/qog_bas_cs_jan22.csv").

  • Start afresh every time you open RStudio with the following options (Tools > Global Options… > General):

    • Uncheck “Restore .RData into workspace at startup”
    • Set “Save Workspace to .RData on exit” to “Never”

2 Automate your output creation

  • Use Quarto (or R Markdown)
  • Make publication-ready tables with modelsummary
  • Save plots with ggplot2::ggsave()

Use Quarto (or R Markdown)

  • Keep code and output in the same document! No more copy-pasting tables.

  • You can write short reports, presentations (with revealjs or Beamer), or papers! Let’s see an example in 99_paper_example.qmd.

  • It compiles automatically to .html or .pdf (via LaTeX): everything works!

    • Installing LaTeX locally is fairly easy. Just run the following code once and you’ll be able to compile to PDF:
install.packages("tinytex")
tinytex::install_tinytex()
  • You can insert figures, tables, citations, LaTeX equations…
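
  • A minimal sketch of what a .qmd source file looks like (the title, equation, and chunk contents here are just placeholders):

---
title: "My report"
format: pdf
---

Some text, an inline equation $y_i = \beta_0 + \beta_1 x_i + u_i$, and a
code chunk whose output gets embedded in the document:

```{r}
#| echo: false
summary(lm(mpg ~ hp, data = mtcars))
```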

Make publication-ready tables with modelsummary

m1 <- lm(mpg ~ hp, data = mtcars)
m2 <- fixest::feols(mpg ~ hp | cyl, data = mtcars, vcov = "cluster")
modelsummary::modelsummary(list(m1, m2))
              Model 1    Model 2
(Intercept)    30.099
              (1.634)
hp             −0.068     −0.024
              (0.010)    (0.015)
Num.Obs.           32         32
R2              0.602
R2 Adj.         0.589
AIC             181.2      163.9
BIC             185.6      166.8
Log.Lik.      −87.619
RMSE             3.74       2.94
Std.Errors               by: cyl
FE: cyl                        X

Make publication-ready tables with modelsummary (2)

modelsummary::modelsummary(
  list("(1) OLS" = m1, "(2) FE" = m2), 
  gof_omit = "Log|RMSE|[AB]IC", 
  coef_map = c("(Intercept)" = "Intercept", "hp" = "Horsepower")
)
             (1) OLS    (2) FE
Intercept     30.099
             (1.634)
Horsepower    −0.068    −0.024
             (0.010)   (0.015)
Num.Obs.          32        32
R2             0.602
R2 Adj.        0.589
Std.Errors             by: cyl
FE: cyl                      X

Make publication-ready tables with modelsummary (3)

  • If compiling to PDF/LaTeX, the same code will automatically produce a properly formatted LaTeX table.
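
  • You can also write the table directly to a file and \input{} it from your LaTeX document. A sketch (the file name is just an example; modelsummary infers the format from the extension):

modelsummary::modelsummary(
  list("(1) OLS" = m1, "(2) FE" = m2),
  output = "output/02_model_table.tex"
)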

Save plots with ggplot2::ggsave()

library(ggplot2)

p_mpg <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
p_mpg

Save plots with ggplot2::ggsave() (2)

  • Don’t use “Export” from RStudio. It’s not reproducible!

  • Use ggplot2::ggsave() instead:
ggsave(plot = p_mpg, filename = "output/plot_mpg.png",
       width = 8,  # other values to try: 10, 12
       height = 6, 
       scale = .8) # lower values will make text and other elements larger
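
  • ggsave() infers the file type from the extension, so filename = "output/plot_mpg.pdf" would save a vector PDF instead of a PNG.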

3 Write more solid code

  • Write abstract code when possible
  • Write code in a consistent style
  • Don’t trust copy-pasting; write iterations!
  • Add sanity checks with assertthat::assert_that()

Write abstract code when possible

  • We want our code to be flexible enough to withstand minor changes upstream. In general, use objects/names instead of numbers!

For columns:

# bad
mtcars[, 1]
# good
mtcars$mpg
mtcars[["mpg"]]

For rows:

# bad
sum(mtcars$mpg) / 32
# good
sum(mtcars$mpg) / nrow(mtcars)

Write code in a consistent style

  • I highly recommend the tidyverse style guide (even if you don’t use the tidyverse). But here are the most important parts, in my opinion.

  • Use descriptive object names. df_anes is better than data.

  • Follow the snake_case convention for object names (including columns). Don’t use capital letters or spaces. df_anes is better than dfAnes.

  • Use janitor::clean_names() to clean messy variable names:

readxl::read_excel("raw/example_excel.xlsx") %>% 
  janitor::clean_names()
# A tibble: 2 × 3
  id_variable gdp_agr_fish x2010_production_log
        <dbl>        <dbl>                <dbl>
1           1         3000                   30
2           2         2000                   20

Don’t trust copy-pasting; write iterations!

  • Let’s say that we want to multiply all of our variables by 100.

  • This gets old fast… and it’s prone to error.

mtcars$mpg  <- mtcars$mpg * 100
mtcars$cyl  <- mtcars$cyl * 100
mtcars$disp <- mtcars$disp * 100
# ...
  • Use iterations instead! Here’s a for-loop:
mtcars2 <- mtcars # keep an untouched copy of the original
for (var in colnames(mtcars)) {
  mtcars[[var]] <- mtcars[[var]] * 100
}
  • dplyr functions can be iterated over columns with across(). The lines below are alternatives, each meant to run on a fresh copy of the data; the printed output corresponds to the last one:
library(dplyr)
mtcars <- mtcars %>% mutate(across(everything(), ~ . * 100))      # every column
mtcars <- mtcars %>% mutate(across(where(is.numeric), ~ . * 100)) # only numeric columns
( mtcars <- mtcars %>% mutate(across(mpg:disp, ~ . * 100)) )      # only columns mpg through disp
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           2100 600 16000 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       2100 600 16000 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          2280 400 10800  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      2140 600 25800 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   1870 800 36000 175 3.15 3.440 17.02  0  0    3    2
Valiant             1810 600 22500 105 2.76 3.460 20.22  1  0    3    1
Duster 360          1430 800 36000 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           2440 400 14670  62 3.69 3.190 20.00  1  0    4    2
Merc 230            2280 400 14080  95 3.92 3.150 22.90  1  0    4    2
Merc 280            1920 600 16760 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           1780 600 16760 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          1640 800 27580 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          1730 800 27580 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         1520 800 27580 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  1040 800 47200 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 1040 800 46000 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   1470 800 44000 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            3240 400  7870  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         3040 400  7570  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      3390 400  7110  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       2150 400 12010  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    1550 800 31800 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         1520 800 30400 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          1330 800 35000 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    1920 800 40000 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           2730 400  7900  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       2600 400 12030  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        3040 400  9510 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      1580 800 35100 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        1970 600 14500 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       1500 800 30100 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          2140 400 12100 109 4.11 2.780 18.60  1  1    4    2

Add sanity checks with assertthat::assert_that()

  • Sometimes you perform an operation and then manually check whether it worked well. Formalize this so you don’t have to do it every time something else changes!

  • This is very common in joins. For instance, you might want your output to preserve the number of rows of the X data frame. If not, you know something went wrong…

( df_x <- data.frame(id = c("A", "B", "C"), x_val = 1:3) )
  id x_val
1  A     1
2  B     2
3  C     3
( df_y <- data.frame(id = c("B", "C", "C"), y_val = 11:13) )
  id y_val
1  B    11
2  C    12
3  C    13
df_merged <- left_join(df_x, df_y)
assertthat::assert_that(nrow(df_x) == nrow(df_merged))
Error: nrow(df_x) not equal to nrow(df_merged)
assertthat::assert_that(sum(is.na(df_merged)) == 0)
Error: sum(is.na(df_merged)) not equal to 0
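
  • Once a check fails, you can track down the problem and fix it explicitly. A hypothetical fix, assuming the duplicated "C" in df_y is an error (df_y_dedup is a name introduced here just for illustration):

df_y_dedup <- dplyr::distinct(df_y, id, .keep_all = TRUE) # drop the duplicated id
df_merged  <- dplyr::left_join(df_x, df_y_dedup, by = "id")
assertthat::assert_that(nrow(df_x) == nrow(df_merged))    # TRUE
  • The NA check would still fail, though, since "A" has no match in df_y.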

4 Make your workflow reproducible

  • Set seeds when randomizing
  • Use renv to make a package vault
  • Automate your entire workflow with targets

Set seeds when randomizing

  • Every time you run this, you’ll get a different result:
rnorm(n = 5, mean = 0, sd = 1)
[1]  0.53327428 -1.92674383 -0.09157776 -1.27632757  1.04407709
  • This is the case in many procedures that use randomness (MCMC, cross-validation, etc.).

  • Whenever you run code with a random component, set a seed first to ensure reproducibility:

set.seed(123)
rnorm(n = 5, mean = 0, sd = 1)
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
set.seed(123)
rnorm(n = 5, mean = 0, sd = 1)
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774

Use renv to make a package vault

  • Keeping track of package versions is important for future reproducibility.

  • Using renv makes the most sense when making a replication package, or when sharing an advanced project with colleagues.

  • Run renv::init() in your project to create a record of all packages used and their versions. This will create a few new files and folders in your project (most notably, renv/ and the renv.lock record).

  • If you ever need to update this, run renv::snapshot().

  • When someone else wants to run your code, they start with renv::restore(). This will read the renv.lock record and install all the needed packages in their appropriate versions.
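
  • The whole cycle in code (a sketch; each line runs at a different stage, not all at once):

install.packages("renv") # once per machine
renv::init()             # start the vault: creates renv/ and renv.lock
renv::snapshot()         # later: update renv.lock with current versions
renv::restore()          # on another machine: install the recorded versions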

Automate your entire workflow with targets

  • When projects grow large, it’s hard to keep track of how everything’s related.

  • Say that you just caught a bug in one of your data cleaning scripts. What analyses do you need to rerun downstream?

  • targets builds a dependency graph for your datasets/scripts. If you edit a node anywhere in the flow, targets knows what other nodes become out of sync, and can rerun them automatically.
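
  • A minimal sketch of the _targets.R file that defines such a pipeline for our example project (clean_qog() and run_model() are hypothetical functions you would write yourself):

# _targets.R, at the project root
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  tar_target(raw_file, "raw/qog_bas_cs_jan22.csv", format = "file"),
  tar_target(data_clean, clean_qog(readr::read_csv(raw_file))),
  tar_target(model, run_model(data_clean))
)

  • Then targets::tar_make() builds the pipeline; after any edit, it reruns only the targets that are out of date.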

Other resources