Andrés Cruz
2022-09-30
Always keep raw datasets! Save temporary “clean” datasets if needed.
Common project structure:
my_project/
├─ raw/ # raw datasets
├─ temp/ # "clean" datasets
├─ output/ # figures, tables, model results
├─ 01_script_first.R
├─ 02_script_second.R # ...
├─ README.txt
project_example/
├─ raw/
│ ├─ qog_bas_cs_jan22.csv
├─ temp/
│ ├─ 01_qog_clean.csv
├─ output/
│ ├─ 02_model_coefplot.pdf
│ ├─ 02_model_results.rds
├─ 01_clean_data.R
├─ 02_run_model.R
├─ 99_paper_example.qmd
├─ 99_paper_example.pdf
├─ project_example.Rproj
├─ README_project_example.txt
A project can be anything: a paper, a homework assignment, your CV…
RStudio Projects are folders, just like the ones we just saw. An .Rproj file marks the folder as an RStudio Project.
Never worry about working directories again. Simply reference your files within the folder. For example: read_csv("raw/qog_bas_cs_jan22.csv").
Start afresh every time you open RStudio with the following options (Tools > Global Options… > General):
Keep code and output in the same document! No more copy-pasting tables.
You can write short reports, presentations (with revealjs or Beamer), or papers! Let’s see an example in 99_paper_example.qmd.
It compiles automatically to .html or .pdf (via LaTeX; everything works!).
modelsummary
modelsummary::modelsummary(
  list("(1) OLS" = m1, "(2) FE" = m2),
  gof_omit = "Log|RMSE|[AB]IC",
  coef_map = c("(Intercept)" = "Intercept", "hp" = "Horsepower")
)
| | (1) OLS | (2) FE |
|---|---|---|
| Intercept | 30.099 | |
| | (1.634) | |
| Horsepower | −0.068 | −0.024 |
| | (0.010) | (0.015) |
| Num.Obs. | 32 | 32 |
| R2 | 0.602 | |
| R2 Adj. | 0.589 | |
| Std.Errors | | by: cyl |
| FE: cyl | | X |
I highly recommend the tidyverse style guide (even if you don’t use the tidyverse). But here’s the most important part, in my opinion.
Use descriptive object names. df_anes is better than data.
Follow the snake_case convention for object names (including columns). Don’t use capital letters or spaces. df_anes is better than dfAnes.
Use janitor::clean_names() to clean messy variable names:
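As a minimal sketch of what this looks like in practice (the data frame and its column names below are made up for illustration, not taken from a real dataset):

```r
library(janitor)

# Hypothetical data frame with messy column names:
# spaces, camelCase, and dots.
df_messy <- data.frame(
  "Respondent ID" = 1:3,
  "voteChoice" = c("A", "B", "A"),
  "Party.Affiliation" = c("X", "Y", "X"),
  check.names = FALSE  # keep the messy names exactly as written
)

# clean_names() converts every column name to snake_case.
df_clean <- clean_names(df_messy)
names(df_clean)
```

Running this once right after loading the raw data means every later script can rely on consistent snake_case column names.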
Let’s say that we want to multiply all of our variables by 100.
This gets old fast… and it’s prone to error.
dplyr functions can be iterated over columns with across():
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 2100 600 16000 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 2100 600 16000 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 2280 400 10800 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 2140 600 25800 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 1870 800 36000 175 3.15 3.440 17.02 0 0 3 2
Valiant 1810 600 22500 105 2.76 3.460 20.22 1 0 3 1
Duster 360 1430 800 36000 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 2440 400 14670 62 3.69 3.190 20.00 1 0 4 2
Merc 230 2280 400 14080 95 3.92 3.150 22.90 1 0 4 2
Merc 280 1920 600 16760 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 1780 600 16760 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 1640 800 27580 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 1730 800 27580 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 1520 800 27580 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 1040 800 47200 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 1040 800 46000 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 1470 800 44000 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 3240 400 7870 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 3040 400 7570 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 3390 400 7110 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 2150 400 12010 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 1550 800 31800 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 1520 800 30400 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 1330 800 35000 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 1920 800 40000 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 2730 400 7900 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 2600 400 12030 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 3040 400 9510 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 1580 800 35100 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 1970 600 14500 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 1500 800 30100 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 2140 400 12100 109 4.11 2.780 18.60 1 1 4 2
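The original code for this output is not shown here. Judging from the table, where mpg, cyl, and disp are scaled by 100 and the other columns are untouched, it was presumably something along these lines. The sketch below is a reconstruction under that assumption, and the object names are hypothetical. It contrasts the repetitive version with the across() version:

```r
library(dplyr)

# Repetitive version: one line per column, easy to mistype.
mtcars_manual <- mtcars |>
  mutate(
    mpg = mpg * 100,
    cyl = cyl * 100,
    disp = disp * 100
  )

# across() version: the same function applied to a set of columns.
mtcars_across <- mtcars |>
  mutate(across(c(mpg, cyl, disp), \(x) x * 100))

# Both approaches give the same result.
identical(mtcars_manual, mtcars_across)
```

The across() call also accepts tidyselect helpers, such as where(is.numeric), in place of an explicit column list.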
assertthat::assert_that()
Sometimes you perform an operation and then manually check whether it worked. Formalize this check so you don’t have to repeat it every time something else changes!
This is very common in joins. For instance, you might want your output to keep the same number of rows as the x (left) data frame. If it doesn’t, you know something went wrong…
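A minimal sketch of the row-count check for a join, using made-up toy data frames (df_x, df_y, and the other names are hypothetical). It uses see_if() to show the failing case without stopping the script, and assert_that() for the check you would keep in real code:

```r
library(dplyr)
library(assertthat)

# Toy data: id 2 appears twice in df_y, so a left join duplicates that row.
df_x <- tibble(id = 1:3, value = c(10, 20, 30))
df_y <- tibble(id = c(1L, 2L, 2L), label = c("a", "b", "c"))

df_bad <- left_join(df_x, df_y, by = "id")

# see_if() returns FALSE instead of raising an error,
# so we can inspect the failed check.
check_bad <- see_if(nrow(df_bad) == nrow(df_x))

# After removing duplicate ids from df_y, the join keeps the row count.
df_y_unique <- distinct(df_y, id, .keep_all = TRUE)
df_good <- left_join(df_x, df_y_unique, by = "id")

# assert_that() stops the script with an error if the condition is FALSE.
assert_that(nrow(df_good) == nrow(df_x))
```

Placing the assert_that() line directly after the join means the script fails loudly as soon as an upstream change starts duplicating rows.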
Many procedures rely on randomness (MCMC, cross-validation, etc.).
Whenever you run code with a random component, set a seed to ensure reproducibility:
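A minimal sketch with base R's set.seed(). The seed value and object names are arbitrary choices for illustration:

```r
# Setting the same seed before each draw reproduces the same random numbers.
set.seed(2022)
draw_1 <- rnorm(5)

set.seed(2022)
draw_2 <- rnorm(5)

identical(draw_1, draw_2)  # the two draws match exactly

# Without resetting the seed, the next draw continues the random stream.
draw_3 <- rnorm(5)
```

Setting the seed once at the top of a script is usually enough, as long as the script is always run from top to bottom.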
renv to make a package vault
Keeping track of package versions is important for future reproducibility.
Using renv makes the most sense when making a replication package, or sharing an advanced project with colleagues.
Run renv::init() in your project to create a record of all packages used and their versions. This will create a few files and folders in your project (most notably, the renv/ folder and the renv.lock lockfile).
If you ever need to update this record, run renv::snapshot().
When someone else wants to run your code, they start with renv::restore(). This will read the lockfile and install all needed packages in their recorded versions.
targets
When projects grow large, it’s hard to keep track of how everything’s related.
Say that you just caught a bug in one of your data cleaning scripts. What analyses do you need to rerun downstream?
targets builds a dependency graph for your datasets/scripts. If you edit a node anywhere in the flow, targets knows what other nodes become out of sync, and can rerun them automatically.
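As a rough sketch, a pipeline for the example project above could be declared in a _targets.R file at the project root. The helper functions clean_data() and summarise_data() are hypothetical placeholders, not part of the original project, and the steps are only an assumption about how the scripts might map to targets:

```r
# _targets.R (sketch; helper functions below are hypothetical placeholders)
library(targets)

tar_option_set(packages = c("readr", "dplyr"))

# Placeholder cleaning step: drop exact duplicate rows.
clean_data <- function(df) {
  dplyr::distinct(df)
}

# Placeholder analysis step: count the rows that survive cleaning.
summarise_data <- function(df) {
  nrow(df)
}

list(
  # Track the raw file itself, so edits to it invalidate downstream targets.
  tar_target(raw_file, "raw/qog_bas_cs_jan22.csv", format = "file"),
  tar_target(df_raw, readr::read_csv(raw_file)),
  tar_target(df_clean, clean_data(df_raw)),
  tar_target(n_rows, summarise_data(df_clean))
)
```

With a file like this in place, targets::tar_make() runs whatever is out of date, targets::tar_outdated() lists the targets that need rerunning (for example, after editing clean_data()), and targets::tar_visnetwork() draws the dependency graph.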