How can actuaries ensure that workflows are efficient, minimise the risk of errors and allow complex work to be reproduced by others in their organisation?
Actuaries exploring the use of data science have the opportunity to revisit existing ways of working and consider whether they remain appropriate. These challenges are also being faced in science and by other professions.
This presentation looks at why the concept of reproducible work is key and how it can help address the challenges of working in data-intensive fields.
The presentation can be downloaded here.
This work was originally presented at the Data Science: Opportunities for Actuaries virtual event in February 2019.
In the following exercises you will set up a simple reproducible workflow using some of the tools introduced in the presentation.
As a toy example to demonstrate the approach, the analysis takes some cashflow data, projects it and calculates the present value. Automatic checks are set up and a report is produced.
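To make the shape of the toy analysis concrete, here is a minimal sketch in base R. It is not the code used in the exercises: the cashflow figures, the growth and discount rates, and the function names `project_cashflows` and `present_value` are all illustrative assumptions.

```r
# Hypothetical toy inputs: figures and names are assumptions for illustration
cashflows <- data.frame(
  year   = 1:5,
  amount = c(100, 100, 100, 100, 100)
)
growth_rate   <- 0.02  # assumed annual growth used to project the cashflows
discount_rate <- 0.03  # assumed flat annual discount rate

# Project: grow each cashflow from time 0 to its payment year
project_cashflows <- function(amount, year, growth) {
  amount * (1 + growth)^year
}

# Present value: discount each cashflow back to time 0 and sum
present_value <- function(amount, year, rate) {
  sum(amount / (1 + rate)^year)
}

cashflows$projected <- project_cashflows(
  cashflows$amount, cashflows$year, growth_rate
)
pv <- present_value(cashflows$projected, cashflows$year, discount_rate)

# Simple automatic checks: the script stops with an error if any fails
stopifnot(
  # positive growth never reduces a cashflow
  all(cashflows$projected >= cashflows$amount),
  # positive discounting reduces the total
  pv < sum(cashflows$projected),
  # a zero discount rate leaves the undiscounted total unchanged
  isTRUE(all.equal(
    present_value(cashflows$amount, cashflows$year, 0),
    sum(cashflows$amount)
  ))
)

print(pv)
```

Keeping the projection and discounting steps in small functions makes each one easy to check on its own, which is the idea behind the automated tests introduced later in the exercises.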
The exercises use the R statistical programming language and RStudio (a popular development environment for R).
R and RStudio are free open source tools available for Windows, Mac OS X and Linux. R can be downloaded here and the open source edition of RStudio Desktop can be downloaded here.
Execute code by running a script or by entering it in the console.
To create a script select File / New File / R Script. To run the code in a script click Source at the top right of the script window. Alternatively you can highlight a subsection of your code and click Run.
Output is displayed in the console or the plots pane. Use the tabs at the top of the plots pane to access Files, Packages and Help.
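Running a script can also be done from the console with base R's `source()` function. The short sketch below writes a two-line script to a temporary file and runs it. The script contents and the variable names `x` and `total` are made up for illustration.

```r
# Write a small script to a temporary file (contents are hypothetical)
script_path <- tempfile(fileext = ".R")
writeLines(
  c(
    "x <- c(100, 200, 300)",
    "total <- sum(x)"
  ),
  script_path
)

# source() evaluates every line of the file in order
source(script_path)

# Objects created by the script are now available in the session
print(total)
```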
Using Git for version control
A version control system is a key part of a reproducible workflow.
Version control is like having an unlimited “undo” button. It removes the need to manage multiple copies of the same file by hand and allows many people to work in parallel on a project. Most software developers and data scientists use version control for their work.
If you want to use version control alongside these exercises, this guide explains how to set up and use Git with GitHub.
If you are new to R or RStudio you can find online learning resources here. In particular we recommend R for Data Science by Hadley Wickham and Garrett Grolemund.
- Exercise 1: Use ProjectTemplate to structure a data analysis project in R
- Exercise 2: Import and pre-process your data
- Exercise 3: Helper functions and R libraries
- Exercise 4: Forecast the cashflows and calculate the present value
- Exercise 5: Create a report using R Markdown
- Exercise 6: Test your results using testthat
- Exercise 7: Updating the analysis and report
- Next steps
You can download the final project structure from the accompanying GitHub repository.