Run your data projects like a professional kitchen
2 tools that Data Scientists use to ensure reproducibility
Anyone working on a project using a scripting language like Python has probably experienced a situation where they have to run someone’s "well-documented" code, only to discover that it doesn’t work. Maybe a package is missing, or a data file doesn’t exist. Or worse yet, you come back to your own code after several months and are unable to replicate your past results.
Reproducibility is often an afterthought when we are in the thick of our data-related work. However, neglecting it can hurt us later if we need to share our work more widely, such as when submitting it to an academic journal or sharing it publicly so that others may contribute.
It’s like creating a beautiful meal but not properly writing down the recipe. A reproducible project also makes it easier for us to delegate the task to someone else.
Today, I will outline two tools that you can use for any new Python project to ensure reproducibility. These tools include:
Project isolation with virtual environments
Code versioning and hosting with GitHub
This framework is used in the data science industry to ensure that teams can easily collaborate to build data-driven products and models.
A step-by-step guide using VS Code is provided below so that you can apply these principles to your project right away.
The kitchen-related analogies are about to come thick and fast. Let’s get started!
Autonomous Econ is a newsletter that empowers analysts who work with economic data by equipping them with Python and data science skills. The content is designed to boost their productivity through automation and transform them into savvier analysts.
Posts will roughly be split between practical guides (80 percent) and data journalism pieces where I demonstrate the tools (20 percent).
Isolate your projects as work stations
Imagine asking a friend to come into your kitchen and telling them to make a particular dish and plate it in a specific way. Sure, they might figure it out eventually, but they will probably waste a lot of time searching for the right ingredients and tools.
A more efficient approach would be to do what they do in professional kitchens and use workstations for a particular dish. At each station, we have the ingredients, tools, and crockery that are compatible with each other and required for that one meal. If you have watched the TV series The Bear, then you know how vital workstations are for an optimally functioning kitchen.
Virtual environments are essentially minimalist workstations for each project (dish) that contain only the packages needed for that specific project. For example, if you have a project to build a Streamlit dashboard, you might only need to install the Pandas and Streamlit packages.
These environments also mean that if one project uses Pandas 2.0 and another uses Pandas 2.2, each project keeps its own copy, so you never have to change the package in your global environment. Virtual environments let you switch between projects seamlessly.
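To make the idea concrete, here is a minimal sketch that creates a virtual environment with Python's built-in venv module. The folder name demo_project and the helper make_env are made up for this illustration, and with_pip=False is used only to keep the demo quick; a real project would normally want pip inside the environment.

```python
import tempfile
import venv
from pathlib import Path


def make_env(project_dir: Path) -> Path:
    """Create a .venv folder inside project_dir and return its path."""
    env_dir = project_dir / ".venv"
    # with_pip=False skips bootstrapping pip to keep this demo quick;
    # a real project would normally pass with_pip=True.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Hypothetical project folder, created in a temporary location for the demo.
project = Path(tempfile.mkdtemp()) / "demo_project"
project.mkdir()
env_path = make_env(project)

# pyvenv.cfg is the marker file that identifies a folder as a virtual environment.
print((env_path / "pyvenv.cfg").exists())
```

Each project folder gets its own .venv, which is why two projects can hold different versions of the same package without interfering with each other.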
Now, if someone wants to recreate your project, they can set up the same virtual environment on their computer according to your project requirements. To do this, they first need to access your project via GitHub.
GitHub: your public recipe library
GitHub is like a public library of recipe books (projects / code repositories) that anyone can access online. For example, The Economist has a public repository for their Big Mac Index.
It will typically include a Readme with general instructions and a requirements.txt file that specifies the packages and associated versions that you need.
GitHub's main purpose is to facilitate team collaboration on projects. Code is stored in the cloud, allowing anyone with access to download it, suggest changes, and contribute with approval from other contributors. Storing code in the cloud also ensures you don't have to worry about losing local files on your computer.
All the changes you make to your code are also tracked in GitHub. Every time you save a change, known as a "commit," it is included in the commit history of the repository. Each commit also has an attached commit message, which helps you and other contributors understand the reason for each change.
Since every commit is tracked, you can revert your project to a point when things were working if something breaks.
GitHub has some other neat features; for example, you can showcase your project as a website using GitHub Pages. I'll save this for a future post.
A step-by-step guide for any new project
Next, I will show you how to create a repo in GitHub from scratch and then set up a Python virtual environment on your computer. [1] By the end of it, you should be able to recreate this interactive time series plot of the S&P 500. If you’re interested in the Python API I used to retrieve the data, then check out my LinkedIn post.
Prerequisites:
Create a GitHub account if you don’t have one.
Set up VS Code as your Integrated Development Environment (IDE) and download Python—see my previous post on how to do this. You don’t necessarily have to use VS Code, but I believe it is the most user-friendly. Other IDEs should have a similar workflow.
Download Git for your computer.
(1) Create a GitHub repo
Navigate to the ‘Repositories’ tab and click on ‘New’. You should see a page like the one below. The key consideration here is whether to make your repo public or private. Select Private for now if you are only testing the workflow.
(2) Clone repo
Once you have created the repo, click on the ‘Code’ button and copy the HTTPS link.
Launch VS Code and press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (macOS) to open the Command Palette. Type Git: Clone and select it. Paste the repository URL you copied before. Finally, choose a directory on your computer to clone the repository into.
Note that you can do this for any other repository on GitHub, not just ones that you create.
(3) Create virtual env
Open the Command Palette again, type Python: Create Environment, and select it. Choose Venv as the environment type, and when prompted to select the Python interpreter, choose the recommended global one.
You should now see a .venv folder created in your directory on the left-hand side. Click on Terminal at the top of the window, then click New Terminal. You should now see (.venv) at the start of the terminal prompt, indicating that your environment has been activated. Everything you run in the terminal will now only impact your virtual environment.
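If you want to double-check that the environment is really the one in use, a small sketch like the one below can help. It relies on the fact that inside a virtual environment sys.prefix differs from sys.base_prefix; the function name in_virtual_env is just an illustrative choice.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the global Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_env())
```

When run from the activated terminal, the interpreter path should sit inside your project's .venv folder.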
(4) Install packages
In the terminal window, run the command below to install the following packages.
pip install pandas==2.2.2 plotly==5.23.0 nbformat==5.10.4
You can store the list of installed packages and their versions in a requirements.txt file:
pip freeze > requirements.txt
This allows others who clone your repo to install the same packages directly from the requirements.txt file:
pip install -r requirements.txt
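As a rough way to confirm that an environment matches the pinned versions, the sketch below parses name==version lines and compares them with what is installed, using the standard library's importlib.metadata. The helper names parse_pins and find_mismatches are hypothetical, and the parser only handles simple == pins, not the full requirements syntax.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Turn 'name==version' lines into a {name: version} dictionary."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything not pinned with '=='.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# The three pins from the install command above, written as a requirements file.
example = """
pandas==2.2.2
plotly==5.23.0
nbformat==5.10.4
"""
pins = parse_pins(example)
print(pins)
print(find_mismatches(pins))
```

An empty dictionary from find_mismatches means the environment matches the pins; anything else shows the pinned version next to the installed one (or None if the package is missing).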
At this point, you can create a new Python file called visualize.py and copy+paste the code snippet at the bottom of the post. Hit the big play button in the top right and you should see the plot.
(5) Commit and push your changes to GitHub
Once you have verified that the code runs, it is a good time to commit the change and then push it back to the repository in the cloud. It is best practice to commit changes regularly after verifying that each change works as expected.
Click on the ‘+’ button next to the visualize.py file to stage the change. [2]
Write a commit message and hit the commit button.
Click on Sync Changes to send the changes to the GitHub repository.
You should see your commit in the repository on GitHub.
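For readers curious about what these buttons do behind the scenes, here is a hedged sketch of the equivalent Git commands, wrapped in a small Python helper. The function commit_and_push and the commit message are made up for illustration, and the default dry_run=True only returns the commands without running them. Note that Sync Changes in VS Code also pulls before pushing, whereas this sketch only pushes.

```python
import subprocess


def commit_and_push(paths, message, dry_run=True):
    """Stage paths, commit with message, and push to the remote.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [
        ["git", "add", *paths],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]
    if not dry_run:
        for command in commands:
            # check=True stops at the first command that fails.
            subprocess.run(command, check=True)
    return commands


# Dry run: print the commands without touching any repository.
for command in commit_and_push(["visualize.py"], "Add S&P 500 plot"):
    print(" ".join(command))
```

Passing dry_run=False from inside the cloned repository folder would actually run the three commands, assuming Git is installed and you have push access.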
Use the above workflow for your next project, and let me know how it goes in the comments section below. This process should eventually become second nature and will ensure you have a well-tracked and reproducible project every time.
(6) Code snippet for S&P 500 plot
import pandas as pd
import plotly.graph_objs as go

# URL of the raw CSV file from GitHub
url = 'https://raw.githubusercontent.com/martingeew/finance_dashboard_demo/main/sp500_20240807.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(url)


def plot_daily(df):
    fig = go.Figure()

    fig.add_trace(
        go.Scatter(
            x=df.index,
            y=df["close"],
            mode="lines",
            name="S&P 500 Close",
            hovertemplate="%{x|%Y-%m-%d}<br>Close: %{y:.2f}<extra></extra>",
        )
    )

    fig.update_layout(
        title="Daily S&P 500 Closing value",
        xaxis_title="",
        yaxis_title="",
        template="plotly_dark",  # Use the dark mode template
    )

    fig.show()


plot_daily(df)
[1] All the following steps can also be done via commands in the terminal window, but for newcomers, it is simpler to use the interface of VS Code.
[2] Make sure the .venv folder is not listed here and is ignored by Git. If it is listed, add a .gitignore file inside the .venv folder and put a single * in it.
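If you prefer to create that ignore file from Python rather than by hand, a minimal sketch could look like this. The helper ignore_folder is hypothetical, and the demo writes into a temporary directory so it does not touch a real project.

```python
import tempfile
from pathlib import Path


def ignore_folder(folder: Path) -> Path:
    """Write a .gitignore containing '*' so Git skips everything in folder."""
    folder.mkdir(parents=True, exist_ok=True)
    gitignore = folder / ".gitignore"
    gitignore.write_text("*\n")
    return gitignore


# Demo in a temporary directory; in a real project, pass Path(".venv") instead.
demo_env = Path(tempfile.mkdtemp()) / ".venv"
created = ignore_folder(demo_env)
print(created.read_text())
```

The single * pattern matches every file in the folder, including the .gitignore itself, so nothing from the environment ends up in your commits.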