
Building a Repeatable Data Analysis Process with Jupyter Notebooks

Notebook Structure

Once I create a notebook, I try to follow a consistent process for documenting it. The key point to keep in mind is that this header is the first thing you will see when you are trying to figure out how the notebook was used. Trust me, future you will be eternally grateful if you take the time to put some of these comments in the notebook!

Here’s an image of the top of an example notebook:

Notebook header

Here are the points I always try to include:

  • A good name for the notebook (as described above)
  • A summary header that describes the project
  • A free-form description of the business reason for this notebook. I like to include names, dates and snippets of emails to make sure I remember the context.
  • A list of people/systems where the data originated.
  • A simple change log. I find it helpful to record when I started and any major changes along the way. I do not update it with every single change, but having some date history is very beneficial.

I tend to include similar imports in most of my notebooks:

import pandas as pd
from pathlib import Path
from datetime import datetime

Then I define all of my input and output file paths and directories. Doing this in one place at the top of the file is very useful. The other key practice is making all file path references relative to the notebook directory; by using Path.cwd(), I can move the notebook directory around and the paths will still work.

I also like to include date and time stamps in the file names. F-strings plus pathlib make this simple:

today = datetime.today()
sales_file = Path.cwd() / "data" / "raw" / "Sales-History.csv"
pipeline_file = Path.cwd() / "data" / "raw" / "pipeline_data.xlsx"
summary_file = Path.cwd() / "data" / "processed" / f"summary_{today:%b-%d-%Y}.pkl"

If you are not familiar with the Path object, my previous article might be useful.

The other important item to keep in mind is that raw files should NEVER be modified.

The next section of most of my notebooks cleans up column names. My most common steps are:

  • Remove leading and trailing spaces in column names
  • Align on a naming convention (snake_case, CamelCase, etc.) and stick with it
  • When renaming columns, do not include dashes or spaces in names
  • Use a rename dictionary to put all the renaming options in one place
  • Align on a single name for the same value. Account Num, Num and Account ID might all refer to the same thing; pick one name and use it everywhere.
  • Abbreviations may be OK, but make sure they are consistent (for example, always use num instead of number)
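The steps above can be sketched in a few lines of pandas. The column names here are hypothetical, but they show the typical problems: stray spaces, inconsistent casing and a dash.

```python
import pandas as pd

# Hypothetical raw column names with the usual problems
df = pd.DataFrame(columns=[' Account Num', 'Total Sales ', 'Report-Date'])

# Remove leading and trailing spaces in column names
df.columns = df.columns.str.strip()

# Keep every rename in a single dictionary so the mapping lives in one place
col_renames = {
    'Account Num': 'account_num',
    'Total Sales': 'total_sales',
    'Report-Date': 'report_date',  # no dashes or spaces in the new names
}
df = df.rename(columns=col_renames)

print(list(df.columns))
```

Because the rename dictionary sits in one spot, adding or fixing a column name later is a one-line change.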

After cleaning up the columns, I make sure all the data is in the type I expect/need. This previous article on data types should be helpful:

  • If you need a date column, make sure it is stored as one.
  • Numbers should be int or float, not object
  • Categorical types can be used at your discretion
  • If it is a Yes/No, True/False or 1/0 field, make sure it is stored as a boolean
  • Some data like US zip codes or customer numbers might come in with a leading 0. If you need to preserve the leading 0, then use an object type.
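A minimal sketch of those conversions, using made-up column names and values:

```python
import pandas as pd

# Hypothetical raw data: everything arrives as strings (object dtype)
df = pd.DataFrame({
    'order_date': ['2023-01-05', '2023-02-10'],
    'amount': ['100.5', '200.0'],
    'active': ['Yes', 'No'],
    'zip_code': ['01234', '90210'],
})

df['order_date'] = pd.to_datetime(df['order_date'])          # dates stored as dates
df['amount'] = df['amount'].astype('float')                  # numbers as float, not object
df['active'] = df['active'].map({'Yes': True, 'No': False})  # Yes/No -> boolean
# zip_code is left as object (string) so the leading 0 is preserved
```

Converting `zip_code` to a number here would silently turn `01234` into `1234`, which is why it stays as an object type.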

Once the column names are cleaned up and the data types are correct, I manipulate the data to get it into the format I need for further analysis.

Here are a few other guidelines to keep in mind:

  • If you find a particularly tricky piece of code that you want to include, be sure to keep a link in the notebook to where you found it.

  • When saving files to Excel, I like to create an ExcelWriter object so I can easily save multiple sheets to the output file. Here is what it looks like:

    with pd.ExcelWriter(report_file, engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='Report')