
Building a Repeatable Data Analysis Process with Jupyter Notebooks

Notebook Structure

Once I create a notebook, I try to follow a consistent process for documenting it. The key point to keep in mind is that this header is the first thing you will see when you are trying to figure out how the notebook was used. Trust me, future you will be eternally grateful if you take the time to put some of these comments in the notebook!

Here’s an image of the top of an example notebook:

Notebook header

Here are the points I always try to include:

  • A good name for the notebook (as described above)
  • A summary header that describes the project
  • A free-form description of the business reason for this notebook. I like to include names, dates and snippets of emails to make sure I remember the context.
  • A list of people/systems where the data originated.
  • A simple change log. I find it helpful to record when I started and any major changes along the way. I do not update it with every single change, but having some date history is very beneficial.

I tend to include similar imports in most of my notebooks:

import pandas as pd
from pathlib import Path
from datetime import datetime

Then I define all of my input and output file paths and directories. Doing this in one place at the top of the file is very useful. The other key practice is making all file path references relative to the notebook directory; by using Path.cwd(), I can move the notebook directory around and the paths will still work.

I also like to include date and time stamps in the file names. F-strings plus pathlib make this simple:

today = datetime.today()
sales_file = Path.cwd() / "data" / "raw" / "Sales-History.csv"
pipeline_file = Path.cwd() / "data" / "raw" / "pipeline_data.xlsx"
summary_file = Path.cwd() / "data" / "processed" / f"summary_{today:%b-%d-%Y}.pkl"

If you are not familiar with the Path object, my previous article might be useful.

The other important item to keep in mind is that raw files should NEVER be modified.

The next section of most of my notebooks cleans up column names. My most common steps are:

  • Remove leading and trailing spaces in column names
  • Align on a naming convention (snake_case, CamelCase, etc.) and stick with it
  • When renaming columns, do not include dashes or spaces in names
  • Use a rename dictionary to put all the renaming options in one place
  • Align on a single name for the same value. Account Num, Num and Account ID might all refer to the same thing; pick one name and use it everywhere.
  • Abbreviations may be OK, but make sure they are consistent (for example, always use num instead of number)
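The steps above can be sketched in a few lines of pandas. The column names here are hypothetical, but they show the typical problems: stray spaces, inconsistent casing and a dash.

```python
import pandas as pd

# Hypothetical raw column names with the usual problems
df = pd.DataFrame(columns=[' Account Num', 'Total Sales ', 'Report-Date'])

# Remove leading and trailing spaces in column names
df.columns = df.columns.str.strip()

# Keep every rename in a single dictionary so the mapping lives in one place
col_renames = {
    'Account Num': 'account_num',
    'Total Sales': 'total_sales',
    'Report-Date': 'report_date',  # no dashes or spaces in the new names
}
df = df.rename(columns=col_renames)

print(list(df.columns))
```

Because the rename dictionary sits in one spot, adding or fixing a column name later is a one-line change.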

After cleaning up the columns, I make sure all the data is in the type I expect/need. This previous article on data types should be helpful:

  • If you need a date column, make sure it is stored as one.
  • Numbers should be int or float, not object
  • Categorical types can be used at your discretion
  • If it is a Yes/No, True/False or 1/0 field, make sure it is stored as a boolean
  • Some data like US zip codes or customer numbers might come in with a leading 0. If you need to preserve the leading 0, then use an object type.
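A minimal sketch of those conversions, using made-up column names and values:

```python
import pandas as pd

# Hypothetical raw data: everything arrives as strings (object dtype)
df = pd.DataFrame({
    'order_date': ['2023-01-05', '2023-02-10'],
    'amount': ['100.5', '200.0'],
    'active': ['Yes', 'No'],
    'zip_code': ['01234', '90210'],
})

df['order_date'] = pd.to_datetime(df['order_date'])          # dates stored as dates
df['amount'] = df['amount'].astype('float')                  # numbers as float, not object
df['active'] = df['active'].map({'Yes': True, 'No': False})  # Yes/No -> boolean
# zip_code is left as object (string) so the leading 0 is preserved
```

Converting `zip_code` to a number here would silently turn `01234` into `1234`, which is why it stays as an object type.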

Once the column names are cleaned up and the data types are correct, I manipulate the data to get it into the format I need for further analysis.

Here are a few other guidelines to keep in mind:

  • If you find a particularly tricky piece of code that you want to include, be sure to keep a link in the notebook to where you found it.

  • When saving files to Excel, I like to create an ExcelWriter object so I can easily save multiple sheets to the output file. Here is what it looks like:

    with pd.ExcelWriter(report_file, engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='Report')