
Extracting and Transforming Data in Python

To get insights from data, you have to play with it a little.

It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. And in today’s article I will list those techniques with code samples and some explanations. Let’s get started.

For this article I've created a sample DataFrame filled with random numbers to play with. We will use it as the running example throughout the article.

import pandas as pd
import numpy as np

cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data, columns=cols, index=rows)
df.head()

Out[2]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

Indexing DataFrames

To extract data from a pandas DataFrame we can use direct indexing or accessors. We can select the rows and columns we need using their labels:

df['col1']['row1']
Out[3]: 4

Please note the order in this type of indexing: first you specify the column label, then the row. But in truth, such small datasets are rare; in real life we work with much heavier machinery. It is much better to select data using the accessors .loc and .iloc. The difference between them is that .loc accepts labels while .iloc accepts integer positions. Also, with accessors we specify rows first and then columns. Coming from a SQL background, it took me some time in the beginning to get used to that.
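As a quick sanity check, here is a minimal sketch (with its own toy DataFrame, not the article's random one) showing that .loc with labels and .iloc with integer positions address the same cell:

```python
import pandas as pd
import numpy as np

# Toy DataFrame with known values, mirroring the article's row/column naming
df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['col0', 'col1', 'col2'],
    index=['row0', 'row1', 'row2'],
)

# .loc takes labels, .iloc takes integer positions; both hit the same cell
by_label = df.loc['row1', 'col2']
by_position = df.iloc[1, 2]
assert by_label == by_position == 5
```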

So, to select a single value using accessors you’d do the following:

df.loc['row4', 'col2']
Out[4]: 35

df.iloc[4, 2]
Out[5]: 35

Using indexing we can extract a single value, a Series, or a DataFrame from a DataFrame (sorry for the tautology). Above I have shown how to select a single value.

To subselect a few columns, just pass a nested list of their labels and a DataFrame will be returned:

df_new = df[['col1', 'col2']]
df_new.head(3)

Out[6]:
      col1  col2
row0    78    42
row1     4    80
row2    17    80

If you also want to select specific rows, add a row range and you will get a DataFrame again. This technique is called slicing, and it is covered in more detail below.

df_new = df[['col1', 'col2']][1:4]
df_new.head(3)

Out[7]:
      col1  col2
row1     4    80
row2    17    80
row3    68    58

To select a Series, select a single column with all rows or a range of rows. Each line of code below produces the same output:

df['col0']
df.loc[:, 'col0']
df.iloc[:, 0]

Out[8]:
row0    24
row1    40
row2    83
row3    92
row4    78
Name: col0, dtype: int32

The colon means that we want to select all rows or all columns: df.loc[:, :] or df.iloc[:, :] will return all values. And that slowly brings us to slicing, that is, selecting specific ranges from our data. To slice a Series, just add the range of rows you want to select, using their positional indexes:

df['col3'][2:5]

Out[12]:
row2    26
row3    93
row4    70
Name: col3, dtype: int32

And don’t forget about ranging in Python — first element included, second excluded. So the code above will return rows with indexes 5, 6, 7, 8 and 9. And indexes start from 0.

Slicing DataFrames works the same way, with just one nuance: when using .loc (labels), both borders are included. For example, let's select rows from label 'row1' to label 'row4', or from row index 1 to index 4, and all columns:

df.loc['row1':'row4', :]

Out[20]:
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df.iloc[1:4, :]

Out[21]:
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33

The first line of code above selected row1, row2, row3 and row4, while the second selected only row1, row2 and row3. A few more examples below.

Select columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4 and all rows:

df.loc[:, 'col1':'col4']

Out[22]:
      col1  col2  col3  col4
row0    78    42     7    96
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[:, 1:4]

Out[23]:
      col1  col2  col3
row0    78    42     7
row1     4    80    12
row2    17    80    26
row3    68    58    93
row4    63    35    70

Select rows from label ‘row1’ to label ‘row4’ or from row index 1 to index 4 and columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4:

df.loc['row1':'row4', 'col1':'col4']

Out[24]:
      col1  col2  col3  col4
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[1:4, 1:4]

Out[25]:
      col1  col2  col3
row1     4    80    12
row2    17    80    26
row3    68    58    93
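The inclusive-versus-exclusive border rule is easy to verify directly. A minimal sketch (using its own throwaway DataFrame with the article's row labels):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((5, 2)),
                  columns=['col0', 'col1'],
                  index=['row0', 'row1', 'row2', 'row3', 'row4'])

# .loc slices include both endpoints; .iloc follows Python's half-open ranges
assert len(df.loc['row1':'row4']) == 4   # row1, row2, row3, row4
assert len(df.iloc[1:4]) == 3            # positions 1, 2, 3 only
```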

Use a list to select specific columns or rows that are not in a range.

df.loc['row2':'row4', ['col1', 'col3']]

Out[28]:
      col1  col3
row2    17    26
row3    68    93
row4    63    70

df.iloc[[2, 4], 0:4]

Out[30]:
      col0  col1  col2  col3
row2    83    17    80    26
row4    78    63    35    70

Filtering DataFrames

Filtering is a more general tool that selects parts of the data based on properties of the data itself, rather than on indexes or labels. DataFrames have several methods for filtering, and the underlying idea for all of them is a Boolean Series. The expression df['col1'] > 20 (assuming col1 is of integer type) returns a Boolean Series that is True where the condition holds. I will put the output of the .head() method here, so you don't need to scroll up to match the numbers.

Out[2]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

So, to select part of a DataFrame where values of col1 are bigger than 20 we will use the following code:

df[df['col1'] > 20]

# assigning to a variable also works
condition = df['col1'] > 20
df[condition]

Out[31]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row3    92    68    58    93    33
row4    78    63    35    70    95

We can combine those filters using element-wise logical operators (and: &, or: |, not: ~). Notice the use of parentheses around each condition.

df[(df['col1'] > 25) & (df['col3'] < 30)]  # logical and

Out[33]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96

df[(df['col1'] > 25) | (df['col3'] < 30)]  # logical or

Out[34]:
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df[~(df['col1'] > 25)]  # logical not

Out[35]:
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
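Why the special operators and the parentheses? Python's plain `and`/`or` try to collapse a whole Series to a single True/False, which pandas refuses to do. A minimal sketch of both behaviors:

```python
import pandas as pd

s = pd.Series([10, 30, 50])

# Element-wise & combines two Boolean Series value by value
mask = (s > 20) & (s < 40)
assert mask.tolist() == [False, True, False]

# Plain `and` raises, because a multi-element Series has no single truth value
try:
    (s > 20) and (s < 40)
    raised = False
except ValueError:
    raised = True
assert raised
```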

Dealing with 0 and NaN values

Almost every dataset has zero or NaN values, and we definitely want to know where they are. Ours is peculiar in that it has none, so we will modify it a little:

df.iloc[3, 3] = 0
df.iloc[1, 2] = np.nan
df.iloc[4, 0] = np.nan
df['col5'] = 0
df['col6'] = np.nan
df.head()

Out[57]:
      col0  col1  col2  col3  col4  col5  col6
row0  24.0    78  42.0     7    96     0   NaN
row1  40.0     4   NaN    12    84     0   NaN
row2  83.0    17  80.0    26    15     0   NaN
row3  92.0    68  58.0     0    33     0   NaN
row4   NaN    63  35.0    70    95     0   NaN

To select columns that don't have any zero values we can use the .all() method. Note that NaN counts as truthy here, which is why the all-NaN col6 survives the filter:

df.loc[:, df.all()]

Out[43]:
      col0  col1  col2  col4  col6
row0  24.0    78  42.0    96   NaN
row1  40.0     4   NaN    84   NaN
row2  83.0    17  80.0    15   NaN
row3  92.0    68  58.0    33   NaN
row4   NaN    63  35.0    95   NaN

If we want to find the columns that have at least one nonzero (any) value, this will help:

df.loc[:, df.any()]

Out[47]:
      col0  col1  col2  col3  col4
row0  24.0    78  42.0     7    96
row1  40.0     4   NaN    12    84
row2  83.0    17  80.0    26    15
row3  92.0    68  58.0     0    33
row4   NaN    63  35.0    70    95

To select columns with any NaN:

df.loc[:, df.isnull().any()]

Out[48]:
      col0  col2  col6
row0  24.0  42.0   NaN
row1  40.0   NaN   NaN
row2  83.0  80.0   NaN
row3  92.0  58.0   NaN
row4   NaN  35.0   NaN

Select columns without NaNs:

df.loc[:, df.notnull().all()]

Out[49]:
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0

We can drop the rows or columns containing NaNs, but it's a dangerous game: dropping usually isn't a solution. You have to understand your data and deal with such rows wisely. I warned you.

Are you sure you wanna know it? OK.. ?

df.dropna(how='all', axis=1)  # a column is dropped if all of its values are NaN

Out[69]:
      col0  col1  col2  col3  col4  col5
row0  24.0    78  42.0     7    96     0
row1  40.0     4   NaN    12    84     0
row2  83.0    17  80.0    26    15     0
row3  92.0    68  58.0     0    33     0
row4   NaN    63  35.0    70    95     0

df.dropna(how='any', axis=1)  # a column is dropped if any of its values is NaN

Out[71]:
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0

These methods do not modify the original DataFrame, so to continue working with the filtered data you have to assign the result to a new DataFrame or reassign it to the existing one:

df = df.dropna(how='any', axis=1)
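Since dropping is so often the wrong move, a common alternative worth knowing is filling the gaps instead. A minimal sketch using .fillna() with the column mean (the column name and values here are made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col0': [24.0, np.nan, 83.0]})

# Replace NaNs with the column mean instead of discarding the row
filled = df['col0'].fillna(df['col0'].mean())
assert filled.isnull().sum() == 0
assert filled.iloc[1] == (24.0 + 83.0) / 2  # 53.5
```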

The beauty of filtering is that we can actually select or modify values of one column based on another. For example, we can select values from col1 where col2 is greater than 35 and update them by adding 5 to each:

# Select a column based on another
df['col1'][df['col2'] > 35]

Out[74]:
row0    78
row2    17
row3    68
Name: col1, dtype: int32

# Modify a column based on another
df['col1'][df['col2'] > 35] += 5
df['col1']

Out[77]:
row0    83
row1     4
row2    22
row3    73
row4    63
Name: col1, dtype: int32
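One caveat: the chained form df['col1'][mask] += 5 can trigger pandas' SettingWithCopyWarning, because the write may land on a temporary copy. The documented-safe pattern is a single .loc call that selects and assigns in one step. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'col1': [78, 4, 17], 'col2': [42, 80, 30]})

# One .loc call does the row selection and the assignment together,
# so there is no intermediate copy to accidentally write into
df.loc[df['col2'] > 35, 'col1'] += 5
assert df['col1'].tolist() == [83, 9, 17]
```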

And this brings us to the next part –

Transforming DataFrames

Once we have selected or filtered our data, we want to transform it somehow. The best way to do this is with methods built into DataFrames or with NumPy universal functions (ufuncs), which transform an entire column of data element-wise. Examples are pandas' .floordiv() method (from the documentation: 'Integer division of dataframe and other, element-wise') or NumPy's .floor_divide() (doc: 'Return the largest integer smaller or equal to the division of the inputs.').
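A minimal sketch showing that the pandas method and the NumPy ufunc agree element-wise (the column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [7, 8], 'b': [15, 33]})

# Both perform element-wise integer division of the whole frame
pandas_way = df.floordiv(2)
numpy_way = np.floor_divide(df, 2)
assert pandas_way.equals(numpy_way)
assert pandas_way['b'].tolist() == [7, 16]
```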

If those functions were not available we could write our own and use it with .apply() method.

def some_func(x):
    return x * 2

df.apply(some_func)  # updates each entry of the DataFrame without any loops

# a lambda also works
df.apply(lambda n: n * 2)  # the same

These functions do not modify the DataFrame in place; they return a new object, so we have to store the result explicitly:

df['new_col'] = df['col4'].apply(lambda n: n * 2)
df.head()

Out[82]:
      col0  col1  col2  col3  col4  col5  col6  new_col
row0  24.0    83  42.0     7    96     0   NaN      192
row1  40.0     4   NaN    12    84     0   NaN      168
row2  83.0    22  80.0    26    15     0   NaN       30
row3  92.0    73  58.0     0    33     0   NaN       66
row4   NaN    63  35.0    70    95     0   NaN      190

If the index consists of strings, it has a .str accessor that lets us modify the entire index at once:

df.index.str.upper()

Out[83]: Index(['ROW0', 'ROW1', 'ROW2', 'ROW3', 'ROW4'], dtype='object')

Also, we cannot use the .apply() method on an index; the alternative is .map():

df.index = df.index.map(str.lower)
df.index

Out[85]: Index(['row0', 'row1', 'row2', 'row3', 'row4'], dtype='object')

But .map() can be used on columns as well. For example:

# Create the dictionary: red_vs_blue
red_vs_blue = {0: 'blue', 12: 'red'}

# Use the dictionary to map the 'col3' column to the new column df['color']
df['color'] = df['col3'].map(red_vs_blue)
df.head()

Out[92]:
      col0  col1  col2  col3  col4  col5  col6  new_col color
row0  24.0    83  42.0     7    96     0   NaN      192   NaN
row1  40.0     4   NaN    12    84     0   NaN      168   red
row2  83.0    22  80.0    26    15     0   NaN       30   NaN
row3  92.0    73  58.0     0    33     0   NaN       66  blue
row4   NaN    63  35.0    70    95     0   NaN      190   NaN

Arithmetic operations on Series and DataFrames work directly. The expression below creates a new column where each value with index n is the sum of the values with index n from 'col3' and 'col4'.

df['col7'] = df['col3'] + df['col4']
df.head()

Out[94]:
      col0  col1  col2  col3  col4  col5  col6  new_col color  col7
row0  24.0    83  42.0     7    96     0   NaN      192   NaN   103
row1  40.0     4   NaN    12    84     0   NaN      168   red    96
row2  83.0    22  80.0    26    15     0   NaN       30   NaN    41
row3  92.0    73  58.0     0    33     0   NaN       66  blue    33
row4   NaN    63  35.0    70    95     0   NaN      190   NaN   165

This is the second version of the article, because the first one was a complete mess: errors in the code, no examples, and a few other things. Thanks to the feedback, I went through the article one more time, and I think it looks much better now. I have covered the basics of extracting and transforming data in Python with code snippets and examples, and hopefully it will be useful for people who are just starting their path in this field.

Meanwhile, love data science and smile more. We have to stay positive, since we have the sexiest job of the 21st century ?