“Faster, Higher, Stronger” — ML on Olympics

Data

The dataset has 15 attributes and roughly 271k samples, which makes it the largest dataset I will be studying on Medium. The main purpose of this study is some gold digging: using the available information about each athlete to predict whether they win a medal or not.

Attributes:
ID     -> Each athlete's unique number
Name   -> 134732 unique values
Sex
Age
Height
Weight
Team
NOC    -> National Olympic Committee
Games  -> Year and season
Year   -> 1896 - 2016
Season -> Summer or Winter
City
Sport
Event
Medal  -> Gold, Silver, Bronze, or NA
[Figure: missing values chart]

Since the dataset has 271k rows, it is worth checking for missing values before doing anything else.

print(df.isnull().sum())

The output shows that the Age, Height and Weight columns have missing values. These columns matter a great deal for getting accurate results, so they cannot simply be dropped; the missing entries need to be filled in instead.
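Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame; the `toy` data and the `summary` layout are illustrative assumptions, not the real Olympics data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 31.0, 22.0],
    "Height": [180.0, np.nan, np.nan, 165.0],
    "Sport": ["Judo", "Rowing", "Judo", "Sailing"],
})

# Count of missing entries per column.
missing_count = toy.isnull().sum()

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column.
missing_share = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```

On the real data, the same two lines applied to df would show how large a share of Age, Height and Weight has to be filled.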

Preprocessing

First, the data is loaded into a pandas DataFrame:

import pandas as pd

olympics_csv = pd.read_csv('athlete_events.csv')
df = pd.DataFrame(olympics_csv)

To deal with the missing values, Age, Weight and Height need to be filled appropriately. In addition, the categorical columns need to be converted to numbers: Name, Sex, Team, NOC, Games, Season, City, Sport and Event.

Medal values are converted to numbers with pandas' groupby and ngroup:

df['Medal']  = df.groupby(['Medal']).ngroup()

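ngroup numbers the groups in sorted key order, so the resulting codes are:

  • Bronze -> 0
  • Gold -> 1
  • Silver -> 2
  • No medal (NA) -> -1 in older pandas releases; newer releases may leave these rows as NaN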
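Because ngroup assigns numbers by sorted key order, and the handling of rows with a missing Medal can differ between pandas versions, the codes it produces follow the alphabet rather than the medal ranking. If a fixed coding such as Gold 1, Silver 2, Bronze 3 and -1 for no medal is wanted, an explicit mapping is one option. This is a minimal sketch on made-up rows; the `medal_codes` dictionary and the `toy` frame are illustrative assumptions, not part of the original study:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Medal column (NaN means the athlete won no medal).
toy = pd.DataFrame({"Medal": ["Gold", np.nan, "Bronze", "Silver", np.nan]})

# Assumed coding, chosen explicitly instead of relying on group order.
medal_codes = {"Gold": 1, "Silver": 2, "Bronze": 3}

# map() leaves NaN untouched, fillna(-1) marks "no medal",
# and astype(int) turns the float column back into integers.
toy["Medal"] = toy["Medal"].map(medal_codes).fillna(-1).astype(int)

codes = toy["Medal"].tolist()
print(codes)
```

An explicit dictionary keeps the coding stable even if new medal labels appear or the pandas version changes.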

Missing Weight, Height and Age values are replaced with the mean of each column, truncated to an integer:

df['Weight'] = df['Weight'].fillna(df['Weight'].mean().astype(int))
df['Height'] = df['Height'].fillna(df['Height'].mean().astype(int))
df['Age'] = df['Age'].fillna(df['Age'].mean().astype(int))
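To see what this kind of fill does, it can help to run it on a few rows where the result is easy to check by hand. The sketch below uses a made-up `toy` frame (an assumption for illustration) and a loop instead of three separate statements:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 31.0],
    "Height": [180.0, 170.0, np.nan],
    "Weight": [np.nan, 60.0, 75.0],
})

for col in ["Age", "Height", "Weight"]:
    # Series.mean() skips NaN by default; int() drops the fractional part.
    fill_value = int(toy[col].mean())
    toy[col] = toy[col].fillna(fill_value)

# Total number of missing entries left after the fill.
remaining = int(toy.isnull().sum().sum())
print(toy)
print("missing after fill:", remaining)
```

Note that truncating the mean rounds down, so a column mean of 27.5 is filled in as 27.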

The remaining categorical columns are numbered in the same way:

df['Name']   = df.groupby(['Name']).ngroup()
df['Sex']    = df.groupby(['Sex']).ngroup()
df['Team']   = df.groupby(['Team']).ngroup()
df['NOC']    = df.groupby(['NOC']).ngroup()
df['Games']  = df.groupby(['Games']).ngroup()
df['Season'] = df.groupby(['Season']).ngroup()
df['City']   = df.groupby(['City']).ngroup()
df['Sport']  = df.groupby(['Sport']).ngroup()
df['Event']  = df.groupby(['Event']).ngroup()
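The same numbering can be written as a loop, which also makes it easy to inspect what ngroup produces. A minimal sketch on a made-up `toy` frame (an assumption for illustration) with three of the columns:

```python
import pandas as pd

# Toy stand-in with a few categorical columns (values are illustrative).
toy = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "Season": ["Summer", "Winter", "Summer"],
    "City": ["Rio de Janeiro", "Sochi", "London"],
})

for col in ["Sex", "Season", "City"]:
    # ngroup() numbers each distinct value from 0 upward,
    # following the sorted order of the values.
    toy[col] = toy.groupby(col).ngroup()

print(toy)
```

Here "F" becomes 0 and "M" becomes 1 because the values are numbered in sorted order. The numbers only identify categories; their size carries no meaning of its own.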

After these operations, all columns are filled and numbered except the Age column: