“Faster, Higher, Stronger” — ML on Olympics
Data
The dataset has 15 attributes and about 271k samples, which makes it the largest dataset I will be studying on Medium. The main purpose of this study is some gold digging: using the given information to examine whether athletes win a medal or not.
Attributes:
- ID -> Each athlete's unique number
- Name -> 134732 unique values
- Sex
- Age
- Height
- Weight
- Team
- NOC -> National Olympic Committee
- Games -> Year and season
- Year -> 1896 - 2016
- Season -> Summer or Winter
- City
- Sport
- Event
- Medal -> Gold, Silver, Bronze, or NA
Since the dataset has 271k rows, it is worth checking for missing values first.
print(df.isnull().sum())
The output shows that the Age, Height and Weight columns have missing values. These columns are very important for getting accurate results, so they cannot simply be dropped; their missing values need to be filled in instead.
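To make the check concrete, here is a minimal sketch on a tiny made-up table (the `toy` DataFrame and its values are my own illustration, not rows from the real dataset). It shows the same `isnull().sum()` count, plus the share of missing values per column, which can be easier to read on a large dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Olympics data (values are made up for illustration).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [24.0, np.nan, 31.0, 28.0],
    "Height": [180.0, 175.0, np.nan, np.nan],
    "Weight": [80.0, np.nan, np.nan, np.nan],
})

# Number of missing values per column, same call as in the article.
missing_counts = toy.isnull().sum()

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

print(missing_counts)
print(missing_pct)
```

On the real data, the same two lines would run on `df` instead of `toy`.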
Preprocessing
Firstly, the data is loaded into a pandas DataFrame:
import pandas as pd
olympics_csv = pd.read_csv('athlete_events.csv')
df = pd.DataFrame(olympics_csv)
To handle the missing values, Age, Weight and Height should be filled appropriately. Also, many columns need to be converted to numbers: Name, Sex, Team, NOC, Games, Season, City, Sport and Event.
Medal values are numbered using pandas' groupby features:
df['Medal'] = df.groupby(['Medal']).ngroup()
ngroup() numbers the groups in the sorted order of their labels (Bronze, Gold, Silver), so the codes are:
- Gold -> 1
- Silver -> 2
- Bronze -> 0
- No medal -> -1
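If you prefer not to depend on how `ngroup()` orders the labels, the medal codes can also be written out explicitly. The sketch below is an alternative, not what the article's code does: the `medal_codes` dictionary and the `MedalCode` column are hypothetical names, and the codes in it (Bronze as 0, following sorted label order) are my assumption. How `ngroup()` labels rows with a missing key may also differ between pandas versions, so the missing case is handled by hand here.

```python
import numpy as np
import pandas as pd

# Toy Medal column (made-up rows); NaN means the athlete won no medal.
toy = pd.DataFrame({"Medal": ["Gold", np.nan, "Bronze", "Silver", np.nan]})

# Hypothetical explicit mapping; the codes are an assumption for illustration.
medal_codes = {"Bronze": 0, "Gold": 1, "Silver": 2}

# Values not found in the mapping (here: NaN) become NaN, then are set to -1.
toy["MedalCode"] = toy["Medal"].map(medal_codes).fillna(-1).astype(int)

print(toy)
```

The advantage of an explicit dictionary is that the codes stay the same even if a medal type is absent from a subset of the data.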
Missing Weight, Age and Height values are replaced with the mean of each column, truncated to an integer:
df['Weight'] = df['Weight'].fillna(df['Weight'].mean().astype(int))
df['Height'] = df['Height'].fillna(df['Height'].mean().astype(int))
df['Age'] = df['Age'].fillna(df['Age'].mean().astype(int))
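The three fills follow the same pattern, so they can be sketched as one loop. The `toy` table below is made up for illustration, and `int(...)` is used in place of `.astype(int)`; both truncate the mean toward zero.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value per column (made-up numbers).
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 31.0],
    "Height": [170.0, 181.0, np.nan],
    "Weight": [np.nan, 70.0, 75.0],
})

for col in ["Age", "Height", "Weight"]:
    # mean() skips NaN by default; int() truncates the result.
    fill_value = int(toy[col].mean())
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Keep in mind that a single global mean treats every athlete alike, whatever their sport or sex, which is a simplification.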
For the other columns:
df['Name'] = df.groupby(['Name']).ngroup()
df['Sex'] = df.groupby(['Sex']).ngroup()
df['Team'] = df.groupby(['Team']).ngroup()
df['NOC'] = df.groupby(['NOC']).ngroup()
df['Games'] = df.groupby(['Games']).ngroup()
df['Season'] = df.groupby(['Season']).ngroup()
df['City'] = df.groupby(['City']).ngroup()
df['Sport'] = df.groupby(['Sport']).ngroup()
df['Event'] = df.groupby(['Event']).ngroup()
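Since every column gets the same treatment, the encoding can also be sketched as a loop. This is a small illustration on made-up rows; `toy` and `categorical_columns` are names I introduced, and only three of the nine columns are shown.

```python
import pandas as pd

# Toy table with a few text columns (made-up rows).
toy = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "Season": ["Summer", "Winter", "Summer"],
    "Sport": ["Judo", "Curling", "Athletics"],
})

categorical_columns = ["Sex", "Season", "Sport"]
for col in categorical_columns:
    # ngroup() gives each distinct label an integer, in sorted label order.
    toy[col] = toy.groupby(col).ngroup()

print(toy)
```

Note that these integers are only labels: the fact that one sport gets a larger number than another carries no meaning by itself.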
After these operations, all columns are filled and numbered except the Age column: