Data Analysis: Predicting Employee Attrition with R (Part 1)

The data can be downloaded directly. All field names are in English; some of the fields are described below:

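Variable type   Variable name              Description                        Values
Outcome         Attrition                  Whether the employee left          Yes, No
Predictor       Age                        Age                                Numeric
Predictor       BusinessTravel             Business travel                    1. Non-Travel, 2. Travel_Rarely, 3. Travel_Frequently
Predictor       Department                 Department                         1. Sales, 2. Research & Development, 3. Human Resources
Predictor       DistanceFromHome           Distance from home to the company  Numeric
Predictor       Education                  Education level                    1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'
Predictor       EducationField             Field of education
Predictor       EnvironmentSatisfaction    Environment satisfaction           1 'Low', 2 'Medium', 3 'High', 4 'Very High'
Predictor       Gender                     Gender                             1. Male, 2. Female
Predictor       JobInvolvement             Job involvement                    1 'Low', 2 'Medium', 3 'High', 4 'Very High'
Predictor       JobLevel                   Job level
Predictor       JobRole                    Job role
Predictor       JobSatisfaction            Job satisfaction                   1 'Low', 2 'Medium', 3 'High', 4 'Very High'
Predictor       MaritalStatus              Marital status                     1. Single, 2. Married, 3. Divorced
Predictor       MonthlyIncome              Monthly income                     Numeric
Predictor       NumCompaniesWorked         Number of companies worked at      Numeric
Predictor       OverTime                   Whether the employee works overtime  Yes, No
Predictor       PercentSalaryHike          Percentage salary increase         Numeric
Predictor       PerformanceRating          Performance rating                 1 'Low', 2 'Medium', 3 'High', 4 'Very High'
Predictor       RelationshipSatisfaction   Relationship satisfaction          1 'Low', 2 'Medium', 3 'High', 4 'Very High'
Predictor       StockOptionLevel           Employee stock option level        Numeric
Predictor       TotalWorkingYears          Total years of work experience     Numeric
Predictor       TrainingTimesLastYear      Training sessions in the last year Numeric
Predictor       WorkLifeBalance            Work-life balance                  1 'Bad', 2 'Good', 3 'Better', 4 'Best'
Predictor       YearsAtCompany             Years at the company               Numeric
Predictor       YearsInCurrentRole         Years in the current role          Numeric
Predictor       YearsSinceLastPromotion    Years since the last promotion     Numeric
Predictor       YearsWithCurrManager       Years with the current manager     Numeric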

Reading the Data

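After reading the data, run summary() on it to inspect the variables.
(Note: enable stringsAsFactors when reading the data, because it contains string variables. Converting them to factors lets summary() report a count for each level.)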
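To see why the stringsAsFactors note matters, here is a minimal sketch that reads the same text both ways. The inline CSV rows are made up for illustration and are not taken from the real file; only the column names match the dataset.

```r
# A tiny inline CSV (made-up rows) standing in for the real file
csv_text <- "Age,Attrition,OverTime
41,Yes,Yes
49,No,No
37,Yes,Yes
33,No,Yes"

# Read the same text twice, with and without factor conversion
as_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

class(as_chr$Attrition)  # character
class(as_fct$Attrition)  # factor

# For a character column summary() only reports length and class;
# for a factor it reports the count of each level
summary(as_chr$Attrition)
summary(as_fct$Attrition)
```

Since R 4.0.0 the default for stringsAsFactors in read.csv is FALSE, so the argument has to be set explicitly to get the per-level counts.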

> attr.df <- read.csv("HR-Employee-Attrition.csv", header = TRUE, stringsAsFactors = TRUE)
> summary(attr.df)
      Age        Attrition            BusinessTravel   DailyRate                       Department  DistanceFromHome
 Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0   Human Resources       : 63   Min.   : 1.000  
 1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0   Research & Development:961   1st Qu.: 2.000  
 Median :36.00              Travel_Rarely    :1043   Median : 802.0   Sales                 :446   Median : 7.000  
 Mean   :36.92                                       Mean   : 802.5                                Mean   : 9.193  
 3rd Qu.:43.00                                       3rd Qu.:1157.0                                3rd Qu.:14.000  
 Max.   :60.00                                       Max.   :1499.0                                Max.   :29.000  

   Education              EducationField EmployeeCount EmployeeNumber   EnvironmentSatisfaction    Gender   
 Min.   :1.000   Human Resources : 27    Min.   :1     Min.   :   1.0   Min.   :1.000           Female:588  
 1st Qu.:2.000   Life Sciences   :606    1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           Male  :882  
 Median :3.000   Marketing       :159    Median :1     Median :1020.5   Median :3.000                       
 Mean   :2.913   Medical         :464    Mean   :1     Mean   :1024.9   Mean   :2.722                       
 3rd Qu.:4.000   Other           : 82    3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000                       
 Max.   :5.000   Technical Degree:132    Max.   :1     Max.   :2068.0   Max.   :4.000                       

   HourlyRate     JobInvolvement    JobLevel                          JobRole    JobSatisfaction  MaritalStatus
 Min.   : 30.00   Min.   :1.00   Min.   :1.000   Sales Executive          :326   Min.   :1.000   Divorced:327  
 1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000   Research Scientist       :292   1st Qu.:2.000   Married :673  
 Median : 66.00   Median :3.00   Median :2.000   Laboratory Technician    :259   Median :3.000   Single  :470  
 Mean   : 65.89   Mean   :2.73   Mean   :2.064   Manufacturing Director   :145   Mean   :2.729                 
 3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000   Healthcare Representative:131   3rd Qu.:4.000                 
 Max.   :100.00   Max.   :4.00   Max.   :5.000   Manager                  :102   Max.   :4.000                 
                                                 (Other)                  :215                                 
 MonthlyIncome    MonthlyRate    NumCompaniesWorked Over18   OverTime   PercentSalaryHike PerformanceRating
 Min.   : 1009   Min.   : 2094   Min.   :0.000      Y:1470   No :1054   Min.   :11.00     Min.   :3.000    
 1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000               Yes: 416   1st Qu.:12.00     1st Qu.:3.000    
 Median : 4919   Median :14236   Median :2.000                          Median :14.00     Median :3.000    
 Mean   : 6503   Mean   :14313   Mean   :2.693                          Mean   :15.21     Mean   :3.154    
 3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                          3rd Qu.:18.00     3rd Qu.:3.000    
 Max.   :19999   Max.   :26999   Max.   :9.000                          Max.   :25.00     Max.   :4.000    

 RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
 Min.   :1.000            Min.   :80    Min.   :0.0000   Min.   : 0.00     Min.   :0.000         Min.   :1.000  
 1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000  
 Median :3.000            Median :80    Median :1.0000   Median :10.00     Median :3.000         Median :3.000  
 Mean   :2.712            Mean   :80    Mean   :0.7939   Mean   :11.28     Mean   :2.799         Mean   :2.761  
 3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000  
 Max.   :4.000            Max.   :80    Max.   :3.0000   Max.   :40.00     Max.   :6.000         Max.   :4.000  

 YearsAtCompany   YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
 Min.   : 0.000   Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
 1st Qu.: 3.000   1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
 Median : 5.000   Median : 3.000     Median : 1.000          Median : 3.000      
 Mean   : 7.008   Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
 3rd Qu.: 9.000   3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
 Max.   :40.000   Max.   :18.000     Max.   :15.000          Max.   :17.000      

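The data has 1,470 rows and 35 columns in total.
Attrition is the variable we are studying: it indicates whether an employee has left.
From the summary above we can see:
1. About 16% of employees left (237 of 1,470).
2. Mean monthly income is 6,503 and the median is 4,919. The mean sits well above the median and the maximum is 19,999, so the distribution is right-skewed and the median is the better indicator of typical pay.
3. About 28% of employees work overtime (416 of 1,470, from the OverTime field).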
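The percentages above can be reproduced from the level counts printed by summary(). This is a small sketch; the variable names attrition_counts and overtime_counts are introduced here for illustration only.

```r
# Counts copied from the summary() output above
attrition_counts <- c(No = 1233, Yes = 237)
overtime_counts  <- c(No = 1054, Yes = 416)

# Share of each level in percent, rounded to one decimal place
attrition_pct <- round(100 * prop.table(attrition_counts), 1)
overtime_pct  <- round(100 * prop.table(overtime_counts), 1)

print(attrition_pct)
print(overtime_pct)

# With the full data frame loaded, the same figures would come from:
# round(100 * prop.table(table(attr.df$Attrition)), 1)
# round(100 * prop.table(table(attr.df$OverTime)), 1)
```

This gives 16.1% for Attrition = Yes and 28.3% for OverTime = Yes, matching the rounded figures in the list.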

Data Analysis and Visualization

Next, let's look at how attrition relates to each variable:

> library(ggplot2)
> library(gridExtra)
> g1 <- ggplot(attr.df, aes(x=Age,fill=Attrition))+
+   geom_density(alpha = 0.7)
> g2 <- ggplot(attr.df, aes(x=DistanceFromHome, fill=Attrition))+
+   geom_density(alpha = 0.7)
> g3 <- ggplot(attr.df, aes(x=MonthlyIncome, fill=Attrition))+
+   geom_density(alpha = 0.7)
> g4 <- ggplot(attr.df, aes(x=NumCompaniesWorked, fill= Attrition))+
+   geom_density(alpha = 0.7)
> g5 <- ggplot(attr.df, aes(x=TotalWorkingYears, fill= Attrition))+
+   geom_density(alpha = 0.7)
> g6 <- ggplot(attr.df, aes(x=TrainingTimesLastYear, fill= Attrition))+
+   geom_density(alpha = 0.7)
> g7 <- ggplot(attr.df, aes(x=YearsAtCompany, fill= Attrition))+
+   geom_density(alpha = 0.7)
> g8 <- ggplot(attr.df, aes(x=YearsInCurrentRole, fill= Attrition))+
+   geom_density(alpha = 0.7)
> g9 <- ggplot(attr.df, aes(x=YearsWithCurrManager, fill= Attrition))+
+   geom_density(alpha = 0.7)
> grid.arrange(g1,g2,g3,g4,g5,g6,g7,g8,g9, ncol = 3, nrow = 3)
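The nine plots above differ only in the x variable, so they could also be built in a loop. The following is a sketch, not the code used in this post: the helper density_by_attrition and the data frame toy.df are hypothetical, and toy.df holds randomly generated values so the block runs without the CSV file.

```r
library(ggplot2)

# Toy stand-in for attr.df (random values, only so the sketch runs on its own)
set.seed(1)
toy.df <- data.frame(
  Attrition      = factor(sample(c("No", "Yes"), 200, replace = TRUE,
                                 prob = c(0.84, 0.16))),
  Age            = sample(18:60, 200, replace = TRUE),
  MonthlyIncome  = round(runif(200, 1000, 20000)),
  YearsAtCompany = sample(0:40, 200, replace = TRUE)
)

# Hypothetical helper: one density plot of `var`, filled by Attrition.
# .data[[var]] looks the column up by its name given as a string.
density_by_attrition <- function(df, var) {
  ggplot(df, aes(x = .data[[var]], fill = Attrition)) +
    geom_density(alpha = 0.7) +
    labs(x = var)
}

vars  <- c("Age", "MonthlyIncome", "YearsAtCompany")
plots <- lapply(vars, function(v) density_by_attrition(toy.df, v))
names(plots) <- vars

# With gridExtra installed, the panels could be arranged as in the original:
# gridExtra::grid.arrange(grobs = plots, ncol = 3)
```

On the real data, passing attr.df and the nine column names would give the same grid as the g1 to g9 version.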

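Figure: attrition versus each of the nine variables
Nine variables are selected here and drawn as kernel density curves.
From the plots we can see:
1. By age, attrition peaks at around 30.
2. By distance from home, employees living more than 10 miles away are more likely to leave.
3. Lower-income employees are more likely to leave.
4. Employees who have worked at more than 5 companies are more likely to leave.
5. Attrition is higher among employees with fewer than 5 years of total work experience.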

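A possible reason is that younger employees are more inclined to try different things and are less certain about their future goals. The high attrition rate also suggests that these employees are unlikely to form a lasting identification with the company's values in a short time.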
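The observations above are read off density curves. One way to check such an impression numerically is to bucket a variable and compute the attrition rate per bucket. The sketch below uses a hypothetical 12-row data frame toy.df with made-up values, so the printed rates illustrate the method and say nothing about the real data.

```r
# Hypothetical toy data (not taken from the real file), just to make the sketch runnable
toy.df <- data.frame(
  Age       = c(22, 25, 28, 29, 31, 33, 35, 38, 41, 44, 47, 52),
  Attrition = c("Yes", "Yes", "No", "Yes", "No", "Yes",
                "No", "No", "No", "No", "Yes", "No")
)

# Bucket Age into bands; right = FALSE makes each band include its lower bound
toy.df$AgeBand <- cut(toy.df$Age, breaks = c(18, 30, 40, 61), right = FALSE)

# Share of "Yes" within each band (mean of a logical vector is a proportion)
rate_by_band <- tapply(toy.df$Attrition == "Yes", toy.df$AgeBand, mean)
print(round(rate_by_band, 2))
```

On the real data the same two lines would apply to attr.df$Age and attr.df$Attrition, and the same pattern works for DistanceFromHome, MonthlyIncome, or TotalWorkingYears with suitable breaks.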