Does the data analyst job have a future? -- Data Mining (Part 3)

Data Mining, Data Analysis · Published 2018-10-18 07:12:05


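After the simple analysis and cleaning in the earlier parts, I have a rough picture of the data analyst position, but I still do not understand the data well enough to answer the question in the title. This part digs further into the data.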

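1 I am a practical person: to judge the prospects, look at the money first

1.1 Salary distribution, sorted by count

[Figure: salary distribution, sorted by count]

The salary column needs preprocessing. The strings are converted to numbers in thousands per month. A salary range is treated optimistically by taking its upper bound, and an annual salary is converted to a monthly one by simply dividing by 12.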

import numpy as np
import matplotlib.pyplot as plt

def tran_salary(value):
    """Convert a salary string to thousands per month (upper bound of a range)."""
    text = str(value)
    if "千/月" in text:
        # split("-")[-1] takes the upper bound and also handles a single value
        value = int(float(text.replace("千/月", "").split("-")[-1].strip()))
    elif "萬/年" in text:
        value = text.replace("萬/年", "").split("-")[-1].strip()
        value = int(float(value) * 10 / 12)  # 萬 per year -> thousands per month
    elif "萬/月" in text:
        value = text.replace("萬/月", "").split("-")[-1].strip()
        value = int(float(value) * 10)
    elif "萬以上/月" in text:
        value = text.replace("萬以上/月", "").strip()
        value = int(float(value) * 10)
    elif "萬以上/年" in text:
        value = text.replace("萬以上/年", "").strip()
        value = int(float(value) * 10 / 12)
    else:
        value = 0  # unrecognised format
    return value

def salary_desc(data):
    data = data.copy()  # leave the caller's DataFrame untouched
    data['salary'] = data['salary'].apply(tran_salary)
    salaryGroup = data['salary'].groupby(data['salary'])
    # The 40 most frequent salary values
    salaryCount = salaryGroup.count().sort_values(ascending=False)[0:40]
    plt.figure(figsize=(22, 12))
    rects = plt.bar(x=np.arange(len(salaryCount.index)), height=salaryCount.values)
    plt.xticks(np.arange(len(salaryCount.index)), salaryCount.index, rotation=360)
    autolabel(rects)  # bar-labelling helper that is not shown in this part
    plt.title("工資分佈")
    plt.xlabel('工資(千/月)')
    plt.ylabel('數量')
    plt.savefig("data/工資分佈--按數量排序.jpg")
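To make the conversion rules concrete, here is a small standalone sketch that applies the same rules (upper bound of a range, 萬 counted as 10 thousand, annual divided by 12) to a few made-up strings. It uses a regular expression instead of the chain of string checks, so it is an alternative illustration rather than the code used for the charts; `parse_salary` and the sample strings are hypothetical.

```python
import re

# Hypothetical helper: number, optional "-number", unit (千 or 萬),
# optional "以上" (or more), then "/月" (per month) or "/年" (per year).
_SALARY_RE = re.compile(r"([\d.]+)(?:-([\d.]+))?(千|萬)(以上)?/(月|年)")

def parse_salary(text):
    """Return the salary in thousands per month, or 0 if the format is unknown."""
    match = _SALARY_RE.search(str(text))
    if match is None:
        return 0
    low, high, unit, _, period = match.groups()
    amount = float(high or low)  # optimistic: upper bound when a range is given
    if unit == "萬":
        amount *= 10             # 1 萬 = 10 thousand
    if period == "年":
        amount /= 12             # annual -> monthly
    return int(amount)

# Made-up sample strings, not taken from the dataset.
samples = ["6-8千/月", "1-1.5萬/月", "10-20萬/年", "3萬以上/月", "面議"]
print([parse_salary(s) for s in samples])  # [8, 15, 16, 30, 0]
```

Unlike the string checks above, this pattern would also accept a string such as "8千以上/月", so the two versions can disagree on formats that the original treats as unknown.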

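1.2 Salary distribution, sorted by salary

[Figure: salary distribution, sorted by salary]

To sort by salary instead, replace the line that assigns salaryCount in the previous code with the following: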

salaryCount=salaryGroup.count().sort_index(ascending=False)[0:40]
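The difference between the two orderings is easy to see on a toy example. The sketch below uses plain Python and made-up salaries; sorting the (salary, count) pairs by count is analogous to `sort_values`, and sorting them by salary is analogous to `sort_index`, although pandas may order ties differently.

```python
from collections import Counter

# Made-up monthly salaries in thousands.
salaries = [8, 10, 10, 15, 15, 15, 20, 30]
counts = Counter(salaries)

# Largest count first: which salary values are most common.
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Largest salary first: what the top of the pay scale looks like.
by_salary = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)

print(by_count[:3])
print(by_salary[:3])
```

The first chart answers "what do most postings pay", while the second answers "how many postings exist at each of the highest pay levels".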

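1.3 Salary shown in bands

The bar charts do not give an intuitive view of the share of each salary level. The pie chart below groups the salaries into the bands "below 10k, 10k to 20k, 20k to 30k, 30k to 40k, 40k to 50k, 50k to 100k, above 100k". It shows that about 90% of the salaries are below 20k.

[Figure: salary distribution by band, pie chart]

The bins and labels lines define the band boundaries and the text shown for each band. The key step is the pd.cut call, the pandas function that assigns each salary to its band.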

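import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def salary_desc_pie(data):
    data = data.copy()  # leave the caller's DataFrame untouched
    data['salary'] = data['salary'].apply(tran_salary)
    # Band edges in thousands per month. pd.cut uses right-closed intervals, so a
    # salary of 0 (unparsed) falls outside every band, and np.inf keeps the edges
    # increasing even when the largest salary is below 100.
    bins = [0, 10, 20, 30, 40, 50, 100, np.inf]
    labels = ['1萬以下', '1萬到2萬', '2萬到3萬', '3萬到4萬', '4萬到5萬', '5萬到10萬', '10萬以上']
    data['月薪分層'] = pd.cut(data['salary'], bins, labels=labels)
    # observed=False keeps empty bands so every label can be looked up
    salaryGroup = data['salary'].groupby(data['月薪分層'], observed=False).count()
    total = data['salary'].count()
    # Slice text: "band (share%,count)"
    pie_labels = [
        "%s (%s%%,%s)" % (x, round(float(salaryGroup[x]) / total * 100, 2), salaryGroup[x])
        for x in labels
    ]
    plt.figure(figsize=(22, 12))
    plt.pie(salaryGroup.values, labels=pie_labels,
            labeldistance=1.1, autopct='%2.0f%%', shadow=False,
            startangle=90, pctdistance=0.6)
    plt.axis('equal')
    plt.legend(loc='upper left', bbox_to_anchor=(-0.1, 1))
    plt.savefig("data/工資分佈--餅圖.jpg")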
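The banding step assigns each salary to a right-closed interval, which is the default behaviour of `pd.cut`. As a rough illustration of what that means at the boundaries, the sketch below reproduces the idea in plain Python with `bisect`; the English labels and the salaries are made up for the example.

```python
from bisect import bisect_left
from collections import Counter

# Upper edge of each band in thousands per month; the last band is open-ended.
edges = [10, 20, 30, 40, 50, 100]
labels = ["<=10k", "10k-20k", "20k-30k", "30k-40k", "40k-50k", "50k-100k", ">100k"]

def band(salary):
    # bisect_left makes each band right-closed: 10 lands in "<=10k",
    # while 10.5 lands in "10k-20k".
    return labels[bisect_left(edges, salary)]

# Made-up salaries, not taken from the dataset.
salaries = [6, 8, 10, 12, 15, 18, 20, 25, 45, 120]
counts = Counter(band(s) for s in salaries)
shares = {label: round(100 * n / len(salaries), 2) for label, n in counts.items()}
print(shares)
```

In this toy data, 7 of the 10 salaries fall in the two lowest bands, which is the kind of concentration the pie chart is meant to show.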

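2 What does it take to earn a high salary

2.1 Postings that pay more than 20k per month

The charts below lead to this conclusion: if you have a bachelor's degree and 3-4 years of work experience in the internet or e-commerce industry, you have a good chance of getting this salary at a private company.

[Figure: breakdown of postings that pay more than 20k per month]

Only one extra step is needed: the second line below keeps the postings whose monthly salary is above 20 thousand.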

data['salary'] = data['salary'].apply(tran_salary)
data = data[data['salary'] > 20]  # salary is in thousands per month

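2.2 Skills needed for a 20k monthly salary

The required skills are extracted from the requirements in the job descriptions. The description text is split into words, stop words are filtered out, and the most frequent keywords of the postings paying 20k are kept. A word cloud shows the result, which is easier to read at a glance.

[Figure: word cloud of keywords in job descriptions]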

import jieba
import pandas as pd
from wordcloud import WordCloud

def gen_userdict(data_ser):
    userdict = {}
    for index in data_ser.index:
        # Split each job description into words with jieba (precise mode)
        cutWord = jieba.cut(str(data_ser[index]), cut_all=False)
        for j in cutWord:
            j = j.replace(' ', '')
            if j != "":
                userdict[j] = userdict.get(j, 0) + 1
    user_dict_pf = pd.DataFrame(list(userdict.items()), columns=['word', 'num'])
    # Load the stop-word list and drop the stop words from the counts
    stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep=" ",
                            names=['stopword'], encoding='utf-8')
    user_dict_pf = user_dict_pf[~user_dict_pf['word'].isin(stopwords['stopword'])]
    # Keep the 100 most frequent words
    user_dict_pf = user_dict_pf.sort_values(by='num', ascending=False)[0:100]
    cloud_text = {}
    # Mode 'w' already truncates the file; the with-block closes it
    with open('data/job_desc_dict.txt', 'w', encoding='utf-8') as wfile:
        for idx, item in user_dict_pf.iterrows():
            cloud_text[item['word']] = item['num']
            wfile.write(item['word'] + ' ' + str(item['num']) + '\n')
    gen_word_cloud(cloud_text)

def gen_word_cloud(cloud_text):
    wc = WordCloud(
        background_color="white",  # background colour
        max_words=500,             # maximum number of words shown
        font_path="simhei.ttf",    # font used to render the words
        min_font_size=15,
        max_font_size=50,
        width=800                  # image width
    )
    wc.generate_from_frequencies(cloud_text)
    wc.to_file("data/詞雲.png")

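The jieba.cut call splits each job description into words.

The pd.read_csv and isin lines load the stop words and remove them from the word counts.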
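The counting and filtering steps do not depend on jieba or pandas themselves. As a minimal sketch, assuming the descriptions have already been split into tokens, the same idea can be written with `collections.Counter`; the token lists and the stop-word set below are made up for illustration.

```python
from collections import Counter

# Made-up, already-tokenised job descriptions. In the real pipeline these
# tokens would come from jieba.cut on the description text.
tokenised = [
    ["熟悉", "SQL", "和", "Python", " "],
    ["熟悉", "Python", "的", "資料", "分析"],
    ["SQL", "和", "Excel"],
]
stopwords = {"和", "的", "熟悉"}  # hypothetical stop-word list

# Count every non-empty token that is not a stop word.
freq = Counter(
    token.strip()
    for tokens in tokenised
    for token in tokens
    if token.strip() and token.strip() not in stopwords
)

top_words = freq.most_common(2)  # the most frequent keywords
print(top_words)
```

A dictionary of word to count like `freq` is the shape of input that `generate_from_frequencies` receives in the code above.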

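3 See whether you are dragging the average down

Check whether your salary has been "averaged", that is, whether you earn less than the mean for your level of experience.

[Figure: mean salary by years of work experience]

The tran_work_experience call converts the work experience column to a number of years.

The data['salary']<50 filter keeps only postings below 50k per month, so that a few very high salaries do not inflate the mean.

The groupby line groups the postings by work experience and computes the mean salary of each group.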

import numpy as np
import matplotlib.pyplot as plt

def tran_work_experience(value):
    """Convert an experience string to years (upper bound of a range such as "3-4年經驗")."""
    text = str(value)
    if "年經驗" in text:
        # split("-")[-1] covers both a range and a single value
        value = int(float(text.replace("年經驗", "").split("-")[-1].strip()))
    else:
        value = 0  # unrecognised format
    return value

def salary_work_experience_rel(data):
    data = data.copy()  # leave the caller's DataFrame untouched
    data['salary'] = data['salary'].apply(tran_salary)
    data['work_experience'] = data['work_experience'].apply(tran_work_experience)
    # Drop postings at or above 50k per month so outliers do not skew the mean
    data = data[data['salary'] < 50]
    # Mean salary for each experience level
    df_mean = data.groupby('work_experience')['salary'].mean()
    plt.figure(figsize=(22, 12))
    rects = plt.bar(x=np.arange(len(df_mean.index)), height=df_mean.values)
    plt.xticks(np.arange(len(df_mean.index)), df_mean.index, rotation=360)
    autolabel(rects)  # bar-labelling helper that is not shown in this part
    plt.title("工作經驗-工資關係")
    plt.xlabel('工作經驗(年)')
    plt.ylabel('工資(千/月)')
    plt.savefig("data/工作經驗-工資關係.jpg")
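To see why the salary cap matters before averaging, here is a small plain-Python sketch with made-up numbers. `mean_by_experience` is a hypothetical helper that mirrors the filter and group-mean steps without pandas.

```python
from collections import defaultdict
from statistics import mean

# Made-up (years of experience, monthly salary in thousands) pairs.
postings = [(1, 8), (1, 10), (3, 15), (3, 17), (3, 120), (5, 25)]

def mean_by_experience(rows, cap=None):
    """Group salaries by experience and average them, optionally dropping salaries >= cap."""
    groups = defaultdict(list)
    for years, salary in rows:
        if cap is None or salary < cap:
            groups[years].append(salary)
    return {years: mean(values) for years, values in sorted(groups.items())}

uncapped = mean_by_experience(postings)
capped = mean_by_experience(postings, cap=50)
print(uncapped)  # the single 120 pulls the 3-year mean above 50
print(capped)    # with the cap, the 3-year mean is 16
```

One extreme posting is enough to triple the mean of its group here, which is the distortion the filter at 50 is meant to avoid.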

Follow the official account and reply "51job" to get the project code.
