
Machine Learning Engineer - Udacity Project 0: Predict Your Next World Cuisine

Step 1. Download and import the data

1.1 Dataset: https://www.kaggle.com/c/whats-cooking/data

1.2 Load the data

# Import dependencies
import json
import codecs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the datasets
train_filename = 'train.json'
train_content = pd.read_json(codecs.open(train_filename, mode='r', encoding='utf-8'))

test_filename = 'test.json'
test_content = pd.read_json(codecs.open(test_filename, mode='r', encoding='utf-8'))

# Print the number of loaded samples
print("The cuisine dataset contains {} training samples and {} test samples.\n".format(len(train_content), len(test_content)))
if len(train_content) == 39774 and len(test_content) == 9944:
    print("Data loaded successfully!")
else:
    print("There was a problem loading the data; please check the file paths!")

The cuisine dataset contains 39774 training samples and 9944 test samples.
Data loaded successfully!

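1.3 Data preview
To see how the dataset is distributed and how many cuisine types it contains, we print a few sample records.

pd.set_option('display.max_colwidth',120)

Programming exercise
Use the head() function to preview the training set train_content (print the first 5 rows).

### TODO: print the first 5 samples of train_content to preview the data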
print(train_content.head())

cuisine id \
0 greek 10259
1 southern_us 25693
2 filipino 20130
3 indian 22213
4 indian 13162

ingredients
0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...

## List all cuisine categories
categories=np.unique(train_content['cuisine'])
print("There are {} cuisines in total:\n{}".format(len(categories),categories))

There are 20 cuisines in total:
['brazilian' 'british' 'cajun_creole' 'chinese' 'filipino' 'french' 'greek'
'indian' 'irish' 'italian' 'jamaican' 'japanese' 'korean' 'mexican'
'moroccan' 'russian' 'southern_us' 'spanish' 'thai' 'vietnamese']

 

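Step 2. Analyze the data
Since the final goal of this project is to build a model that predicts world cuisines, we need to split the dataset into features and a target variable.

Feature: 'ingredients', which gives the names of the ingredients each dish contains.
Target variable: 'cuisine', the cuisine category we want to predict.
They are stored in the variables train_ingredients and train_targets, respectively.

Programming exercise: data extraction
Assign the ingredients column of train_content to train_ingredients.
Assign the cuisine column of train_content to train_targets.

### TODO: assign the features and the target variable
train_ingredients = train_content['ingredients']
train_targets = train_content['cuisine']

### TODO: print the results to check that the assignment is correct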
print(train_ingredients)
print(train_targets)

0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...
5 [plain flour, sugar, butter, eggs, fresh ginger root, salt, ground cinnamon, milk, vanilla extract, ground ginger, p...
6 [olive oil, salt, medium shrimp, pepper, garlic, chopped cilantro, jalapeno chilies, flat leaf parsley, skirt steak,...
7 [sugar, pistachio nuts, white almond bark, flour, vanilla extract, olive oil, almond extract, eggs, baking powder, d...
8 [olive oil, purple onion, fresh pineapple, pork, poblano peppers, corn tortillas, cheddar cheese, ground black peppe...
9 [chopped tomatoes, fresh basil, garlic, extra-virgin olive oil, kosher salt, flat leaf parsley]
10 [pimentos, sweet pepper, dried oregano, olive oil, garlic, sharp cheddar cheese, pepper, swiss cheese, provolone che...
11 [low sodium soy sauce, fresh ginger, dry mustard, green beans, white pepper, sesame oil, scallions, canola oil, suga...
12 [Italian parsley leaves, walnuts, hot red pepper flakes, extra-virgin olive oil, fresh lemon juice, trout fillet, ga...
13 [ground cinnamon, fresh cilantro, chili powder, ground coriander, kosher salt, ground black pepper, garlic, plum tom...
14 [fresh parmesan cheese, butter, all-purpose flour, fat free less sodium chicken broth, chopped fresh chives, gruyere...
15 [tumeric, vegetable stock, tomatoes, garam masala, naan, red lentils, red chili peppers, onions, spinach, sweet pota...
16 [greek yogurt, lemon curd, confectioners sugar, raspberries]
17 [italian seasoning, broiler-fryer chicken, mayonaise, zesty italian dressing]
18 [sugar, hot chili, asian fish sauce, lime juice]
19 [soy sauce, vegetable oil, red bell pepper, chicken broth, yellow squash, garlic chili sauce, sliced green onions, b...
20 [pork loin, roasted peanuts, chopped cilantro fresh, hoisin sauce, creamy peanut butter, chopped fresh mint, thai ba...
21 [roma tomatoes, kosher salt, purple onion, jalapeno chilies, lime, chopped cilantro]
22 [low-fat mayonnaise, pepper, salt, baking potatoes, eggs, spicy brown mustard]
23 [sesame seeds, red pepper, yellow peppers, water, extra firm tofu, broccoli, soy sauce, orange bell pepper, arrowroo...
24 [marinara sauce, flat leaf parsley, olive oil, linguine, capers, crushed red pepper flakes, olives, lemon zest, garlic]
25 [sugar, lo mein noodles, salt, chicken broth, light soy sauce, flank steak, beansprouts, dried black mushrooms, pepp...
26 [herbs, lemon juice, fresh tomatoes, paprika, mango, stock, chile pepper, onions, red chili peppers, oil]
27 [ground black pepper, butter, sliced mushrooms, sherry, salt, grated parmesan cheese, heavy cream, spaghetti, chicke...
28 [green bell pepper, egg roll wrappers, sweet and sour sauce, corn starch, molasses, vegetable oil, oil, soy sauce, s...
29 [flour tortillas, cheese, breakfast sausages, large eggs]
...
39744 [extra-virgin olive oil, oregano, potatoes, garlic cloves, pepper, salt, yellow mustard, fresh lemon juice]
39745 [quinoa, extra-virgin olive oil, fresh thyme leaves, scallion greens]
39746 [clove, bay leaves, ginger, chopped cilantro, ground turmeric, white onion, cinnamon, cardamom pods, serrano chile, ...
39747 [water, sugar, grated lemon zest, butter, pitted date, blanched almonds]
39748 [sea salt, pizza doughs, all-purpose flour, cornmeal, extra-virgin olive oil, shredded mozzarella cheese, kosher sal...
39749 [kosher salt, minced onion, tortilla chips, sugar, tomato juice, cilantro leaves, avocado, lime juice, roma tomatoes...
39750 [ground black pepper, chicken breasts, salsa, cheddar cheese, pepper jack, heavy cream, red enchilada sauce, unsalte...
39751 [olive oil, cayenne pepper, chopped cilantro fresh, boneless chicken skinless thigh, fine sea salt, low salt chicken...
39752 [self rising flour, milk, white sugar, butter, peaches in light syrup]
39753 [rosemary sprigs, lemon zest, garlic cloves, ground black pepper, vegetable broth, fresh basil leaves, minced garlic...
39754 [jasmine rice, bay leaves, sticky rice, rotisserie chicken, chopped cilantro, large eggs, vegetable oil, yellow onio...
39755 [mint leaves, cilantro leaves, ghee, tomatoes, cinnamon, oil, basmati rice, garlic paste, salt, coconut milk, clove,...
39756 [vegetable oil, cinnamon sticks, water, all-purpose flour, piloncillo, salt, orange zest, baking powder, hot water]
39757 [red bell pepper, garlic cloves, extra-virgin olive oil, feta cheese crumbles]
39758 [milk, salt, ground cayenne pepper, ground lamb, ground cinnamon, ground black pepper, pomegranate, chopped fresh mi...
39759 [red chili peppers, sea salt, onions, water, chilli bean sauce, caster sugar, garlic, white vinegar, chili oil, cucu...
39760 [butter, large eggs, cornmeal, baking powder, boiling water, milk, salt]
39761 [honey, chicken breast halves, cilantro leaves, carrots, soy sauce, Sriracha, wonton wrappers, freshly ground pepper...
39762 [curry powder, salt, chicken, water, vegetable oil, basmati rice, eggs, finely chopped onion, lemon juice, pepper, m...
39763 [fettuccine pasta, low-fat cream cheese, garlic, nonfat evaporated milk, grated parmesan cheese, corn starch, nonfat...
39764 [chili powder, worcestershire sauce, celery, red kidney beans, lean ground beef, stewed tomatoes, dried parsley, pep...
39765 [coconut, unsweetened coconut milk, mint leaves, plain yogurt]
39766 [rutabaga, ham, thick-cut bacon, potatoes, fresh parsley, salt, onions, pepper, carrots, pork sausages]
39767 [low-fat sour cream, grated parmesan cheese, salt, dried oregano, low-fat cottage cheese, butter, onions, olive oil,...
39768 [shredded cheddar cheese, crushed cheese crackers, cheddar cheese soup, cream of chicken soup, hot sauce, diced gree...
39769 [light brown sugar, granulated sugar, butter, warm water, large eggs, all-purpose flour, whole wheat flour, cooking ...
39770 [KRAFT Zesty Italian Dressing, purple onion, broccoli florets, rotini, pitted black olives, Kraft Grated Parmesan Ch...
39771 [eggs, citrus fruit, raisins, sourdough starter, flour, hot tea, sugar, ground nutmeg, salt, ground cinnamon, milk, ...
39772 [boneless chicken skinless thigh, minced garlic, steamed white rice, baking powder, corn starch, dark soy sauce, kos...
39773 [green chile, jalapeno chilies, onions, ground black pepper, salt, chopped cilantro fresh, green bell pepper, garlic...
Name: ingredients, Length: 39774, dtype: object
0 greek
1 southern_us
2 filipino
3 indian
4 indian
5 jamaican
6 spanish
7 italian
8 mexican
9 italian
10 italian
11 chinese
12 italian
13 mexican
14 italian
15 indian
16 british
17 italian
18 thai
19 vietnamese
20 thai
21 mexican
22 southern_us
23 chinese
24 italian
25 chinese
26 cajun_creole
27 italian
28 chinese
29 mexican
...
39744 greek
39745 spanish
39746 indian
39747 moroccan
39748 italian
39749 mexican
39750 mexican
39751 moroccan
39752 southern_us
39753 italian
39754 vietnamese
39755 indian
39756 mexican
39757 greek
39758 greek
39759 korean
39760 southern_us
39761 chinese
39762 indian
39763 italian
39764 mexican
39765 indian
39766 irish
39767 italian
39768 mexican
39769 irish
39770 italian
39771 irish
39772 chinese
39773 mexican
Name: cuisine, Length: 39774, dtype: object

Programming exercise: basic statistics
What are the 10 most frequently used ingredients?
What are the 10 most common ingredients in Italian cuisine?

## TODO: count how often each ingredient appears and store the counts in the sum_ingredients dict
m = []
for i in range(len(train_ingredients)):
    m += train_ingredients[i]
sum_ingredients = pd.Series(m).value_counts().to_dict()
# Finally, plot the 10 most used ingredients
plt.style.use('ggplot')
ax = pd.Series(sum_ingredients).sort_values(ascending=False).head(10).plot(kind='barh')
ax.invert_yaxis()  # put the most frequent ingredient at the top
fig = ax.get_figure()
fig.tight_layout()

## TODO: count how often each ingredient appears in Italian dishes and store the counts in the italian_ingredients dict
list_italian = train_content.loc[train_content['cuisine'] == 'italian', 'ingredients'].reset_index(drop=True)
n = []
for j in range(len(list_italian)):
    n += list_italian[j]
italian_ingredients = pd.Series(n).value_counts().to_dict()
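
The two questions in this exercise can also be answered with collections.Counter from the standard library. The following is a minimal sketch on a tiny made-up sample: the `recipes` list and the `count_ingredients` helper are hypothetical and not part of the original project. On the real data the same idea would be applied to the rows of train_content.

```python
from collections import Counter

# Hypothetical toy sample standing in for train_content (not real dataset rows)
recipes = [
    {"cuisine": "italian", "ingredients": ["olive oil", "garlic", "salt"]},
    {"cuisine": "italian", "ingredients": ["olive oil", "basil", "salt"]},
    {"cuisine": "indian", "ingredients": ["salt", "water", "cumin"]},
]

def count_ingredients(rows, cuisine=None):
    """Count ingredient occurrences, optionally restricted to one cuisine."""
    counts = Counter()
    for row in rows:
        if cuisine is None or row["cuisine"] == cuisine:
            counts.update(row["ingredients"])
    return counts

sum_counts = count_ingredients(recipes)
italian_counts = count_ingredients(recipes, cuisine="italian")

# most_common(10) returns up to ten (ingredient, count) pairs, most frequent first
print(sum_counts.most_common(10))
print(italian_counts.most_common(10))
```

Calling most_common(10) gives the top-10 list directly, which the dictionaries built above do not print on their own.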

 

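Step 3. Build the model

3.1 Word cleaning
Dishes contain many ingredients, and the same ingredient may appear in different forms (singular or plural, different tenses, and so on). To remove these differences, we filter the ingredients.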

import re
from nltk.stem import WordNetLemmatizer
import numpy as np

def text_clean(ingredients):
    # Strip punctuation: keep only the letters a-z and A-Z
    ingredients = np.array(ingredients).tolist()
    print("Ingredients:\n{}".format(ingredients[9]))
    ingredients = [[re.sub('[^A-Za-z]', ' ', word) for word in component] for component in ingredients]
    print("After removing punctuation:\n{}".format(ingredients[9]))

    # Normalize singular/plural and other inflections by reducing each word to its lemma
    lemma = WordNetLemmatizer()
    ingredients = [" ".join([" ".join([lemma.lemmatize(w) for w in words.split(" ")]) for words in component]) for component in ingredients]
    print("After lemmatization:\n{}".format(ingredients[9]))
    return ingredients

print("\nProcessing the training set...")
train_ingredients = text_clean(train_content['ingredients'])
print("\nProcessing the test set...")
test_ingredients = text_clean(test_content['ingredients'])

Processing the training set...
Ingredients:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra-virgin olive oil', 'kosher salt', 'flat leaf parsley']
After removing punctuation:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra virgin olive oil', 'kosher salt', 'flat leaf parsley']
After lemmatization:
chopped tomato fresh basil garlic extra virgin olive oil kosher salt flat leaf parsley

Processing the test set...
Ingredients:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After removing punctuation:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After lemmatization:
egg cherry date dark muscovado sugar ground cinnamon mixed spice cake vanilla extract self raising flour sultana rum raisin prune glace cherry butter port
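
One detail of the punctuation step: replacing each non-letter character with its own space can leave leading or repeated spaces when an ingredient name contains digits or several symbols in a row. This should not affect the TF-IDF step, since tokenization ignores the extra whitespace, but a tidier string can be produced by collapsing runs. The helper below is a hypothetical sketch (`clean_ingredient` is not part of the original project), and the second sample string is made up for illustration.

```python
import re

def clean_ingredient(text):
    """Replace every run of non-letter characters with one space, then trim."""
    return re.sub(r"[^A-Za-z]+", " ", text).strip()

print(clean_ingredient("extra-virgin olive oil"))
print(clean_ingredient("2% low-fat milk"))
```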

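3.2 Feature extraction
In this step we convert each dish's ingredients into a numerical feature vector. Because the vast majority of dishes contain salt, water, sugar, butter and the like, vectors extracted with one-hot encoding would not distinguish cuisines well. We therefore weight each ingredient by how often it appears: the more dishes an ingredient appears in, the less discriminative it is. The feature we use is TF-IDF; for an introduction, see: TF-IDF與餘弦相似性的應用(一):自動提取關鍵詞 (Applications of TF-IDF and Cosine Similarity, Part 1: Automatic Keyword Extraction).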
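
To make the weighting idea concrete, here is a minimal sketch of the inverse-document-frequency part only, using the add-one smoothed form ln((1 + n) / (1 + df)) + 1, which, as far as I know, matches the default of scikit-learn's TfidfVectorizer. The `docs` list and the `smoothed_idf` helper are hypothetical. The real vectorizer additionally multiplies by term frequency and normalizes each row.

```python
import math

# Hypothetical cleaned recipes (each string is one dish), for illustration only
docs = [
    "salt garlic olive oil basil",
    "salt soy sauce ginger",
    "salt butter flour",
]

def smoothed_idf(term, documents):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

print(smoothed_idf("salt", docs))   # appears in every dish, so it gets the lowest weight
print(smoothed_idf("basil", docs))  # appears in one dish, so it gets a higher weight
```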

from sklearn.feature_extraction.text import TfidfVectorizer
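# Convert the ingredients into feature vectors

# Process the training set
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 1),
                analyzer='word', max_df=.57, binary=False,
                token_pattern=r"\w+", sublinear_tf=False)
# toarray() returns a plain ndarray; the np.matrix returned by todense() is not accepted by newer scikit-learn versions
train_tfidf = vectorizer.fit_transform(train_ingredients).toarray()

## Process the test set
test_tfidf = vectorizer.transform(test_ingredients)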
train_targets=np.array(train_content['cuisine']).tolist()
train_targets[:10]

['greek',
'southern_us',
'filipino',
'indian',
'indian',
'jamaican',
'spanish',
'italian',
'mexican',
'italian']

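Programming exercise
To keep errors accumulated in the earlier steps from breaking the steps that follow, we check here that the processed data is correct. Print the first five entries of train_tfidf and train_targets.

# Preview the first five entries of train_tfidf and train_targets (by slicing, since neither is a DataFrame with head())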
print(train_tfidf[:5])
print(train_targets[:5])

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
['greek', 'southern_us', 'filipino', 'indian', 'indian']

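3.3 Validation split
To get a rough estimate of the model's accuracy during the experiment, we set aside 20% of the original training data (train_ingredients) as a validation set (valid_ingredients).

Programming exercise: splitting and shuffling the data
Call the train_test_split function to split the training set into a new training set and a validation set, so that model accuracy can be observed later.

Import train_test_split from sklearn.model_selection.
Pass train_tfidf and train_targets to train_test_split as inputs.
Set test_size to 0.2 to hold out 20% as the validation set, leaving 80% as the new training set.
Set the random_state seed so that every run produces the same split. (With a fixed seed, the generated random sequence is deterministic.)

### TODO: split off the validation set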
from sklearn.model_selection import train_test_split
X_train , X_valid , y_train, y_valid = train_test_split(train_tfidf, train_targets, test_size = 0.2, random_state=0)
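
To illustrate why a fixed random_state reproduces the same split, here is a minimal pure-Python sketch of a seeded shuffle-and-split. It is not scikit-learn's implementation, and `seeded_split` is a hypothetical helper used only for this demonstration.

```python
import random

def seeded_split(items, test_size=0.2, seed=0):
    """Shuffle indices with a seeded generator, then cut off the validation part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(len(items) * test_size))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return [items[i] for i in train_idx], [items[i] for i in valid_idx]

data = list(range(10))
train_a, valid_a = seeded_split(data, seed=0)
train_b, valid_b = seeded_split(data, seed=0)

# The same seed yields the same shuffle, hence the same validation set
print(valid_a == valid_b)
```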

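3.4 Build the model
Use the logistic regression model (Logistic Regression) from sklearn.

Programming exercise: train the model

Import LogisticRegression from sklearn.linear_model.
Import GridSearchCV from sklearn.model_selection. It searches parameters automatically: given the candidate values, it reports the best result and the best parameters. This approach is suited to small datasets.
Define the parameters variable: a dictionary for the C parameter whose value is the list of integers from 1 to 10.
Define the classifier variable: a classifier created with the imported LogisticRegression.
Define the grid variable: a grid search object created with the imported GridSearchCV, passing 'classifier' and 'parameters' to its constructor.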

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

## TODO: build the logistic regression model
parameters = {'C':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
classifier = LogisticRegression()
grid = GridSearchCV(classifier, parameters)

grid = grid.fit(X_train, y_train)
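
After fitting, a grid search object records which candidate performed best in cross-validation. The sketch below shows how that could be inspected. It runs on a small synthetic dataset so that it finishes quickly; the data, the name `demo_grid`, the reduced candidate list, and the max_iter and cv settings are assumptions made for this example, not values from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset, used only so the sketch runs quickly
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

demo_grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [1, 10]}, cv=3)
demo_grid.fit(X, y)

print(demo_grid.best_params_)  # the C value with the best mean cross-validation score
print(demo_grid.best_score_)   # that mean cross-validation accuracy
```

The same two attributes are available on the fitted grid object above.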

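After training, we compute the model's predictions on the validation set X_valid and its prediction accuracy (comparing against y_valid element by element).

from sklearn.metrics import accuracy_score ## computes the model's accuracy

valid_predict = grid.predict(X_valid)
valid_score=accuracy_score(y_valid,valid_predict)

print("Score on the validation set: {}".format(valid_score))

Score on the validation set: 0.7967316153362665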
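
The accuracy score is simply the fraction of validation samples whose predicted cuisine equals the true one. A minimal sketch of that element-by-element comparison, using made-up labels and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for illustration: three of four predictions are right
y_true = ["greek", "indian", "italian", "mexican"]
y_pred = ["greek", "indian", "french", "mexican"]
print(accuracy(y_true, y_pred))
```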

 

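Step 4. Model prediction (optional)

4.1 Predict on the test set

Programming exercise
Use the model grid to predict on the test set test_tfidf, then inspect the predictions.

### TODO: predict the test results
predictions = grid.predict(test_tfidf)

print("Number of test samples predicted: {}".format(len(predictions)))
test_content['cuisine']=predictions
test_content.head(10)

Number of test samples predicted: 9944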

4.2 Submit the results

## Load the submission format
submit_frame = pd.read_csv("sample_submission.csv")
## Save the results
result = pd.merge(submit_frame, test_content, on="id", how='left')
result = result.rename(index=str, columns={"cuisine_y": "cuisine"})
test_result_name = "tfidf_cuisine_test.csv"
result[['id','cuisine']].to_csv(test_result_name,index=False)
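
The rename of cuisine_y may look odd at first. Assuming sample_submission.csv also has a cuisine column (which the rename suggests), pandas appends the suffixes _x and _y to the overlapping column names during the merge, so the predictions end up in cuisine_y. A minimal sketch with hypothetical miniature frames:

```python
import pandas as pd

# Hypothetical miniature versions of sample_submission.csv and test_content
submit_frame = pd.DataFrame({"id": [1, 2], "cuisine": ["italian", "italian"]})
test_content = pd.DataFrame({"id": [2, 1],
                             "ingredients": [["salt"], ["water"]],
                             "cuisine": ["greek", "thai"]})

# A left merge keeps the row order of submit_frame; overlapping columns get _x / _y
result = pd.merge(submit_frame, test_content, on="id", how="left")
print(list(result.columns))

# Keep the predicted labels (from the right-hand frame) under the name 'cuisine'
result = result.rename(columns={"cuisine_y": "cuisine"})
print(result[["id", "cuisine"]])
```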