1. 程式人生 > >第七章 食物資料庫例子

第七章 食物資料庫例子

In [3]: import json

In [4]: db=json.load(open('E:/foods-2011-10-03.json'))

In [5]: len(db)
Out[5]: 6636

In [6]:
    db中的每條目都是一個含有某種食物全部資料的字典,nutrients是字典的一個列表,其中每個字典對應一種營養成分:
db[0].keys()
Out[6]: 
[u'portions',
 u'description',
 u'tags',
 u'nutrients',
 u'group',
 u'id',
 u'manufacturer']
In [7]: db[0]['nutrients'][0]

Out[7]:

{u'description': u'Protein',

u'group': u'Composition',

u'units': u'g',

u'value': 25.18}

In [8]: from pandas import DataFrame

In [9]: nutrients=DataFrame(db[0]['nutrients'])

In [11]: nutrients[:7]

Out[11]:

description group units value

0 Protein Composition g 25.18

1 Total lipid (fat) Composition g 29.20

2 Carbohydrate, by difference Composition g 3.06

3 Ash Other g 3.28

4 Energy Energy kcal 376.00

5 Water Composition g 39.28

6 Energy Energy kJ 1573.00

    將字典列表轉換為DataFrame時,可以只抽取其中一部分欄位,將抽取食物的名稱、分類、編號以及製造商資訊:

In [12]: info_keys=['description','group','id','manufacturer']

In [13]: info=DataFrame(db,columns=info_keys)

In [14]: info[:5]
Out[14]: 
                          description                   group    id  
0                     Cheese, caraway  Dairy and Egg Products  1008   
1                     Cheese, cheddar  Dairy and Egg Products  1009   
2                        Cheese, edam  Dairy and Egg Products  1018   
3                        Cheese, feta  Dairy and Egg Products  1019   
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028   
  manufacturer  
0               
1               
2               
3               
4 
import pandas as pd

#通過value_counts,檢視食物類別分佈情況

pd.value_counts(info.group)[:10]
Out[20]:
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Fruits and Fruit Juices              328
Pork Products                        328
Name: group, dtype: int64#==============================================================================
# 將所有食物的營養成分整合到一個大表中:
#1、將各食物的營養成分列表轉換為一個DataFrame,並新增一個表示編號的列
#2、該DataFrame新增到一個表中
#3、通過concat將這些東西連線起來
#==============================================================================
nutrients=[]
for rec in db:
    fnuts=DataFrame(rec['nutrients'])
    fnuts['id']=rec['id']
    nutrients.append(fnuts)
nutrients=pd.concat(nutrients,ignore_index=True)
nutrients[:10]
Out[23]:

description group units value id

0 Protein Composition g 25.18 1008

1 Total lipid (fat) Composition g 29.20 1008

2 Carbohydrate, by difference Composition g 3.06 1008

3 Ash Other g 3.28 1008

4 Energy Energy kcal 376.00 1008

5 Water Composition g 39.28 1008

6 Energy Energy kJ 1573.00 1008

7 Fiber, total dietary Composition g 0.00 1008

8 Calcium, Ca Elements mg 673.00 1008

9 Iron, Fe Elements mg 0.64 1008

In [24]: #重複項

In [25]: nutrients.duplicated().sum()

Out[25]: 14179

In [26]: nutrients=nutrients.drop_duplicates()

In [27]: col_mapping={'description':'food','group':'fgroup'}

In [28]: info=info.rename(columns=col_mapping,copy=False)

In [29]: info[:10]

Out[29]:

food fgroup \

0 Cheese, caraway Dairy and Egg Products

1 Cheese, cheddar Dairy and Egg Products

2 Cheese, edam Dairy and Egg Products

3 Cheese, feta Dairy and Egg Products

4 Cheese, mozzarella, part skim milk Dairy and Egg Products

5 Cheese, mozzarella, part skim milk, low moisture Dairy and Egg Products

6 Cheese, romano Dairy and Egg Products

7 Cheese, roquefort Dairy and Egg Products

8 Cheese spread, pasteurized process, american, ... Dairy and Egg Products

9 Cream, fluid, half and half Dairy and Egg Products

id manufacturer

0 1008

1 1009

2 1018

3 1019

4 1028

5 1029

6 1038

7 1039

8 1048

9 1049

In [64]: col_mapping={'description':'nutrient','group':'nutgroup'}

In [65]: info=nutrients.rename(columns=col_mapping,copy=False)

In [66]: info[:10]

Out[66]:

nutrient nutgroup units value id

0 Protein Composition g 25.18 1008

1 Total lipid (fat) Composition g 29.20 1008

2 Carbohydrate, by difference Composition g 3.06 1008

3 Ash Other g 3.28 1008

4 Energy Energy kcal 376.00 1008

5 Water Composition g 39.28 1008

6 Energy Energy kJ 1573.00 1008

7 Fiber, total dietary Composition g 0.00 1008

8 Calcium, Ca Elements mg 673.00 1008

9 Iron, Fe Elements mg 0.64 1008

In [67]: #可以將info1和info合併起來

In [68]: ndata=pd.merge(info,info1,on='id',how='outer')

In [69]: ndata[:10]

Out[69]:

nutrient nutgroup units value id \

0 Protein Composition g 25.18 1008

1 Total lipid (fat) Composition g 29.20 1008

2 Carbohydrate, by difference Composition g 3.06 1008

3 Ash Other g 3.28 1008

4 Energy Energy kcal 376.00 1008

5 Water Composition g 39.28 1008

6 Energy Energy kJ 1573.00 1008

7 Fiber, total dietary Composition g 0.00 1008

8 Calcium, Ca Elements mg 673.00 1008

9 Iron, Fe Elements mg 0.64 1008

food fgroup manufacturer

0 Cheese, caraway Dairy and Egg Products

1 Cheese, caraway Dairy and Egg Products

2 Cheese, caraway Dairy and Egg Products

3 Cheese, caraway Dairy and Egg Products

4 Cheese, caraway Dairy and Egg Products

5 Cheese, caraway Dairy and Egg Products

6 Cheese, caraway Dairy and Egg Products

7 Cheese, caraway Dairy and Egg Products

8 Cheese, caraway Dairy and Egg Products

9 Cheese, caraway Dairy and Egg Products

In [70]: ndata.ix[30000]

Out[70]:

nutrient Glycine

nutgroup Amino Acids

units g

value 0.04

id 6158

food Soup, tomato bisque, canned, condensed

fgroup Soups, Sauces, and Gravies

manufacturer

Name: 30000, dtype: object

In [71]: result=ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)

import matplotlib
%matplotlib inline
result['Zinc, Zn'].order().plot(kind='barh')

In [85]: by_nutrient=ndata.groupby(['nutgroup','nutrient'])

In [86]: get_maxinum=lambda x:x.xs(x.value.idxmax())

In [87]: get_mininum=lambda x:x.xs(x.value.idxmin())

In [88]: max_foods=by_nutrient.apply(get_maxinum)[['value','food']]

In [89]: #讓food小一點

In [90]: max_foods.food=max_foods.food.str[:50]

In [91]: max_foods.ix['Amino Acids']['food']

Out[91]:

nutrient

Alanine Gelatins, dry powder, unsweetened

Arginine Seeds, sesame flour, low-fat

Aspartic acid Soy protein isolate

Cystine Seeds, cottonseed flour, low fat (glandless)

Glutamic acid Soy protein isolate

Glycine Gelatins, dry powder, unsweetened

Histidine Whale, beluga, meat, dried (Alaska Native)

Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...

Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...

Methionine Fish, cod, Atlantic, dried and salted

Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Proline Gelatins, dry powder, unsweetened

Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Tryptophan Sea lion, Steller, meat with fat (Alaska Native)

Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...

Name: food, dtype: object