
Using Python to Reclassify Product Attributes and Output a Multi-Level Nested Dictionary

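The title is a bit long and probably doesn't explain itself well, but the idea is roughly this. When you browse a shopping site such as Tmall, you see categories like "Women's Clothing", "Men's Shoes", and "Phones". Click into one and you find the matching brands: under women's clothing there are brands like "snidle" and "伊芙麗", and under men's shoes there are "nike", "adidas", and so on. If a user searches for nike, the suggested label should include "Men's Shoes"; put simply, a hint such as "Search for nike in Men's Shoes" pops up below the search box. What this post is about is predicting, for a brand the user types in, the probability of each corresponding first-level category.
Unfortunately I don't have Tmall's data, only my company's, which certainly can't be leaked, and making up data is tedious. So I won't cover how to compute these probabilities with a machine learning algorithm here. It isn't hard, though; when I have time I'll write a crawler to collect the data and then write it up, hehe.
Anyway, the finished prediction data should look like this: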
Figure: predicted values for the second-level classification of Tmall products

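How do you read this table? The first row holds the first-level categories, and the first column holds the second-level categories (the brands). Taking the third row as an example, the brand "scofield" has a probability of 0.87473829 of being classified as "女裝/內衣" (Women's Clothing/Underwear), 0.03394293 for "女鞋/男鞋/箱包" (Women's Shoes/Men's Shoes/Bags), and 0.21392374 for "化妝品/個人護理" (Cosmetics/Personal Care). So if you search for "scofield" in Tmall's search box, the hint most likely to pop up is "Search for scofield in 女裝/內衣".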
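Reading a row this way amounts to picking the category with the largest predicted value. Here is a minimal Python 3 sketch using the three scofield values quoted above; the names `scofield` and `best` are illustrative only and do not come from the original script.

```python
# Predicted values for the brand "scofield", copied from the row discussed above.
scofield = {
    "女裝/內衣": 0.87473829,
    "女鞋/男鞋/箱包": 0.03394293,
    "化妝品/個人護理": 0.21392374,
}

# The suggested first-level category is the key with the highest value.
best = max(scofield, key=scofield.get)
print(best)
```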
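But this table has a flaw: it contains too many zeros and it isn't sorted, so it looks messy. We will therefore use Python dictionaries to clean it up and sort it.
Enough talk, here is the code: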

#coding:utf-8
# Note: this is Python 2 code (print statement, reload(sys), basestring, iteritems).
import numpy as np
import pandas as pd
from odo import odo
from odo import convert
import json
from operator import itemgetter
import collections
from collections import OrderedDict
import sys
reload(sys)
sys.setdefaultencoding('utf8')

# Load the dataset
result = pd.read_table('tmalltest.txt', header=None)
listall = odo(result, list)
result1 = pd.read_table('tmalltest.txt')
result2 = result1.drop('class', axis=1)
listvalue = odo(result2, list)
count = len(range(result.shape[1]))
id = result.iloc[0, 1:16]   # first-level category names (header row)
out = [result.iloc[y, 0] for y in range(1, result.shape[0])]   # brand names (first column)
outid = tuple(out)
d = [dict(zip(id, tuple(listvalue[i]))) for i in range(0, len(listvalue))]

# Swap the keys and values of each dictionary
func = lambda b: dict([(x, y) for y, x in b.items()])
dd = [func(d[i]) for i in range(len(d))]
# Remove the entry whose key is 0
delete = [dd[i].pop(0.0) for i in range(len(d))]
# Swap the keys and values back
ddvalue = [func(dd[i]) for i in range(len(d))]
# Combine the two lists into a dict
dictall = dict(zip(out, ddvalue))
# Decode so that Chinese text is readable on the console
print json.dumps(dictall).decode("unicode-escape")

# Sort each inner dictionary by value and collect the results in a new list
lista = []
for k in dictall.keys():
    sorted_d = sorted(dictall[k].iteritems(), key=itemgetter(1), reverse=True)
    print sorted_d
    lista.append(sorted_d)
# Keep only the three categories with the highest predicted values
listb = [lista[i][0:3] for i in range(len(lista))]
listc = [json.dumps(listb[i]).decode("unicode-escape") for i in range(len(listb))]
# The second-level entries can also be kept in order with OrderedDict
dictorder = [OrderedDict(lista[i]) for i in range(0, len(lista))]
print json.dumps(dictorder).decode("unicode-escape")
# Recombine the sorted lists into a dictionary
dictall_sort = dict(zip(dictall.keys(), listc))

# Helper that prints a nested dictionary in a more readable layout
def pretty_dict(obj, indent=' '):
    def _pretty(obj, indent):
        for i, tup in enumerate(obj.items()):
            k, v = tup
            # Wrap strings in double quotes
            if isinstance(k, basestring): k = '"%s"' % k
            if isinstance(v, basestring): v = '"%s"' % v
            # Recurse into nested dictionaries
            if isinstance(v, dict):
                v = ''.join(_pretty(v, indent + ' ' * len(str(k) + ': {')))  # indent for the next level
            # Decide what to emit based on the position of the (k, v) pair
            if i == 0:  # first item: add the opening brace
                if len(obj) == 1:
                    yield '{%s: %s}' % (k, v)
                else:
                    yield '{%s: %s,\n' % (k, v)
            elif i == len(obj) - 1:  # last item: add the closing brace
                yield '%s%s: %s}' % (indent, k, v)
            else:  # middle items
                yield '%s%s: %s,\n' % (indent, k, v)
    print ''.join(_pretty(obj, indent))

# Original unsorted dictionary, pretty-printed (pretty_dict prints by itself and returns None)
pretty_dict(dictall)
# Sorted dictionary, before pretty-printing
print json.dumps(dictall_sort).decode("unicode-escape")
# Sorted dictionary, after pretty-printing
pretty_dict(dictall_sort)
Output:
# Original unsorted dictionary, after pretty-printing
{"太平鳥": {"男裝/戶外運動/": 0.847823719,
               "家紡/家飾/鮮花": "0",
               "化妝品/個人護理": 0.11242904,
               "腕錶/珠寶飾品/眼鏡": 0.05923729},
 "博士倫": {"家紡/家飾/鮮花": "0",
               "化妝品/個人護理": 0.11323213,
               "醫藥保健": 0.89348974},
 "a02": {"家紡/家飾/鮮花": "0",
         "女裝/內衣": 0.984447322,
         "女鞋/男鞋/箱包": 0.12493492},
 "周黑鴨": {"零食/進口食品/茶酒": 0.87323123,
               "家紡/家飾/鮮花": "0",
               "廚具/收納/寵物": 0.12432232},
 "3M": {"家紡/家飾/鮮花": "0",
        "廚具/收納/寵物": 0.32344534,
        "家居建材": 0.68213814},
 "博士": {"家紡/家飾/鮮花": "0"},
 "sk-II": {"家紡/家飾/鮮花": "0",
           "化妝品/個人護理": 0.98843487,
           "腕錶/珠寶飾品/眼鏡": 0.02324442},
 "洗潔精": {"圖書音像": 0.02124194,
               "家紡/家飾/鮮花": "0"},
 "finity": {"家紡/家飾/鮮花": "0",
            "女裝/內衣": 0.93392424,
            "女鞋/男鞋/箱包": 0.07323483},
 "selected": {"男裝/戶外運動/": 0.934439842,
              "家紡/家飾/鮮花": "0",
              "女鞋/男鞋/箱包": 0.07438472},
 "scofield": {"家紡/家飾/鮮花": "0",
              "化妝品/個人護理": 0.21392374,
              "女鞋/男鞋/箱包": 0.03394293,
              "女裝/內衣": 0.87473829},
 "米其林": {"家紡/家飾/鮮花": "0.02432412",
               "汽車/配件/用品": 0.98233342},
 "好奇": {"零食/進口食品/茶酒": 0.11321412,
            "母嬰玩具": 0.89472934,
            "家紡/家飾/鮮花": "0"},
 "佐卡伊": {"家紡/家飾/鮮花": "0",
               "化妝品/個人護理": 0.13232944,
               "腕錶/珠寶飾品/眼鏡": 0.87342324},
 "波司登": {"母嬰玩具": 0.02134243,
               "家紡/家飾/鮮花": "0",
               "女裝/內衣": 0.78765673,
               "化妝品/個人護理": 0.20183924},
 "breadbutter": {"零食/進口食品/茶酒": 0.29434974,
                 "家紡/家飾/鮮花": "0",
                 "女鞋/男鞋/箱包": 0.03329473,
                 "女裝/內衣": 0.684728232},
 "北極絨": {"家紡/家飾/鮮花": "0.84932498",
               "大家電/生活電器": 0.05213923,
               "家居建材": 0.11321321},
 "Adidas": {"男裝/戶外運動/": 0.829743434,
            "家紡/家飾/鮮花": "0",
            "女鞋/男鞋/箱包": 0.14974892,
            "手機/數碼/電腦辦公": 0.04232553},
 "噹噹網": {"圖書音像": 0.78947234,
               "家紡/家飾/鮮花": "0"},
 "snidle": {"家紡/家飾/鮮花": "0",
            "女裝/內衣": 0.83927289,
            "女鞋/男鞋/箱包": 0.15237234,
            "腕錶/珠寶飾品/眼鏡": 0.02432324},
 "TISSOT": {"家紡/家飾/鮮花": "0",
            "大家電/生活電器": 0.13942309,
            "腕錶/珠寶飾品/眼鏡": 0.87545234},
 "曼妮芬": {"家紡/家飾/鮮花": "0",
               "化妝品/個人護理": 0.07239742,
               "女裝/內衣": 0.93837427},
 "New Balance": {"母嬰玩具": 0.43237442,
                 "家紡/家飾/鮮花": "0",
                 "女鞋/男鞋/箱包": 0.57823432},
 "Jackjones": {"男裝/戶外運動/": 0.883293743,
               "家紡/家飾/鮮花": "0",
               "女鞋/男鞋/箱包": 0.10343298,
               "手機/數碼/電腦辦公": 0.02234927},
 "ZARA": {"女鞋/男鞋/箱包": 0.12429483,
          "家紡/家飾/鮮花": "0",
          "女裝/內衣": 0.78283128,
          "腕錶/珠寶飾品/眼鏡": 0.10213943},
 "海爾": {"家紡/家飾/鮮花": "0。1323243",
            "廚具/收納/寵物": 0.09354832,
            "大家電/生活電器": 0.79103821},
 "nike": {"男裝/戶外運動/": 0.891232313,
          "家紡/家飾/鮮花": "0",
          "化妝品/個人護理": 0.06163211,
          "手機/數碼/電腦辦公": 0.04293713},
 "雙立人": {"家紡/家飾/鮮花": "0",
               "廚具/收納/寵物": 0.98943242,
               "醫藥保健": 0.01943242},
 "蘋果": {"手機/數碼/電腦辦公": 0.89232342,
            "家紡/家飾/鮮花": "0",
            "汽車/配件/用品": 0.05293713,
            "腕錶/珠寶飾品/眼鏡": 0.05230971},
 "蘭芝": {"家紡/家飾/鮮花": "0",
            "女裝/內衣": 0.09238374,
            "化妝品/個人護理": 0.78423234,
            "腕錶/珠寶飾品/眼鏡": 0.13213232}}

# Sorted dictionary, before pretty-printing
{"太平鳥": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.8478237190000001], ["化妝品/個人護理", 0.11242904]]", "博士倫": "[["家紡/家飾/鮮花", "0"], ["醫藥保健", 0.89348974], ["化妝品/個人護理", 0.11323213]]", "a02": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.9844473220000001], ["女鞋/男鞋/箱包", 0.12493492]]", "周黑鴨": "[["家紡/家飾/鮮花", "0"], ["零食/進口食品/茶酒", 0.87323123], ["廚具/收納/寵物", 0.12432232]]", "3M": "[["家紡/家飾/鮮花", "0"], ["家居建材", 0.68213814], ["廚具/收納/寵物", 0.32344534]]", "博士": "[["家紡/家飾/鮮花", "0"]]", "sk-II": "[["家紡/家飾/鮮花", "0"], ["化妝品/個人護理", 0.98843487], ["腕錶/珠寶飾品/眼鏡", 0.02324442]]", "洗潔精": "[["家紡/家飾/鮮花", "0"], ["圖書音像", 0.02124194]]", "finity": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.93392424], ["女鞋/男鞋/箱包", 0.07323483]]", "selected": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.9344398420000001], ["女鞋/男鞋/箱包", 0.07438472]]", "scofield": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.87473829], ["化妝品/個人護理", 0.21392374]]", "米其林": "[["家紡/家飾/鮮花", "0.02432412"], ["汽車/配件/用品", 0.98233342]]", "好奇": "[["家紡/家飾/鮮花", "0"], ["母嬰玩具", 0.89472934], ["零食/進口食品/茶酒", 0.11321412]]", "佐卡伊": "[["家紡/家飾/鮮花", "0"], ["腕錶/珠寶飾品/眼鏡", 0.87342324], ["化妝品/個人護理", 0.13232944]]", "波司登": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.78765673], ["化妝品/個人護理", 0.20183924]]", "breadbutter": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.684728232], ["零食/進口食品/茶酒", 0.29434974]]", "北極絨": "[["家紡/家飾/鮮花", "0.84932498"], ["家居建材", 0.11321321], ["大家電/生活電器", 0.05213923]]", "Adidas": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.8297434340000001], ["女鞋/男鞋/箱包", 0.14974892]]", "噹噹網": "[["家紡/家飾/鮮花", "0"], ["圖書音像", 0.78947234]]", "snidle": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.83927289], ["女鞋/男鞋/箱包", 0.15237234]]", "TISSOT": "[["家紡/家飾/鮮花", "0"], ["腕錶/珠寶飾品/眼鏡", 0.87545234], ["大家電/生活電器", 0.13942309]]", "曼妮芬": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.93837427], ["化妝品/個人護理", 0.07239742]]", "New Balance": "[["家紡/家飾/鮮花", "0"], ["女鞋/男鞋/箱包", 0.57823432], ["母嬰玩具", 0.43237442]]", "Jackjones": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.883293743], ["女鞋/男鞋/箱包", 0.10343298]]", "ZARA": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.78283128], ["女鞋/男鞋/箱包", 0.12429483]]", "海爾": "[["家紡/家飾/鮮花", 
"0。1323243"], ["大家電/生活電器", 0.79103821], ["廚具/收納/寵物", 0.09354832]]", "nike": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.8912323129999999], ["化妝品/個人護理", 0.06163211]]", "雙立人": "[["家紡/家飾/鮮花", "0"], ["廚具/收納/寵物", 0.98943242], ["醫藥保健", 0.01943242]]", "蘋果": "[["家紡/家飾/鮮花", "0"], ["手機/數碼/電腦辦公", 0.89232342], ["汽車/配件/用品", 0.05293713]]", "蘭芝": "[["家紡/家飾/鮮花", "0"], ["化妝品/個人護理", 0.78423234], ["腕錶/珠寶飾品/眼鏡", 0.13213232]]"}

# Sorted dictionary, after pretty-printing
{"太平鳥": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.8478237190000001], ["化妝品/個人護理", 0.11242904]]",
 "博士倫": "[["家紡/家飾/鮮花", "0"], ["醫藥保健", 0.89348974], ["化妝品/個人護理", 0.11323213]]",
 "a02": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.9844473220000001], ["女鞋/男鞋/箱包", 0.12493492]]",
 "周黑鴨": "[["家紡/家飾/鮮花", "0"], ["零食/進口食品/茶酒", 0.87323123], ["廚具/收納/寵物", 0.12432232]]",
 "3M": "[["家紡/家飾/鮮花", "0"], ["家居建材", 0.68213814], ["廚具/收納/寵物", 0.32344534]]",
 "博士": "[["家紡/家飾/鮮花", "0"]]",
 "sk-II": "[["家紡/家飾/鮮花", "0"], ["化妝品/個人護理", 0.98843487], ["腕錶/珠寶飾品/眼鏡", 0.02324442]]",
 "洗潔精": "[["家紡/家飾/鮮花", "0"], ["圖書音像", 0.02124194]]",
 "finity": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.93392424], ["女鞋/男鞋/箱包", 0.07323483]]",
 "selected": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.9344398420000001], ["女鞋/男鞋/箱包", 0.07438472]]",
 "scofield": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.87473829], ["化妝品/個人護理", 0.21392374]]",
 "米其林": "[["家紡/家飾/鮮花", "0.02432412"], ["汽車/配件/用品", 0.98233342]]",
 "好奇": "[["家紡/家飾/鮮花", "0"], ["母嬰玩具", 0.89472934], ["零食/進口食品/茶酒", 0.11321412]]",
 "佐卡伊": "[["家紡/家飾/鮮花", "0"], ["腕錶/珠寶飾品/眼鏡", 0.87342324], ["化妝品/個人護理", 0.13232944]]",
 "波司登": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.78765673], ["化妝品/個人護理", 0.20183924]]",
 "breadbutter": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.684728232], ["零食/進口食品/茶酒", 0.29434974]]",
 "北極絨": "[["家紡/家飾/鮮花", "0.84932498"], ["家居建材", 0.11321321], ["大家電/生活電器", 0.05213923]]",
 "Adidas": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.8297434340000001], ["女鞋/男鞋/箱包", 0.14974892]]",
 "噹噹網": "[["家紡/家飾/鮮花", "0"], ["圖書音像", 0.78947234]]",
 "snidle": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.83927289], ["女鞋/男鞋/箱包", 0.15237234]]",
 "TISSOT": "[["家紡/家飾/鮮花", "0"], ["腕錶/珠寶飾品/眼鏡", 0.87545234], ["大家電/生活電器", 0.13942309]]",
 "曼妮芬": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.93837427], ["化妝品/個人護理", 0.07239742]]",
 "New Balance": "[["家紡/家飾/鮮花", "0"], ["女鞋/男鞋/箱包", 0.57823432], ["母嬰玩具", 0.43237442]]",
 "Jackjones": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.883293743], ["女鞋/男鞋/箱包", 0.10343298]]",
 "ZARA": "[["家紡/家飾/鮮花", "0"], ["女裝/內衣", 0.78283128], ["女鞋/男鞋/箱包", 0.12429483]]",
 "海爾": "[["家紡/家飾/鮮花", "01323243"], ["大家電/生活電器", 0.79103821], ["廚具/收納/寵物", 0.09354832]]",
 "nike": "[["家紡/家飾/鮮花", "0"], ["男裝/戶外運動/", 0.8912323129999999], ["化妝品/個人護理", 0.06163211]]",
 "雙立人": "[["家紡/家飾/鮮花", "0"], ["廚具/收納/寵物", 0.98943242], ["醫藥保健", 0.01943242]]",
 "蘋果": "[["家紡/家飾/鮮花", "0"], ["手機/數碼/電腦辦公", 0.89232342], ["汽車/配件/用品", 0.05293713]]",
 "蘭芝": "[["家紡/家飾/鮮花", "0"], ["化妝品/個人護理", 0.78423234], ["腕錶/珠寶飾品/眼鏡", 0.13213232]]"}
The result doesn't display very well here; on Linux the output is actually quite clear, see the image:

Figure: sorted results of the predicted second-level classification of Tmall products
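As a possible alternative for readers on Python 3, the standard library's `json.dumps` can produce an indented, human-readable dump of a nested dictionary without a hand-written helper. The sketch below is only an illustration: `nested` is a small hypothetical sample shaped like the script's `dictall`, with values taken from the output above. `ensure_ascii=False` keeps Chinese characters readable instead of escaping them, and `indent=2` puts each key on its own line.

```python
import json

# Hypothetical sample in the same shape as `dictall` (brand -> category -> value).
nested = {
    "太平鳥": {"男裝/戶外運動/": 0.847823719, "化妝品/個人護理": 0.11242904},
    "3M": {"家居建材": 0.68213814, "廚具/收納/寵物": 0.32344534},
}

# ensure_ascii=False leaves non-ASCII characters unescaped; indent=2 adds line breaks.
text = json.dumps(nested, ensure_ascii=False, indent=2)
print(text)
```

The layout differs from the aligned style of `pretty_dict` (nested keys are indented by a fixed width rather than aligned under the parent key), so this is a trade of exact formatting for less code.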

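The difficult parts here are printing a multi-level nested dictionary in Python and deleting entries from a dictionary, in this case the entries whose value is 0. At first I extracted the values into a list and deleted the zeros there, but at least one 0 always remained afterwards. Then I realized I could swap each dictionary's keys and values and use dict.pop to remove the entry whose key is 0. The second difficulty is sorting a multi-level nested dictionary. In Python 2, which this script targets, dictionaries are unordered, so the only option is to sort each dictionary's items by value, store the sorted result in a list, and then combine that list with the corresponding list of keys into a new dictionary, which makes things much easier.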
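The two steps described above can also be sketched in Python 3 without swapping keys and values. This is not the original script's method: the swap works because all zero values collapse onto a single key, but for the same reason it would silently merge any two categories that happen to share the same non-zero value, whereas filtering by value avoids that. The helper names `drop_zeros` and `top_n` and the `sample` data are assumptions made for illustration; the non-zero numbers are taken from the nike row of the output.

```python
def drop_zeros(scores):
    """Return a copy of `scores` without the entries whose value equals zero."""
    return {k: v for k, v in scores.items() if v != 0}


def top_n(scores, n=3):
    """Return the n (category, value) pairs with the largest values, largest first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical sample shaped like `dictall`, with numeric zeros for unused categories.
sample = {
    "nike": {
        "男裝/戶外運動/": 0.891232313,
        "家紡/家飾/鮮花": 0.0,
        "化妝品/個人護理": 0.06163211,
        "手機/數碼/電腦辦公": 0.04293713,
    },
    "博士": {"家紡/家飾/鮮花": 0.0},
}

# Filter, then rank, every brand's inner dictionary.
ranked = {brand: top_n(drop_zeros(scores)) for brand, scores in sample.items()}
print(ranked["nike"])
```

Keeping the result as a list of pairs preserves the ranking explicitly, which matches the approach of storing the sorted items in a list before rebuilding the outer dictionary.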
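This is a record of last week's work so that I can come back to it when I forget. If you have a better approach, feel free to share~
PS: this Tmall data was made up by me; I can share it if anyone needs it = =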