
Generating a Word Cloud from Scraped JSON Content (A Pit-Filled Journey)


This post scrapes the titles of the first n pages of front-end search results on Juejin. Analyzing the article titles shows what people care about in front-end development and what the recent hot topics are.

  1. Import the libraries
    import requests
    import re
    from bs4 import BeautifulSoup
    import json
    import urllib.request
    import jieba
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    import numpy as np
    import xlwt
    import jieba.analyse
    from PIL import Image, ImageSequence

  2. Scrape the JSON
    # Fetch the JSON behind the dynamic page
    response = urllib.request.urlopen(ajaxUrl)
    ajaxres = response.read().decode('utf-8')
    data = json.loads(ajaxres)  # parse the JSON text straight into a dict
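The parsing step can be exercised without any network access. A single `json.loads` call turns the response text into a dict; there is no need for a `json.dumps` → `json.loads` → `eval` round-trip. A minimal sketch, using a hand-written miniature payload (the real response from the search endpoint has the same top-level shape, a `"d"` list of hits):

```python
import json

# Hand-written miniature of the ajax response body; the real payload
# has the same top-level shape: a "d" list of result objects.
ajaxres = '{"d": [{"title": "Vue source dive"}, {"title": "React hooks"}]}'

# One call is enough: json.loads parses the JSON text into a dict.
data = json.loads(ajaxres)
print(data["d"][0]["title"])  # -> Vue source dive

# Round-tripping through json.dumps and then eval-ing the result is both
# redundant and unsafe: eval executes scraped text as Python code.
```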

  3. Loop over the titles and append them to a file
    for i in range(0, 25):
        ajaxUrl = ajaxUrlBegin + str(i) + ajaxUrlLast
        response = urllib.request.urlopen(ajaxUrl)
        data = json.loads(response.read().decode('utf-8'))
        for j in range(0, 19):
            result = data['d'][j]['title']
            print(result + '\n')
            f = open('finally.txt', 'a', encoding='utf-8')
            f.write(result)
            f.close()

  4. Generate the word cloud
    # Word-frequency count
    f = open('finally.txt', 'r', encoding='utf-8')
    text = f.read()
    stringList = list(jieba.cut(text))
    # Punctuation to strip (several full-width characters in the original
    # post were lost to encoding problems)
    symbol = {"/", "(", ")", " ", "+", "?"}
    stringSet = set(stringList) - symbol
    title_dict = {}
    for i in stringSet:
        title_dict[i] = stringList.count(i)
    print(title_dict)

    # Export to Excel
    di = title_dict
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")  # sheet name
    k = 0
    for i in di.items():
        sheet.write(k, 0, label=i[0])
        sheet.write(k, 1, label=i[1])
        k = k + 1
    wbk.save('前端數據.xls')  # save the counts as an .xls file

    font = r'C:\Windows\Fonts\simhei.ttf'
    content = ' '.join(title_dict.keys())
    # Generate the word cloud, shaped by an image mask
    image = np.array(Image.open('cool.jpg'))
    wordcloud = WordCloud(background_color='white', font_path=font, mask=image, width=1000, height=860, margin=2).generate(content)
    # Show the generated word cloud and save it
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('c-cool.jpg')

  5. One project, n pits, and every pit can trap you for ten thousand years
  • Getting the actual content of a dynamic page

   When scraping a dynamic page, the titles cannot be found directly in the HTML. You have to dig through the Network tab in the browser's developer tools, where you will find the JSON data returned by the page's ajax requests.

  • Pulling a specific value out of the JSON

    After we fetch the JSON data (via its URL), we discover that it looks like this...


(wtf, what on earth is this???)

At this point we can install the Chrome extension JSONview, and once it is on, the response finally speaks a human language!
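If you would rather not rely on a browser extension, `json.dumps` with `indent` gives the same readable view straight from Python. A sketch, using a made-up miniature of the raw one-line response:

```python
import json

# Made-up miniature of the raw one-line response body shown in the browser.
raw = '{"d":[{"title":"前端性能優化"},{"title":"CSS 佈局"}]}'

# indent pretty-prints the parsed structure; ensure_ascii=False keeps the
# Chinese characters readable instead of turning them into \uXXXX escapes.
print(json.dumps(json.loads(raw), indent=2, ensure_ascii=False))
```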


  • Next up: installing wordCloud

   I won't go over this one (anything I'd say would just be one more useless answer from the internet). If you want to know how to solve it, head over to Lao Huang's blog, three doors down. (Fanta rocks.)

  6. Complete code
    import requests
    import re
    from bs4 import BeautifulSoup
    import json
    import urllib.request
    import jieba
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    import numpy as np
    import xlwt
    import jieba.analyse
    from PIL import Image, ImageSequence
    
    url = 'https://juejin.im/search?query=前端'
    res = requests.get(url)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")
    
    # Build the ajax URL for each of the first 25 result pages
    ajaxUrlBegin = 'https://search-merger-ms.juejin.im/v1/search?query=%E5%89%8D%E7%AB%AF&page='
    ajaxUrlLast = '&raw_result=false&src=web'
    
    for i in range(0, 25):
        ajaxUrl = ajaxUrlBegin + str(i) + ajaxUrlLast
        # Scrape this page's JSON from the dynamic site
        response = urllib.request.urlopen(ajaxUrl)
        ajaxres = response.read().decode('utf-8')
        data = json.loads(ajaxres)  # parse the JSON text into a dict
        # Append each title on the page to the output file
        for j in range(0, 19):
            result = data['d'][j]['title']
            print(result + '\n')
            f = open('finally.txt', 'a', encoding='utf-8')
            f.write(result)
            f.close()
    
    # Word-frequency count
    f = open('finally.txt', 'r', encoding='utf-8')
    text = f.read()
    stringList = list(jieba.cut(text))
    # Punctuation to strip (several full-width characters in the original
    # post were lost to encoding problems)
    symbol = {"/", "(", ")", " ", "+", "?"}
    stringSet = set(stringList) - symbol
    title_dict = {}
    for i in stringSet:
        title_dict[i] = stringList.count(i)
    print(title_dict)
    
    # Export to Excel
    di = title_dict
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")  # sheet name
    k = 0
    for i in di.items():
        sheet.write(k, 0, label=i[0])
        sheet.write(k, 1, label=i[1])
        k = k + 1
    wbk.save('前端數據.xls')  # save the counts as an .xls file
    
    font = r'C:\Windows\Fonts\simhei.ttf'
    content = ' '.join(title_dict.keys())
    # Generate the word cloud, shaped by an image mask
    image = np.array(Image.open('cool.jpg'))
    wordcloud = WordCloud(background_color='white', font_path=font, mask=image, width=1000, height=860, margin=2).generate(content)
    # Show the generated word cloud and save it
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('c-cool.jpg')


(word cloud image)
