爬取網易雲音樂評論過萬歌曲
阿新 • 發佈:2019-01-30
看到網上其他同學的思路是爬取所有歌單,然後篩選出評論過萬的歌曲。但我覺得不同歌單之間會有交叉,這種方式可能效率不高,而且可能會有漏網之魚。所以我準備爬取所有歌手,再爬取他們的熱門50單曲,從中篩選評論過萬的歌曲。現階段幾乎沒有歌手有超過50首評論過萬的歌曲,所以這種方法目前是可行的。
檢視歌手頁面,歌手被分成了華語男歌手、華語女歌手、歐美男歌手……共計15個板塊,板塊代號如下:
group = ['1001', '1002', '1003', '2001', '2002', '2003', '6001', '6002', '6003', '7001', '7002', '7003', '4001', '4002', '4003']
而每個板塊又按照首字母分成了27個子頁面(包括熱門歌手頁面),子頁面代號如下:
initial = ['0']
for i in range(65, 91):
initial.append(str(i))
15*27=405,我們要爬取405個歌手子頁面,可以利用上述代號拼接出這405個歌手子頁面連結:
urls = []
for g in group:
for i in initial:
url = 'http://music.163.com/discover/artist/cat?id=' + g + '&initial=' + i
urls.append(url)
然後就是用爬蟲從這些頁面上爬取歌手的id:
def get_artist (url):
k = 0
t = []
while True:
try:
resp = request.urlopen(url)
html = resp.read().decode('utf-8')
soup = bs(html, 'html.parser')
l = soup.find_all('a', class_='nm nm-icn f-thide s-fc0')
p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
for i in l:
t.append(re.match(p, i['href']).group(1))
return t
except Exception as e:
print(e)
k += 1
if k > 10:
print('頁面' + url + '發生錯誤')
return None
t = []
continue
獲得歌手id以後,再讓爬蟲爬取歌手的個人頁面,獲取熱門50單曲的歌曲id:
def get_song(artist_id):
k = 0
t = []
while True:
url = 'http://music.163.com/artist?id=' + artist_id
try:
req = request.Request(url)
req.add_header('User-Agent',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4399.400 QQBrowser/9.7.12777.400')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
resp = request.urlopen(req)
html = resp.read().decode('utf-8')
soup = bs(html, 'html.parser')
except Exception as e:
k += 1
if k > 10:
print('歌手' + artist_id + '發生錯誤')
print(e)
return None
continue
try:
a = soup.find('ul', class_='f-hide')
l = a.children
p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
for i in l:
music_id = re.match(p, i.a['href']).group(1)
data = (music_id, artist_id)
t.append(data)
return t
except Exception as e:
print(e)
print('歌手' + artist_id + '發生錯誤')
return None
利用歌曲id訪問歌曲頁面,獲取歌曲評論數,這裡遇到了難點。評論資訊都是動態載入的,直接獲取評論數的結果總是0,所以這裡借鑑了知乎使用者平胸小仙女的回答,方法如下:
# -*- coding: utf-8 -*-
from Crypto.Cipher import AES
import base64
import requests
import json
import codecs
import time
import random
#代理ip
proxy_host = '122.72.18.35'
proxy = {'http':proxy_host}
# 頭部資訊
headers={'Host':'music.163.com',
'Accept':'*/*',
'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding':'gzip, deflate',
'Content-Type':'application/x-www-form-urlencoded',
'Referer':'http://music.163.com/song?id=347597',
'Content-Length':'484',
'Cookie':'__s_=1; _ntes_nnid=f17890f7160fd145486752ebbf2066df,1505221478108; _ntes_nuid=f17890f7160fd145486752ebbf2066df; JSESSIONID-WYYY=Z99pE%2BatJVOAGco1d%2FJpojOK94Xe9GHqe0epcCOj23nqP2SlHt1XwzWQ2FXTwaM2xgIN628qJGj8%2BikzfYkv%2FXAUo%2FSzwMxjdyO9oeQlGKBvH6nYoFpJpVlA%2F8eP57fkZAVEsuB9wqkVgdQc2cjIStE1vyfE6SxKAlA8r0sAgOnEun%2BV%3A1512200032388; _iuqxldmzr_=32; __utma=94650624.1642739310.1512184312.1512184312.1512184312.1; __utmc=94650624; __utmz=94650624.1512184312.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); playerid=10841206',
'Connection':'keep-alive',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
# offset的取值為:(評論頁數-1)*20,total第一頁為true,其餘頁為false
first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
second_param = "010001"
third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
forth_param = "0CoJUm6Qyw8W8jud"
# 獲取引數
def get_params(page): # page為傳入頁數
iv = "0102030405060708"
first_key = forth_param
second_key = 16 * 'F'
if(page == 1): # 如果為第一頁
first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
h_encText = AES_encrypt(first_param, first_key, iv)
else:
offset = str((page-1)*20)
first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' %(offset,'false')
h_encText = AES_encrypt(first_param, first_key, iv)
h_encText = AES_encrypt(h_encText, second_key, iv)
return h_encText
# 獲取 encSecKey
def get_encSecKey():
encSecKey = "257348aecb5e556c066de214e531faadd1c55d814f9be95fd06d6bff9f4c7a41f831f6394d5a3fd2e3881736d94a02ca919d952872e7d0a50ebfa1769a7a62d512f5f1ca21aec60bc3819a9c3ffca5eca9a0dba6d6f7249b06f5965ecfff3695b54e1c28f3f624750ed39e7de08fc8493242e26dbc4484a01c76f739e135637c"
return encSecKey
# 加密過程(AES-CBC 加密並 base64 編碼,並非解密)
def AES_encrypt(text, key, iv):
pad = 16 - len(text) % 16
text = text + pad * chr(pad)
encryptor = AES.new(key, AES.MODE_CBC, iv)
encrypt_text = encryptor.encrypt(text)
encrypt_text = base64.b64encode(encrypt_text)
encrypt_text = str(encrypt_text, encoding="utf-8") #注意一定要加上這一句,沒有這一句則出現錯誤
return encrypt_text
def get_json(url, params, encSecKey):
data = {
"params": params,
"encSecKey": encSecKey
}
response = requests.post(url, headers=headers, data=data, proxies=proxy)
return response.content
#外部呼叫方法
def get_comments_total(id):
url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_'+str(id)+'?csrf_token='
params = get_params(1)
encSecKey = get_encSecKey()
json_text = get_json(url,params,encSecKey)
json_dict = json.loads(json_text)
comments_num = int(json_dict['total'])
return comments_num
最後再將獲得的資料逐條寫入資料庫就可以了
總的程式碼如下:
# _*_ coding: utf-8 _*_
from urllib import request
import requests
import json
from bs4 import BeautifulSoup as bs
from Crypto.Cipher import AES
import base64
import re
import mysql.connector
import get_comments_total as gct
import threading
group = ['1001', '1002', '1003', '2001', '2002', '2003', '6001', '6002', '6003', '7001', '7002', '7003', '4001', '4002',
'4003']
initial = ['0']
for i in range(65, 91):
initial.append(str(i))
urls = []
for g in group:
for i in initial:
url = 'http://music.163.com/discover/artist/cat?id=' + g + '&initial=' + i
urls.append(url)
#寫入資料庫
def write(L):
try:
conn = mysql.connector.connect(user='root', password='lixiao187.', database='cloudmusic', charset='utf8')
cursor = conn.cursor()
for l in L:
try:
cursor.execute(
'insert into music(music_id, music_name, artist_id, artist_name, comments) values (%s, %s, %s, %s, %s)',
l)
conn.commit()
except Exception as e:
print(e)
print(l)
continue
cursor.close()
conn.close()
except Exception as e:
print(e)
print(L)
# 獲得某字母頁面上的歌手id列表
def get_artist(url):
k = 0
t = []
while True:
try:
resp = request.urlopen(url)
html = resp.read().decode('utf-8')
soup = bs(html, 'html.parser')
l = soup.find_all('a', class_='nm nm-icn f-thide s-fc0')
p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
for i in l:
t.append(re.match(p, i['href']).group(1))
return t
except Exception as e:
print(e)
k += 1
if k > 10:
print('頁面' + url + '發生錯誤')
return None
t = []
continue
# 獲得某歌手的熱門歌曲id列表
def get_song(artist_id):
k = 0
t = []
while True:
url = 'http://music.163.com/artist?id=' + artist_id
try:
req = request.Request(url)
req.add_header('User-Agent',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4399.400 QQBrowser/9.7.12777.400')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
resp = request.urlopen(req)
html = resp.read().decode('utf-8')
soup = bs(html, 'html.parser')
except Exception as e:
k += 1
if k > 10:
print('歌手' + artist_id + '發生錯誤')
print(e)
return None
continue
try:
a = soup.find('ul', class_='f-hide')
l = a.children
p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
for i in l:
music_id = re.match(p, i.a['href']).group(1)
data = (music_id, artist_id)
t.append(data)
return t
except Exception as e:
print(e)
print('歌手' + artist_id + '發生錯誤')
return None
# 獲得全部所需資訊
def get_data(music_id, artist_id):
k = 0
while True:
try:
comments = gct.get_comments_total(music_id)
print('歌曲'+music_id+',評論數:'+str(comments))
if comments < 10000:
return None
url = 'http://music.163.com/song?id=' + music_id
data = []
req = request.Request(url)
req.add_header('User-Agent',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4399.400 QQBrowser/9.7.12777.400')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
resp = request.urlopen(req)
html = resp.read().decode('utf-8')
soup = bs(html, 'html.parser')
d = soup.find('div', class_='tit')
p = soup.find('p', class_='des s-fc4')
s = soup.find('span', class_='j-flag')
music_name = d.em.text
artist_name = p.span['title']
data.append(music_id)
data.append(music_name)
data.append(artist_id)
data.append(artist_name)
data.append(comments)
return data
except Exception as e:
k += 1
if k > 10:
print('歌曲' + music_id + '發生錯誤')
return None
continue
# 逐條寫入
def get_and_write(artists, name):
data = []
for a in artists:
songs = get_song(a)
if songs == None:
continue
for s in songs:
d = get_data(s[0], a)
if d == None:
continue
data.append(d)
if len(data) > 0:
write(data)
# 歌手子頁面爬取執行緒
def crawl(url, name):
L = []
artists = get_artist(url)
if artists == None:
return
for a in artists:
L.append(a)
if len(L) > 9:
t = threading.Thread(target=get_and_write, args=(L, ''))
t.start()
L = []
t = threading.Thread(target=get_and_write, args=(L, ''))
t.start()
# 總方法
def threads_crawl(start, end):
L = []
for i in range(start - 1, end):
t = threading.Thread(target=crawl, args=(urls[i], ''))
L.append(t)
for t in L:
t.start()
for t in L:
t.join()
threads_crawl(1, 405)