Scraping Traveler Reviews of Beijing South Railway Station from Qunar, with Word Clouds

阿新 • Published: 2019-01-15

Screenshot of the scraped page

[Image: screenshot of the scraped page]

Word cloud results

title

[Image: word cloud built from the review titles]

comment

[Image: word cloud built from the review comments]

Code

Data scraping

# -*- coding: utf-8 -*-
import urllib.request
from lxml import etree
import os

# Fetch a page and return its raw bytes
def get_page(url):
    # The with block closes the connection once the body has been read
    with urllib.request.urlopen(url) as page:
        return page.read()

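# Extract the title and the comment text from a single review page
def get_data_comment(html):
    selector = etree.HTML(html)
    # Title (falls back to an empty string if the node is missing)
    list_title = selector.xpath('//div[@class="comment_title"]/h2/text()')
    str_title = list_title[0] if list_title else ''
    # Comment paragraphs, joined with two spaces
    list_comment = selector.xpath('//div[@class="comment_content"]/p/text()')
    str_comment = ''
    for comment in list_comment:
        str_comment += comment + '  '
    print(str_title)
    return str_title.replace('\n', ''), str_comment.replace('\n', '')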

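# Write data to a local file (used for the word cloud)
def write_data_wc(item_name, write_str):
    print(write_str)
    path_file = "./data/wc/" + item_name + ".txt"
    # Create ./data/wc/ if it does not exist yet
    os.makedirs(os.path.dirname(path_file), exist_ok=True)
    with open(path_file, 'w', encoding='utf8') as file:
        file.write(write_str)

# Append data to the local data file
def write_data(write_str):
    path_file = "./data/data.txt"
    # Create ./data/ if it does not exist yet
    os.makedirs(os.path.dirname(path_file), exist_ok=True)
    with open(path_file, 'a', encoding='utf8') as file:
        file.write(write_str)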

# Crawl all review pages
def craw(root_url):
    # Delete the data file if it already exists
    path_file = "./data/data.txt"
    if os.path.exists(path_file):
        os.remove(path_file)

    html = get_page(root_url)
    selector = etree.HTML(html)
    # Get the total number of pages from the pagination bar
    str_total_num = selector.xpath('//div[@class="b_paging"]/a[last()-1]/text()')[0]
    total_num = int(str_total_num)

    # Build the URL of each list page
    url_front = 'http://travel.qunar.com/p-oi5420182-beijingnanzhan-1-'
    list_url = []
    for i in range(1, total_num + 1):
        list_url.append(url_front + str(i))

    # Collect the review URLs from every list page
    list_url_comment_page = []
    for url in list_url:
        html = get_page(url)
        selector = etree.HTML(html)
        list_url_comment_pre_page = selector.xpath('//ul[@id="comment_box"]/li/div[@class="e_comment_main"]//div[@class="e_comment_title"]/a/@href')
        list_url_comment_page.extend(list_url_comment_pre_page)

    # print(list_url_comment_page)
    # Fetch the content of each review
    str_title = ''
    str_comment = ''
    for url in list_url_comment_page:
        print(url)
        html = get_page(url)
        str_title_pre, str_comment_pre = get_data_comment(html)
        # Save the review to the local data file
        str_write = '@@[email protected]@' + '\t\t' + str_title_pre + '\n' + '@@[email protected]@' + '\t\t' + str_comment_pre + '\n' + '--------------------------------------------' + '\n'
        write_data(str_write)

        str_title += str_title_pre + '  '
        str_comment += str_comment_pre + '  '
    # Save the accumulated text locally (used for the word cloud)
    write_data_wc('title', str_title.replace('\n', '').replace(' ', ''))
    write_data_wc('comment', str_comment.replace('\n', ''))

if __name__ == '__main__':
    craw(root_url = 'http://travel.qunar.com/p-oi5420182-beijingnanzhan')
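
The crawler above calls urlopen with no timeout, no request headers and no error handling, so a single failed request stops the whole run. Below is a minimal sketch of a more defensive fetch, using only the standard library. The helper names build_page_urls and get_page_with_retry, the User-Agent string and the retry settings are assumptions made for illustration and are not part of the original script; whether Qunar actually requires a browser-like User-Agent has not been verified here.

```python
import time
import urllib.request

# Hypothetical header value; any browser-like string could be used instead.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; review-crawler-sketch)'}


def build_page_urls(root_url, total_num):
    """Return the list-page URLs root_url-1-1 ... root_url-1-<total_num>."""
    return [root_url + '-1-' + str(i) for i in range(1, total_num + 1)]


def get_page_with_retry(url, retries=3, delay=1.0, timeout=10):
    """Fetch url and return its bytes, or None if every attempt fails."""
    request = urllib.request.Request(url, headers=DEFAULT_HEADERS)
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(request, timeout=timeout) as page:
                return page.read()
        except OSError as err:
            # URLError, HTTPError and socket timeouts are all OSError subclasses
            print('attempt %d failed for %s: %s' % (attempt, url, err))
            if attempt < retries:
                time.sleep(delay)
    return None
```

With this sketch, get_page_with_retry could stand in for get_page, provided the caller skips a page when None comes back. build_page_urls yields the same URLs as the url_front loop in craw, since url_front is the root URL followed by '-1-'.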

Word cloud

# -*- coding: utf-8 -*-
from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import jieba

# Read a local text file
def load_txt(item_name):
    with open('./data/wc/' + item_name + '.txt', 'r', encoding='utf8') as file_item:
        str_item = file_item.read()
    return str_item

# Word segmentation (jieba full mode)
def fenci(str_text):
    seg_list = list(jieba.cut(str_text, cut_all=True))
    return seg_list

# Count keyword frequencies
def count_keywords(item_name):
    str_item = load_txt(item_name)
    list_keywords = fenci(str_item)

    dict_keywords_item = {}
    for keyword in list_keywords:
        # Keep only tokens longer than one character
        if len(keyword) > 1:
            if keyword not in dict_keywords_item:
                dict_keywords_item[keyword] = 1
            else:
                dict_keywords_item[keyword] += 1

    return dict_keywords_item

# Generate the word cloud
def wordcloud(item_name, mask_img):
    dict_keywords_item = count_keywords(item_name)
    image = Image.open("./img/mask/" + mask_img)
    graph = np.array(image)
    # Note: per the wordcloud docs, stopwords are ignored by generate_from_frequencies
    wc = WordCloud(font_path='./fonts/MSYH.TTC', background_color="black", max_words=50, mask=graph,
                   stopwords=set(STOPWORDS))
    wc.generate_from_frequencies(dict_keywords_item)
    # Plot the word cloud, then the mask image in a second figure
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.figure()
    plt.imshow(graph, cmap=plt.cm.gray, interpolation='bilinear')
    plt.axis("off")
    # plt.show()

    # Save the word cloud to a file
    wc.to_file(path.join("./img", item_name + '.png'))

# Entry point
if __name__ == '__main__':
    wordcloud('title', 'mask.png')
    wordcloud('comment', 'mask.png')
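
The counting loop in count_keywords can also be written with collections.Counter. The sketch below assumes the tokens have already been produced by a segmenter such as jieba; the name count_tokens and its min_len parameter are hypothetical and do not appear in the original script. Unlike the original loop, it strips whitespace before checking the length, so a token made up only of spaces is dropped.

```python
from collections import Counter


def count_tokens(tokens, min_len=2):
    """Count tokens that still have at least min_len characters after stripping."""
    cleaned = (token.strip() for token in tokens)
    return Counter(token for token in cleaned if len(token) >= min_len)
```

Because Counter is a dict subclass, the result should be usable wherever the script passes dict_keywords_item, and its most_common method gives a quick way to inspect the top words before drawing the cloud.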
