1. 程式人生 >> scrapy實戰爬取cl社群評論數超過設定值的連結

scrapy實戰爬取cl社群評論數超過設定值的連結

1、建立scrapy專案

scrapy startproject cl

2、前戲

  a、註釋爬蟲檔案中的allowed_domains

  b、settings.py第22行,ROBOTSTXT_OBEY = True改為ROBOTSTXT_OBEY = False

  c、settings.py第19行,改為USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

  d、開啟管道:67-69行,

  ITEM_PIPELINES = {
       'mytestscrapy.pipelines.MytestscrapyPipeline': 300,
    }

3、cl.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from mytestscrapy.items import MytestscrapyItem
import time
import random

class TestCLSpider(scrapy.Spider):
    """Crawl the forum list pages (pages 1-30) and yield an item for every
    thread whose comment count is at least 4."""
    name = 'cl'
    # Deliberately commented out: start_urls is NOT under this domain, and an
    # active allowed_domains would make OffsiteMiddleware drop every request
    # (see step 2a of the write-up above).
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://cc.yyss.icu/thread0806.php?fid=2&search=&page=1']
    print("第1頁開始")
    # Template used to build the URL of pages 2..30.
    url = 'https://cc.yyss.icu/thread0806.php?fid=2&search=&page=%d'
    pageNum = 1

    def parse(self, response):
        rows = Selector(response=response).xpath(
            '//table[@id="ajaxtable"]/tbody[@style="table-layout:fixed;"]'
            '/tr[@class="tr3 t_one tac"]')
        # The first page carries two extra header rows in the matched set;
        # skip them. Subsequent pages do not.
        if self.pageNum == 1:
            rows = rows[2:]
        for tr in rows:
            count = tr.xpath('./td[4]/text()').extract_first()
            # Skip rows missing a comment-count cell, and filter out threads
            # with fewer than 4 comments.
            if count is None or int(count) < 4:
                continue
            text = tr.xpath('./td[2]//a/text()').extract_first()
            url = 'https://cc.yyss.icu/' + tr.xpath('./td[2]//a/@href').extract_first()
            item = MytestscrapyItem()
            item['urlname'] = text
            item['urladdr'] = url
            item['commentsNum'] = count
            yield item
        # Crawl pages 1-30; sleep a random 2-5 seconds between pages to
        # throttle the request rate.
        if self.pageNum < 30:
            time.sleep(random.randint(2, 5))
            self.pageNum += 1
            new_url = self.url % self.pageNum
            print("第%s頁開始" % self.pageNum)
            yield scrapy.Request(url=new_url, callback=self.parse)

 

4.items.py

import scrapy


class MytestscrapyItem(scrapy.Item):
    """Container for one qualifying forum thread scraped by the `cl` spider."""

    urlname = scrapy.Field()      # thread title text
    urladdr = scrapy.Field()      # absolute URL of the thread
    commentsNum = scrapy.Field()  # comment count as scraped from the list page

5、pipelines.py(資料存入mysql資料庫,mysql資料庫cl_table表的欄位urlname, urladdr, commentsNum)

import pymysql


class MytestscrapyPipeline(object):
    """Persist crawled items into the MySQL table ``cl_table``.

    Columns written: urlname, urladdr, commentsNum.
    """

    # Set in open_spider; None until the spider starts.
    connect = None
    cursor = None

    def open_spider(self, spider):
        # One connection and one reusable cursor per spider run, instead of
        # opening a fresh cursor for every item and never closing it.
        self.connect = pymysql.Connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='123456',
            db='cl',
            charset='utf8'
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Parameterized query: pymysql escapes the values itself, which
        # prevents SQL injection and quoting errors when a thread title
        # contains quotes — unlike the previous %-string interpolation.
        sql = "INSERT INTO cl_table (urlname, urladdr, commentsNum) VALUES (%s, %s, %s)"
        data = (item['urlname'], item['urladdr'], item['commentsNum'])
        try:
            self.cursor.execute(sql, data)
        except Exception as e:
            self.connect.rollback()  # roll back the failed transaction
            print('事務處理失敗', e)
        else:
            self.connect.commit()  # commit on success
            print('事務處理成功', self.cursor.rowcount)
        return item

    def close_spider(self, spider):
        # Guard against a run where open_spider never completed.
        if self.cursor is not None:
            self.cursor.close()
        if self.connect is not None:
            self.connect.close()