Crawling the 陽光問政 (Sunshine Government Inquiry) Platform

阿新 • Published: 2018-06-22


Create the project

scrapy startproject dongguan

items.py

import scrapy


class DongguanItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
    number = scrapy.Field()

Create a CrawlSpider using the crawl template

scrapy genspider -t crawl sun wz.sun0769.com

sun.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dongguan.items import DongguanItem

class SunSpider(CrawlSpider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    rules = (
        # Follow the paginated list pages (no callback, so follow defaults to True)
        Rule(LinkExtractor(allow=r'type=4&page=\d+')),
        # Parse each post page
        Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = DongguanItem()

        # Title
        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Post number: the last token of the title, after the colon
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Content
        item['content'] = response.xpath('//div[@class="c1 text14_2"]/text()').extract()[0]
        # Link
        item['url'] = response.url

        yield item
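
The number field relies on the title ending with a space-separated token that contains a colon. A minimal sketch of that split chain on a made-up title (extract_number and the sample string are assumptions for illustration, not taken from the site):

```python
def extract_number(title):
    """Return the text after the last colon in the last space-separated token."""
    return title.split(' ')[-1].split(":")[-1]


# Hypothetical title in the shape the spider expects
sample_title = "Question: streetlight outage  No.:191166"
print(extract_number(sample_title))
```

If the title has no trailing token with a colon, the chain returns the last word unchanged instead of raising, so a wrong number can slip through silently.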

pipelines.py

import json

class DongguanPipeline(object):
    def __init__(self):
        # Open as UTF-8 text so non-ASCII characters are written as-is
        self.filename = open("dongguan.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # Python 3: write the str directly.
        # On Python 2, open the file without encoding= and use:
        #   self.filename.write(text.encode("utf-8"))
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
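
Because each item is written as a JSON object followed by a comma and a newline, dongguan.json as a whole is not a single valid JSON document. A minimal sketch of one way to read it back line by line, assuming one object per line (load_items is a hypothetical helper, not part of the project):

```python
import json


def load_items(path):
    """Parse a file holding one JSON object per line, each ending with a comma."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Drop the newline and the trailing comma the pipeline appended
            line = line.strip().rstrip(",")
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

This works because json.dumps escapes newlines inside strings, so every item occupies exactly one line.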

settings.py

BOT_NAME = 'dongguan'

SPIDER_MODULES = ['dongguan.spiders']
NEWSPIDER_MODULE = 'dongguan.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'dongguan.pipelines.DongguanPipeline': 300,
}

LOG_FILE = "dg.log"
LOG_LEVEL = "DEBUG"

Run

scrapy crawl sun

The crawled content turns out to be incomplete.

Problem analysis:

Diagnose by calling print(response.url) on each matched list page:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dongguan.items import DongguanItem

class SunSpider(CrawlSpider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item'),
        #Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml'), callback='parse_item'),
    )

    def parse_item(self, response):
        print(response.url)
        '''
        item = DongguanItem()

        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Post number
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Content
        item['content'] = response.xpath('//div[@class="c1 text14_2"]/text()').extract()[0]
        # Link
        item['url'] = response.url

        yield item
        '''


Change the matching rule:

    rules = (
        Rule(LinkExtractor(allow=r'type=4'), callback='parse_item'),
    )
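
LinkExtractor applies each allow pattern as a regular-expression search against the candidate URL, so the effect of loosening the rule can be checked with re alone. A small sketch, assuming the page links come back with ? and & swapped (the sample URL is an assumption, inferred from the replace calls in deal_links further down):

```python
import re

strict = r'type=4&page=\d+'
loose = r'type=4'

# Hypothetical page link in the rewritten form
mangled = 'http://wz.sun0769.com/index.php/question/questionType&type=4?page=30'


def matches(pattern, url):
    """Mimic an allow= check: True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


print(matches(strict, mangled))  # False: '&page=' never appears in this form
print(matches(loose, mangled))   # True: 'type=4' is still present
```

Under that assumption, the narrower pattern silently drops every rewritten page link, which would explain the missing content.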


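Set follow=True on the rule in sun.py so that links on the matched pages keep being followed. A rule that has a callback defaults to follow=False.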


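The response does not necessarily correspond to the URL that was requested, and the URLs extracted from the following pages are invalid.

Rewrite sun.py: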

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from newdongguan.items import NewdongguanItem

class DongdongSpider(CrawlSpider):
    name = 'dongdong'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    # Rule for matching each list page
    pagelink = LinkExtractor(allow=("type=4"))
    # Rule for matching each post within a list page
    contentlink = LinkExtractor(allow=(r"/html/question/\d+/\d+\.shtml"))

    rules = (
        # In this case the web server tampers with the URLs, so process_links
        # is used to repair the extracted links before they are requested
        Rule(pagelink, process_links="deal_links"),
        Rule(contentlink, callback="parse_item")
    )

    # links is the list of links extracted from the current response
    def deal_links(self, links):
        for each in links:
            each.url = each.url.replace("?", "&").replace("Type&", "Type?")
        return links

    def parse_item(self, response):
        item = NewdongguanItem()
        # Title
        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Post number
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Content: first try the rule for posts with images; extract() returns a list of all matched text
        content = response.xpath('//div[@class="contentext"]/text()').extract()
        # An empty list means no match, so fall back to the rule for posts without images
        if len(content) == 0:
            content = response.xpath('//div[@class="c1 text14_2"]/text()').extract()
        item['content'] = "".join(content).strip()
        # Link
        item['url'] = response.url

        yield item
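
The repair in deal_links is plain string replacement, so it can be tried outside Scrapy. A minimal sketch, assuming the extracted links have ? and & swapped as in the sample below (fix_url and the sample URL are illustrative, not from the original):

```python
def fix_url(url):
    """Turn every '?' into '&', then restore the '?' right after 'questionType'."""
    return url.replace("?", "&").replace("Type&", "Type?")


# Hypothetical link in the form the server is assumed to return
mangled = "http://wz.sun0769.com/index.php/question/questionType&type=4?page=30"
print(fix_url(mangled))
```

Applying the function to an already well-formed URL leaves it unchanged, so it is safe to run every extracted page link through it.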
