
Scraping Qidian Novels and Storing Them in a Database

The end result: every chapter of every book in the category ends up as its own row in a MySQL table. The flow is: crawl the category listing pages, follow each book to its detail page, fetch the chapter catalogue from the book's AJAX JSON endpoint, download every chapter page, and hand the finished item to a pymysql pipeline.

······················Main spider:·······································

# -*- coding: utf-8 -*-
import scrapy
import json
from qidian.items import QidianItem

class MyqidianSpider(scrapy.Spider):
    name = 'myqidian'
    allowed_domains = ['qidian.com']
    start_urls = ['http://www.qidian.com/all?chanId=21&orderId=&page=1&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0']

    def parse(self, response):
        bookList = response.xpath('//ul[@class="all-img-list cf"]/li')
        for i in bookList:
            bookId = i.xpath('./div[@class="book-img-box"]/a/@data-bid').extract()[0]
            bookUrl = 'http:' + i.xpath('./div[@class="book-img-box"]/a/@href').extract()[0]
            # pass the detail-page url and bookId on to the next callback
            yield scrapy.Request(bookUrl, callback=self.get_url, meta={"bookId": bookId})
        # build the pagination requests from the listing's data-pagemax attribute
        page = int(response.xpath('//@data-pagemax').extract_first())
        for i in range(2, page + 1):
            url = "http://www.qidian.com/all?chanId=21&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page={}".format(i)
            yield scrapy.Request(url, callback=self.parse)

    def get_url(self, response):
        bookId = response.meta['bookId']
        jsonurl = 'https://book.qidian.com/ajax/book/category?_csrfToken=OFmDKzipSh4trLG5YRG79dFXcFYAEZgV0cjNceDd&bookId=' + bookId
        bookName = response.xpath('//div[@class="book-info "]/h1/em/text()').extract()[0]
        writerName = response.xpath('//div[@class="book-info "]/h1/span/a/text()').extract()[0]
        # xinxi: the book's synopsis paragraph
        xinxi = response.xpath('//div[@class="book-intro"]/p/text()').extract()[0].strip()
        meta = {
            "bookName": bookName, "writerName": writerName, "xinxi": xinxi
        }
        yield scrapy.Request(jsonurl, callback=self.get_zhangjie, meta=meta)

    def get_zhangjie(self, response):
        meta = response.meta
        bookName = meta['bookName']
        writerName = meta['writerName']
        xinxi = meta['xinxi']
        # the catalogue endpoint returns JSON; Scrapy has already downloaded it,
        # so parse response.text instead of re-fetching the url with requests
        data = json.loads(response.text)['data']
        vs = data.get('vs')          # vs: the book's volumes
        for v in vs:
            cs = v.get('cs')         # cs: the chapters in this volume
            for c in cs:
                cN = c.get('cN')     # chapter name
                cU = c.get('cU')     # chapter url fragment
                curl = 'https://read.qidian.com/chapter/' + cU
                uT = c.get('uT')     # update time
                cnt = c.get('cnt')   # word count

                meta = {
                    "bookName": bookName, "writerName": writerName, "xinxi": xinxi,
                    "cN": cN, "curl": curl, "uT": uT, "cnt": cnt
                }
                yield scrapy.Request(curl, callback=self.Lett_text, meta=meta)

    def Lett_text(self, response):
        item = QidianItem()
        meta = response.meta
        item['bookName'] = meta['bookName']
        item['writerName'] = meta['writerName']
        item['xinxi'] = meta['xinxi']
        item['cN'] = meta['cN']
        item['curl'] = meta['curl']
        item['uT'] = meta['uT']
        item['cnt'] = meta['cnt']

        textList = response.xpath('//div[@class="read-content j_readContent"]')
        for text in textList:
            # './/p' keeps the query scoped to this div; a bare '//p' would
            # match every <p> on the page
            paragraphs = text.xpath('.//p/text()').extract()[1:]
            item['text'] = ''.join(paragraphs).strip().replace('\u3000', '')
            yield item
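
For the pipeline at the bottom to actually receive items, it has to be registered in the project's settings.py, which the post never shows. A minimal sketch, assuming the default project layout implied by `from qidian.items import QidianItem`; the delay value is an assumption, not something from the original:

# settings.py (not shown in the original post; a minimal sketch)
BOT_NAME = 'qidian'

# register the MySQL pipeline defined in pipelines.py so Scrapy calls it for every item
ITEM_PIPELINES = {
    'qidian.pipelines.QidianPipeline': 300,
}

# politeness settings are assumptions; tune them for your own crawl
DOWNLOAD_DELAY = 0.5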


··············Item file:··························
import scrapy

class QidianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    bookName = scrapy.Field()
    writerName = scrapy.Field()
    xinxi = scrapy.Field()
    cN = scrapy.Field()
    curl = scrapy.Field()
    uT = scrapy.Field()
    cnt = scrapy.Field()
    text = scrapy.Field()
················Writing to the database:··························

import pymysql
class QidianPipeline(object):
    def __init__(self):
        self.conn = None
        self.cur = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='your_password',  # placeholder: use your own MySQL password
            db='pydata201806',
            charset='utf8mb4'  # utf8mb4 covers the full character range in novel text
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        cols, values = zip(*item.items())
        # table and column names cannot be parameterised, so they are
        # interpolated here; the values go through execute() placeholders
        sql = "INSERT INTO `%s` (%s) VALUES (%s)" % \
              (
                  'qidianbook',
                  ','.join('`{}`'.format(c) for c in cols),
                  ','.join(['%s'] * len(values))
               )
        # mogrify() returns the query exactly as it will be sent to MySQL
        print(self.cur.mogrify(sql, values))
        self.cur.execute(sql, values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
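
The pipeline assumes a `qidianbook` table already exists in the `pydata201806` database. The post doesn't show the schema, so the following one-off setup script is a sketch: every column type is an assumption, derived only from the fields defined in QidianItem, with LONGTEXT for the chapter body:

# create_table.py -- one-off setup; the schema is an assumption,
# reconstructed from the fields of QidianItem
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='your_password',  # placeholder
                       db='pydata201806', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS `qidianbook` (
            `id`         INT AUTO_INCREMENT PRIMARY KEY,
            `bookName`   VARCHAR(255),
            `writerName` VARCHAR(255),
            `xinxi`      TEXT,          -- book synopsis
            `cN`         VARCHAR(255),  -- chapter name
            `curl`       VARCHAR(512),  -- chapter url
            `uT`         VARCHAR(64),   -- update time
            `cnt`        VARCHAR(64),   -- word count
            `text`       LONGTEXT       -- chapter body
        ) CHARACTER SET utf8mb4
    """)
conn.commit()
conn.close()

With the table in place, start the crawl with `scrapy crawl myqidian` and each chapter is written as one row.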