程式人生 > 使用lxml的css選擇器用法爬取奇書網並保存到mongoDB中

使用lxml的css選擇器用法爬取奇書網並保存到mongoDB中

referer 最新 shu auth style ret bre last tail

import requests
from lxml import etree
from fake_useragent import UserAgent
import pymongo
class QiShuSpider(object):
    """Crawl novel metadata from www.qisuu.la (category ``sort01``) and
    insert one document per novel into the module-level MongoDB
    ``collection`` created in the ``__main__`` block below."""

    def __init__(self):
        # Entry page: first listing page of category sort01.
        self.base_url = "https://www.qisuu.la/soft/sort01/"
        self.headers = {
            "User-Agent": UserAgent().random,
            # BUG FIX: the original value contained an embedded newline
            # ("www.qisuu.la\n"), which is an invalid HTTP header value.
            "HOST": "www.qisuu.la",
            "Referer": "https://www.qisuu.la",
        }

    def get_index_code(self):
        """Fetch the index page and return the list of ``<option>``
        elements (one per paginated listing page).

        Returns an empty list after five failed connection attempts
        (the original fell through returning ``None``, which crashed
        the caller when it tried to iterate the result)."""
        retry_link_count = 0  # how many reconnect attempts have been made
        while True:
            try:
                response = requests.get(self.base_url, headers=self.headers)
            except Exception as e:
                print("連接奇書網失敗,原因是:", e)
                print("正在嘗試第{}次重連....".format(retry_link_count))
                retry_link_count += 1
                if retry_link_count >= 5:
                    print("嘗試連接次數已經達到五次,停止連接")
                    return []
            else:
                html_obj = etree.HTML(response.text)
                # Every paging link lives in a <select><option value="..."> tag.
                option_list = html_obj.cssselect("select>option")
                return option_list

    def get_every_page_code(self):
        """Walk every listing page and crawl each novel linked from it."""
        option_list = self.get_index_code()
        for option in option_list:
            value = option.get("value")
            # ``value`` is a site-relative path; build the absolute URL.
            base_url = "https://www.qisuu.la" + value
            print("正在爬取{}鏈接".format(base_url))
            response = requests.get(base_url, headers=self.headers).text
            html_obj = etree.HTML(response)
            # One <a> per novel inside the list box.
            a_list = html_obj.cssselect(".listBox li>a")
            for a in a_list:
                novel_href = a.get("href")
                # Detail-page hrefs are also site-relative.
                novel_url = "https://www.qisuu.la" + novel_href
                print("正在爬取鏈接為{}的小說".format(novel_url))
                self.parse_every_novel(novel_url)

    def parse_every_novel(self, novel_url):
        """Parse a single novel detail page and insert its metadata into
        MongoDB via the module-level ``collection``."""
        response = requests.get(novel_url, headers=self.headers)
        # Force UTF-8 so the Chinese text decodes correctly regardless of
        # what charset the server advertises.
        response.encoding = "utf-8"
        html_obj = etree.HTML(response.text)
        novel_name = html_obj.cssselect(".detail_right>h1")[0].text
        # The eight <li> items hold, in page order: clicks, file size,
        # type, update time, status, author, runtime environment; the
        # 8th wraps the latest-chapter title in an <a>.
        click_num = html_obj.cssselect(".detail_right>ul>li:nth-child(1)")[0].text
        novel_size = html_obj.cssselect(".detail_right>ul>li:nth-child(2)")[0].text
        novel_type = html_obj.cssselect(".detail_right>ul>li:nth-child(3)")[0].text
        update_time = html_obj.cssselect(".detail_right>ul>li:nth-child(4)")[0].text
        novel_status = html_obj.cssselect(".detail_right>ul>li:nth-child(5)")[0].text
        novel_author = html_obj.cssselect(".detail_right>ul>li:nth-child(6)")[0].text
        novel_run_envir = html_obj.cssselect(".detail_right>ul>li:nth-child(7)")[0].text
        novel_lasted_chapter = html_obj.cssselect(".detail_right>ul>li:nth-child(8)>a")[0].text
        dict_novel = {
            "小說名稱": novel_name,
            "點擊次數": click_num,
            "小說大小": novel_size,
            "小說類型": novel_type,
            "更新時間": update_time,
            "小說狀態": novel_status,
            "小說作者": novel_author,
            "小說運行環境": novel_run_envir,
            "小說最新章節": novel_lasted_chapter,
        }
        # ``collection`` is the module-level handle created in __main__.
        collection.insert_one(dict_novel)

    def start_spider(self):
        """Public entry point: start the full crawl."""
        self.get_every_page_code()


if __name__ == "__main__":
    # BUG FIX: the original compared against the bare name ``__main__``
    # (a NameError at import time); it must be the string "__main__".
    client = pymongo.MongoClient(host="localhost", port=27017)
    db = client.novel
    collection = db.novel
    spider = QiShuSpider()
    spider.start_spider()

使用lxml的css選擇器用法爬取奇書網並保存到mongoDB中