
Scraping Tianyancha Data with a Scrapy + Selenium Crawler

# Challenges:

  • 1. The data API is hard to find and the anti-scraping measures are strong, so Selenium is used to simulate a real browser and fetch the pages.

  • 2. The digits on the page are obfuscated with a custom font, which has to be reverse-engineered (see the font-inspection sketch below).

    ### This post scrapes the Tianyancha mobile site, m.tianyancha.com: enter a company name and the crawler fetches the detail pages of the first 5 search results.
    ### Note that the site rotates its obfuscated font daily, so anyone reusing this code must re-derive the digit mapping first; the FontCreator tool can be used to inspect the font.
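Because the font rotates daily, the `aa` mapping in the spider below goes stale quickly. Besides FontCreator, the fontTools library (which the spider already imports) can dump the font's tables as a starting point. A minimal sketch, assuming the page's digit font has been downloaded and saved locally as tyc-num.woff (a hypothetical filename):

#font inspection helper (sketch, not part of the original project)
from fontTools.ttLib import TTFont

font = TTFont("tyc-num.woff")  # hypothetical local copy of the site's obfuscated digit font
print(font.getGlyphOrder())    # glyph names in the order the font stores them
for codepoint, glyph_name in font.getBestCmap().items():  # codepoint -> glyph mapping
    print(hex(codepoint), glyph_name)

Comparing this output with how the digits actually render (in the browser or in FontCreator) is how the mapping table below gets rebuilt each day.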

### The crawling code has a few points worth noting as well; it drives Google Chrome in headless mode.

#Spider file
# -*- coding: utf-8 -*-
import scrapy
from tianyancha.items import TianyanchaItem
import re
from fontTools.ttLib import TTFont
# aa maps each digit as rendered by the obfuscated font to the real digit;
# it must be re-derived whenever the site rotates the font
aa = {
    2: 0,
    8: 2,
    0: 4,
    7: 6,
    9: 7,
    6: 8,
    4: 9,
    1: 1,
    5: 5,
    3: 3
}
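# Example: if the page renders "2019", the real value is
# aa[2], aa[0], aa[1], aa[9] -> 0, 4, 1, 7, i.e. "0417"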
class ChaSpider(scrapy.Spider):
    name = 'cha'
    allowed_domains = ['m.tianyancha.com']
    # start_urls = ['http://m.tianyancha.com/']

    def start_requests(self):
        meta = {"nihao": "dawang"}  # flag that tells the downloader middleware to render this request with Selenium
        a = input("請輸入要查詢的企業名:")
        url = "https://m.tianyancha.com/search?key=%s" % a
        yield scrapy.Request(url=url, callback=self.parse, meta=meta)

    def parse(self, response):
        meta = {"nihao": "dawang"}
        # Collect the detail-page links of the search results and follow each one
        url_lists = response.xpath('//div[contains(@class,"col-xs-10")]/a/@href').extract()
        for url_list in url_lists:
            yield scrapy.Request(url=url_list, callback=self.new_parse, meta=meta)

    def new_parse(self, response):
        item = TianyanchaItem()
        item["company"] = response.xpath('//div[@class="over-hide"]/div/text()').extract()[0]
        # item["boss"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[1]/span[2]/a/text()').extract()[0]
        # Registration date: its digits are rendered with the obfuscated font, so map each one back through aa
        a = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[3]/span[2]/text/text()').extract()
        a = a[0].replace("-", "")
        bb = [aa[int(ch)] for ch in a]  # each ch is a one-character str digit
        item["registration_time"] = "".join("%s" % d for d in bb)  # join the decoded digits into a string
        # Registered capital: same font trick, digits only
        b = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[4]/span[2]/text/text()').extract()[0]
        b = re.findall(r"\d+", b)
        kk = [aa[int(ch)] for ch in b[0]]
        item["the_registered_capital"] = "".join("%s" % d for d in kk) + "萬"
        item["industry"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[5]/span[2]/text()').extract()[0]
        item["the_enterprise_type"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[6]/span[2]/text()').extract()[0]
        item["registration_number"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[7]/span[2]/text()').extract()[0]
        item["organization_code"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[8]/span[2]/text()').extract()[0]
        item["credit_code"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[9]/span[2]/text()').extract()[0]
        item["business_period"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[10]/span[2]/text()').extract()[0]
        item["approval_date"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[11]/span[2]/text()').extract()[0]
        item["registration_authority"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[13]/span[2]/text()').extract()[0]
        item["registered_address"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[14]/span[2]/text()').extract()[0]
        item["scope_of_business"] = response.xpath('/html/body/div[3]/div[1]/div[7]/div/div[15]/span[2]/span/span[2]/div/text/text()').extract()[0]
        # print(item["company"])
        # print(item["registration_time"])
        # print(item["the_registered_capital"])
        print(item)
        yield item
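
The spider imports TianyanchaItem from tianyancha.items, but the post never shows items.py. A minimal definition covering exactly the fields the spider fills (a sketch, not the author's original file):

#items file (assumed)
import scrapy

class TianyanchaItem(scrapy.Item):
    company = scrapy.Field()
    registration_time = scrapy.Field()
    the_registered_capital = scrapy.Field()
    industry = scrapy.Field()
    the_enterprise_type = scrapy.Field()
    registration_number = scrapy.Field()
    organization_code = scrapy.Field()
    credit_code = scrapy.Field()
    business_period = scrapy.Field()
    approval_date = scrapy.Field()
    registration_authority = scrapy.Field()
    registered_address = scrapy.Field()
    scope_of_business = scrapy.Field()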

#middlewares file
from selenium import webdriver
from scrapy.http import HtmlResponse



class TianyanchaDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Only render requests flagged by the spider; everything else passes through untouched
        if request.meta.get("nihao") == "dawang":
            option = webdriver.ChromeOptions()
            option.add_argument('--headless')
            option.add_argument('--disable-gpu')
            driver = webdriver.Chrome(options=option)  # Selenium 4 keyword; older versions used chrome_options=
            driver.get(request.url)
            content = driver.page_source
            driver.quit()
            # Return the rendered HTML so the spider parses the post-JavaScript page
            return HtmlResponse(request.url, encoding="utf-8", body=content, request=request)
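
For the middleware to fire at all, it has to be enabled in the project's settings.py. A minimal sketch of the relevant lines (543 is the conventional priority slot from Scrapy's docs; disabling robots.txt is implied by this kind of scraping):

#settings file (relevant lines only)
DOWNLOADER_MIDDLEWARES = {
    'tianyancha.middlewares.TianyanchaDownloaderMiddleware': 543,
}
ROBOTSTXT_OBEY = False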