
Scrapy project: crawl all Zhihu users' information and save it to MongoDB


spider

import scrapy
import json, time, re
from zhihuinfo.items import ZhihuinfoItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/api/v4/members/eve-lee-55/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20', ]

    def parse(self, response):
        temp_data = json.loads(response.body.decode("utf-8"))["data"]
        count = len(temp_data)

        # If fewer than a full page of users came back, we have reached the last
        # page of this followee list, so there is no next page to request.
        if count <= 18:
            pass
        # Otherwise bump the offset so the spider turns to the next page.
        else:
            offset = re.findall(re.compile(r'&offset=(.*?)&'), response.url)[0]
            new_offset = int(offset) + 20
            print(new_offset)
            time.sleep(1)
            yield scrapy.Request(
                "https://www.zhihu.com/api/v4/members/eve-lee-55/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=" + str(new_offset) + "&limit=20",
                callback=self.parse,
                dont_filter=True)

        for i in temp_data:
            item = ZhihuinfoItem()
            item["name"] = i["name"]
            item["url_token"] = i["url_token"]
            item["headline"] = i["headline"]
            item["follower_count"] = i["follower_count"]
            item["answer_count"] = i["answer_count"]
            item["articles_count"] = i["articles_count"]
            item["id"] = i["id"]
            item["type"] = i["type"]

            # userinfo.txt records the url_token of every user already crawled,
            # so the same user is never scraped twice.
            with open("userinfo.txt") as f:
                user_list = f.read()

            if i["url_token"] not in user_list:
                with open("userinfo.txt", "a") as f:  # "a" opens the file in append mode
                    f.write(i["url_token"] + "-----")
                yield item

                # Switch into this new user's own followee list. The crawl keeps
                # spreading this way, so in theory it can eventually reach every
                # active, well-connected user.
                new_url = "https://www.zhihu.com/api/v4/members/" + i["url_token"] + "/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20"
                time.sleep(1)
                yield scrapy.Request(url=new_url, callback=self.parse)
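
One practical detail: parse() opens userinfo.txt for reading before it ever appends to it, so the file must already exist when the spider starts, otherwise the very first response raises FileNotFoundError. A minimal sketch of how to pre-create it (this initialization step is my addition and is not shown in the original project):

# create an empty dedup file before the first run if it is not there yet
# (assumed helper snippet, not part of the original spider)
import os

if not os.path.exists("userinfo.txt"):
    open("userinfo.txt", "w").close()

Re-reading the whole file into a string for every item works for a small crawl, but loading the tokens into a set once at start-up would scale better as the user list grows.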




pipelines

import pymongo
from scrapy.conf import settings  # settings access used by older Scrapy releases


class ZhihuinfoPipeline(object):
    def __init__(self):
        # Read the MongoDB connection details from the project settings.
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)
        tdb = client[dbname]
        self.post = tdb[settings["MONGODB_DOCNAME"]]

    def process_item(self, item, spider):
        # Convert the item to a plain dict and store it as one MongoDB document.
        # insert() is the older pymongo API; newer versions use insert_one().
        zhihuzhihu = dict(item)
        self.post.insert(zhihuzhihu)
        return item
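
The pipeline pulls its MongoDB connection details from the project settings, but the post does not show the corresponding settings.py entries. A plausible fragment, where the concrete host, port, database and collection names are my assumptions, might look like this:

# settings.py -- example values only; the original post does not list them
MONGODB_HOST = "127.0.0.1"    # assumed local MongoDB instance
MONGODB_PORT = 27017          # MongoDB default port
MONGODB_DBNAME = "zhihu"      # assumed database name
MONGODB_DOCNAME = "userinfo"  # assumed collection name

# the pipeline also has to be registered so Scrapy calls it for every item
ITEM_PIPELINES = {
    "zhihuinfo.pipelines.ZhihuinfoPipeline": 300,
}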


items

import scrapy


class ZhihuinfoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    url_token = scrapy.Field()
    headline = scrapy.Field()
    follower_count = scrapy.Field()
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    id = scrapy.Field()
    type = scrapy.Field()
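
Once the crawl has been running for a while, the stored documents can be inspected directly with pymongo. A quick check, assuming the example database and collection names from the settings sketch above:

import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)
collection = client["zhihu"]["userinfo"]  # assumed names from the settings sketch

print(collection.count_documents({}))     # number of users saved so far
print(collection.find_one({}, {"name": 1, "follower_count": 1, "_id": 0}))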


