
Crawler modules: solving IO


1. The asyncio module

 The asyncio module mainly helps us detect IO (network IO only).

 @asyncio.coroutine: the coroutine decorator

 tasks: the list of tasks

 get_event_loop: get the event loop that will run the tasks

 run_until_complete: submit the tasks and run the loop until they have all finished

 asyncio.gather(list of tasks): run the tasks together

 close: close the event loop

 open_connection: establish the connection

 yield from: if the current task blocks, switch to another task

 sleep: simulate blocking network IO

 write: prepare the packet to be sent

 send.drain: send the packet (flush the write buffer)

 read: receive data

# import asyncio
#
# @asyncio.coroutine
# def task(task_id,seconds):
#     print('%s is running' %task_id)
#     yield from asyncio.sleep(seconds)
#     print('%s is done' %task_id)
#
#
# tasks=[
#     task(1,3),
#     task(2,2),
#     task(3,1)
# ]
#
# loop=asyncio.get_event_loop()
# loop.run_until_complete(asyncio.gather(*tasks))
# loop.close()

# 1. TCP: establish the connection (blocking IO)
# 2. HTTP protocol: url, request method, request headers, request body
# 3. Send the Request (IO)
# 4. Receive the Response (IO)
import asyncio

@asyncio.coroutine
def get_page(host,port=80,url='/'):  # e.g. https://  www.baidu.com:80  /
    print('GET:%s' %host)
    recv,send=yield from asyncio.open_connection(host=host,port=port)
    http_pk="""GET %s HTTP/1.1\r\nHost:%s\r\n\r\n""" %(url,host)
    send.write(http_pk.encode('utf-8'))
    yield from send.drain()
    text=yield from recv.read()
    print('host:%s size:%s' %(host,len(text)))
    # parsing would go here
    # http://www.cnblogs.com/linhaifeng/articles/7806303.html
    # https://wiki.python.org/moin/BeginnersGuide
    # https://www.baidu.com/

tasks=[
    get_page('www.cnblogs.com',url='/linhaifeng/articles/7806303.html'),
    get_page('wiki.python.org',url='/moin/BeginnersGuide'),
    get_page('www.baidu.com'),
]

loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

2. The aiohttp module

 aiohttp.request: send a request

import asyncio
import aiohttp #pip3 install aiohttp

@asyncio.coroutine
def get_page(url): #https://  www.baidu.com:80  /
    print('GET:%s' %url)
    response=yield from aiohttp.request('GET',url=url)

    data=yield from response.read()

    print('url:%s size:%s' %(url,len(data)))


#http://www.cnblogs.com/linhaifeng/articles/7806303.html
#https://wiki.python.org/moin/BeginnersGuide
#https://www.baidu.com/

tasks=[
    get_page('http://www.cnblogs.com/linhaifeng/articles/7806303.html'),
    get_page('https://wiki.python.org/moin/BeginnersGuide'),
    get_page('https://www.baidu.com/'),
]

loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
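The aiohttp.request coroutine used above matches the old yield from style of this post; current aiohttp (3.x) is built around an async/await ClientSession instead. A minimal sketch under that assumption:

import asyncio
import aiohttp  # pip3 install aiohttp

async def get_page(session, url):
    print('GET:%s' % url)
    async with session.get(url) as response:
        data = await response.read()
        print('url:%s size:%s' % (url, len(data)))

async def main():
    # A single shared ClientSession reuses connections across requests.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            get_page(session, 'http://www.cnblogs.com/linhaifeng/articles/7806303.html'),
            get_page(session, 'https://wiki.python.org/moin/BeginnersGuide'),
            get_page(session, 'https://www.baidu.com/'),
        )

asyncio.run(main())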

3. The twisted module

 twisted: an asynchronous IO framework

 getPage: send a request

 internet.reactor: the event loop (reactor) that drives all the IO

 addCallback: bind a callback function

 defer.DeferredList: wrap a list of Deferreds into a single one that fires when they have all finished

 reactor.run: start the loop that is responsible for executing the tasks

 addBoth: what to run after all the tasks have finished; it receives the results returned by the callbacks

 reactor.stop: stop the program (the reactor loop)

'''
# Problem 1: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip3 install twisted

# Problem 2: ModuleNotFoundError: No module named 'win32api'
https://sourceforge.net/projects/pywin32/files/pywin32/

# Problem 3: openssl
pip3 install pyopenssl
'''

# Basic usage of twisted
from twisted.web.client import getPage,defer
from twisted.internet import reactor

def all_done(arg):
    # print(arg)
    reactor.stop()

def callback(res):
    print(res)
    return 1

defer_list=[]
urls=[
    'http://www.baidu.com',
    'http://www.bing.com',
    'https://www.python.org',
]
for url in urls:
    obj=getPage(url.encode('utf-8'))
    obj.addCallback(callback)
    defer_list.append(obj)

defer.DeferredList(defer_list).addBoth(all_done)

reactor.run()




# Detailed usage of twisted's getPage
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()
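getPage itself is deprecated in newer Twisted releases in favour of twisted.web.client.Agent. A minimal sketch (an alternative, not the original post's code) of the first example rewritten with Agent and readBody; plain http URLs are used to avoid the extra TLS dependencies:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import Agent, readBody

def print_size(body):
    print(len(body))
    return body

def all_done(results):
    reactor.stop()

agent = Agent(reactor)
defer_list = []
for url in [b'http://www.baidu.com', b'http://www.bing.com']:
    d = agent.request(b'GET', url)   # Deferred that fires with a Response object
    d.addCallback(readBody)          # readBody returns a Deferred firing with the body bytes
    d.addCallback(print_size)
    defer_list.append(d)

DeferredList(defer_list).addBoth(all_done)
reactor.run()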

4. The tornado module

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(response):
    """
    處理返回值內容(需要維護計數器,來停止IO循環),調用 ioloop.IOLoop.current().stop()
    :param response: 
    :return: 
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)


def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
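The handle_response docstring says a counter has to be maintained to stop the IO loop, but the example above never stops it, so it keeps running after the responses are printed. A minimal sketch of that counter follows; the wiring is an assumption, and it keeps the callback-style fetch() used above, which only works on Tornado versions before 6.0 (newer releases no longer accept a callback argument):

from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado import ioloop

COUNT = 0  # number of requests still in flight

def handle_response(response):
    global COUNT
    if response.error:
        print("Error:", response.error)
    else:
        print(len(response.body))
    COUNT -= 1
    if COUNT == 0:  # all pending requests have finished
        ioloop.IOLoop.current().stop()

def func():
    global COUNT
    url_list = ['http://www.baidu.com', 'http://www.bing.com']
    COUNT = len(url_list)
    http_client = AsyncHTTPClient()
    for url in url_list:
        http_client.fetch(HTTPRequest(url), handle_response)

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()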

