[Study Notes] Scraping real URLs from Baidu with Python

阿新 • Published: 2017-09-08

python

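Today I needed a pile of test URLs for a script. Hunting them down one by one and copy-pasting is hardly a programmer's style, so I wrote a script to do it instead.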

Environment: Python 2.7

Editor: Sublime Text 3

1. Analysis

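First, many thanks to Baidu for keeping its result URLs so tidy: they all sit in links that share a single CSS class, c-showurl. So it is enough to select links by that class, which BeautifulSoup can do. The code is as follows: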

        soup = BeautifulSoup(content, 'html.parser')
        urls = soup.find_all('a', class_='c-showurl')

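The other problem is that Baidu encrypts the result URLs. To get the real URL, my approach is to request the encrypted URL once and then read the URL of the page that request lands on; the URL obtained this way is the real one.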
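As a minimal sketch of that idea, assuming the encrypted link answers with an ordinary HTTP redirect: requests follows redirects on GET by default, so the `url` attribute of the final response holds the address the request ended up at. The helper name `resolve_real_url`, its `fetch` parameter, and the example.com addresses are all hypothetical; `fetch` exists only so the function can be tried with a stub instead of a live request.

```python
def resolve_real_url(link, fetch=None, timeout=10):
    """Return the URL that `link` finally lands on after redirects."""
    if fetch is None:
        # Imported lazily so a stub can be passed in when working offline.
        import requests
        fetch = requests.get
    response = fetch(link, timeout=timeout)
    return response.url


class FakeResponse(object):
    """Stand-in for a requests response; only the `url` attribute is used."""
    def __init__(self, url):
        self.url = url


def fake_fetch(link, timeout=10):
    # Pretend every link redirects to the same made-up target.
    return FakeResponse('http://example.com/real-page')


# Made-up encrypted link, resolved through the stub rather than the network.
print(resolve_real_url('http://example.com/encrypted-link', fetch=fake_fetch))
```

Calling `resolve_real_url(link)` without `fetch` performs a real GET, which is what the full script does inline.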

The complete code is as follows:

# -*- coding: utf-8 -*-
# Written for Python 2.7; on Python 3, replace raw_input() with input().
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 QIHU 360SE'
}

page_start = raw_input('please input start page\n')
page_end = raw_input('please input end page\n')
word = raw_input('please input keyword\n')

# The pn parameter is a result offset: page 1 -> 0, page 2 -> 10, ...
page_start = (int(page_start) - 1) * 10
# range() excludes its stop value, so go one page past the end page;
# otherwise the end page itself would never be fetched.
page_end = int(page_end) * 10

for i in range(page_start, page_end, 10):
    try:
        # Passing params lets requests URL-encode the keyword.
        response = requests.get('http://www.baidu.com/s',
                                params={'wd': word, 'pn': i},
                                headers=headers, timeout=10)
        print('downloading...' + response.url)
        soup = BeautifulSoup(response.content, 'html.parser')
        urls = soup.find_all('a', class_='c-showurl')
        for link in urls:
            a = link.get('href')
            if not a:
                continue  # skip anchors without an href
            try:
                # requests follows redirects, so res.url is the final address.
                res = requests.get(a, headers=headers, timeout=10)
                with open('urls.txt', 'a') as f:
                    f.write(res.url + '\n')
                time.sleep(1)  # pause between requests
            except Exception as e:
                print(e)
    except Exception as e:
        print(e)
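The page arithmetic can be checked on its own. This is a small sketch, assuming 10 results per page as the script does and treating the end page as inclusive; the helper name `page_offsets` is hypothetical.

```python
def page_offsets(first_page, last_page, per_page=10):
    """Return the pn offsets for pages first_page..last_page (1-based, inclusive)."""
    start = (first_page - 1) * per_page
    # range() excludes its stop value, so stopping at last_page * per_page
    # still includes the offset of last_page itself.
    stop = last_page * per_page
    return list(range(start, stop, per_page))


print(page_offsets(1, 3))  # [0, 10, 20]
print(page_offsets(2, 2))  # [10]
```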

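Of course, this is only a simple implementation. If you need to crawl a large number of URLs, I recommend using threads; otherwise you will be waiting until the end of time for it to finish. I crawled a hundred or so URLs, and in my own testing that worked fine.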
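One way the threading suggestion might look is a thread pool from `concurrent.futures`. This is a sketch rather than part of the original script: it is written for Python 3, where that module is in the standard library (Python 2.7 needs a backport), and the names `resolve_all`, `safe_resolve`, and `fake_resolve` as well as the example.com links are hypothetical. The lookup function is passed in, so the stub below stands in for a real network request.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_all(links, resolve, max_workers=8):
    """Apply `resolve` to every link using a pool of worker threads.

    Returns a list of (link, result) pairs in input order; a link whose
    lookup raised an exception gets None as its result.
    """
    def safe_resolve(link):
        try:
            return resolve(link)
        except Exception as exc:
            print('failed: %s (%s)' % (link, exc))
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input links.
        results = list(pool.map(safe_resolve, links))
    return list(zip(links, results))


def fake_resolve(link):
    # Stand-in for a real lookup such as requests.get(link).url.
    if link.endswith('bad'):
        raise ValueError('unreachable')
    return link.replace('encrypted', 'real')


pairs = resolve_all(['http://example.com/encrypted/1',
                     'http://example.com/encrypted/bad'], fake_resolve)
print(pairs)
```

In real use, `resolve` could be a small function that calls `requests.get` with the script's headers and timeout and returns `.url`. Collecting the results first and writing urls.txt from the main thread avoids several threads appending to the same file at once.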

This article comes from the "踟躕" blog. Please keep this attribution: http://chichu.blog.51cto.com/11287515/1963693
