A JD.com product-information crawler written in Python with the requests and re libraries

阿新 • Published: 2018-12-05


import re

import requests


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parsePage(ilt, html):
    """Append [price, title] pairs found in html to ilt."""
    plt = re.findall(r'data-done="1"><em>￥</em><i>\d+\.\d+</i></strong>', html)
    tlt = re.findall(r'<em>.+<font class="skcolor_ljg">筆盒</font>.+</em>', html)
    # zip() stops at the shorter list, so a mismatch cannot raise IndexError.
    for price_html, title_html in zip(plt, tlt):
        # re.search returns a match object, so group(0) is used to pull the price out.
        price = re.search(r'\d+\.\d+', price_html).group(0)
        # I spent a long time looking for a single regular expression that would
        # extract the Chinese title in one go and could not find one, so the
        # characters are collected one by one and joined. If anyone knows a
        # better way, pointers are welcome.
        title = ''.join(re.findall(r'[\u4e00-\u9fa5]', title_html))
        ilt.append([price, title])


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '筆盒'
    depth = 3
    start_url = 'https://search.jd.com/Search?keyword=' + goods + '&enc=utf-8'
    infoList = []
    # range(1, depth) yields i = 1, 2, so two result pages are requested (page=1 and page=3).
    for i in range(1, depth):
        url = start_url + '&page=' + str(2 * i - 1)
        html = getHTMLText(url)
        parsePage(infoList, html)
    printGoodsList(infoList)


if __name__ == "__main__":
    main()
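
On the question raised in the code comments about extracting the title in one step: one possible approach, sketched here on the assumption that each matched fragment has the `<em>…<font class="skcolor_ljg">…</font>…</em>` shape the title pattern targets, is to delete the tags with `re.sub` instead of collecting characters one at a time. The helper name `extract_title` and the sample fragment are made up for illustration and are not part of the original crawler.

```python
import re


def extract_title(fragment, chinese_only=False):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    # Drop every tag such as <em> or <font class="...">, keeping the text between them.
    text = re.sub(r'<[^>]+>', '', fragment)
    if chinese_only:
        # Same character range as the original code, applied in a single call.
        text = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
    return text.strip()


# Made-up fragment in the shape the title pattern matches.
sample = '<em>可愛 <font class="skcolor_ljg">筆盒</font> 大容量 A5</em>'
print(extract_title(sample))
print(extract_title(sample, chinese_only=True))
```

Removing the tags keeps digits and Latin letters in the title (model numbers, sizes), which the character-range approach throws away; passing `chinese_only=True` reproduces the original behaviour.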


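1. The reference source code is attached below; it comes from a MOOC course. The original crawler scraped product listings from the Taobao home page, but Taobao now requires a login check there, so the page can no longer be scraped directly. The code is still useful as a reference.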

import re

import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parsePage(ilt, html):
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            # Take the part after the first colon and drop the surrounding quotes
            # (safer than calling eval on scraped text).
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except IndexError:
        # Fewer titles than prices were found; keep what was parsed so far.
        pass


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '書包'
    depth = 2  # fetch two pages of search results
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        url = start_url + '&s=' + str(44 * i)
        html = getHTMLText(url)  # download the page text
        parsePage(infoList, html)  # then extract the needed fields from it
    printGoodsList(infoList)  # then tidy the results up and print them


if __name__ == "__main__":
    main()
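
Because the live Taobao page now sits behind a login, the parsing step can still be tried against a saved snippet. The sketch below assumes the page text embeds `"view_price":"…"` and `"raw_title":"…"` pairs, as the patterns above expect. The sample string and the name `parse_items` are invented for illustration. Capture groups pull the values out directly, so a colon inside a title does no harm.

```python
import re


def parse_items(html):
    """Return [price, title] pairs from text containing view_price/raw_title fields."""
    # The parentheses capture only the value between the quotes.
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    # zip() pairs them up and stops at the shorter list.
    return [[p, t] for p, t in zip(prices, titles)]


# Made-up sample data, not a real Taobao response.
sample = (
    '{"raw_title":"雙肩書包","view_price":"59.00"},'
    '{"raw_title":"Kids bag: blue","view_price":"128.50"}'
)
items = parse_items(sample)
for price, title in items:
    print(price, title)
```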
