Python: Study Notes on Web Scraping

阿新 • Published: 2017-06-03

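If the content you want to scrape is embedded in the page's HTML source, you can simply download the source and search it with a regular expression. Here is a simple example: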

import urllib.request

# Download the page, then decode the raw bytes into text
html = urllib.request.urlopen('http://www.massey.ac.nz/massey/learning/programme-course/programme.cfm?prog_id=93536')
html = html.read().decode('utf-8')
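As a slightly more defensive variant of the download step, here is a minimal sketch. The helper name `fetch_html`, the 10-second timeout, and the UTF-8 default are assumptions made for illustration, not part of the original notes. It closes the connection with a `with` block and uses the charset declared in the `Content-Type` header when there is one.

```python
import urllib.request

def fetch_html(url, timeout=10, default_charset="utf-8"):
    """Download url and return its body as text (hypothetical helper)."""
    # The with-block closes the connection even if reading fails.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Charset from the Content-Type header, or None if none is declared.
        charset = resp.headers.get_content_charset() or default_charset
    return raw.decode(charset)
```

Calling `fetch_html(url)` with the Massey URL above would then replace the two-step `urlopen` / `read().decode()` sequence. A server can still declare the wrong charset, so decoding may fail even with this helper.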

Note that the decode method can sometimes raise an error, for example:

html = urllib.request.urlopen('http://china.nba.com/')

html = html.read().decode('utf-8')
Traceback (most recent call last):

  File "<ipython-input-6-fc582e316612>", line 1, in <module>
    html = html.read().decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 85: invalid continuation byte

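I am not sure of the exact cause. Most likely the page is not actually encoded as UTF-8 (a byte such as 0xd6 is common in GBK/GB2312-encoded Chinese text). One workaround is to pass the error-handling argument of decode, as follows: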

html = html.read().decode('utf-8', 'replace')

html = urllib.request.urlopen('http://china.nba.com/')
html = html.read().decode('utf-8', 'replace')

html
Out[9]: '<!DOCTYPE html>\r\n<html>\r\n<head lang="en">\r\n    <meta charset="UTF-8">\r\n    <title>NBA?й???????</title>\r\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">\r\n    <meta name="description" content="NBA?й???????">\r\n    <meta name="keywords"
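If losing characters is not acceptable, another option is to try a short list of candidate encodings before falling back to a lossy decode. The following is only a sketch: the helper name `decode_html`, the candidate list, and the sample title string are assumptions for illustration, and whether the NBA page really is GBK-encoded has not been verified here.

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in turn (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: lossy decode that never raises.
    return raw.decode("utf-8", errors="replace")

# Made-up sample: a Chinese title encoded as GBK, not taken from the real page.
sample = "NBA中国官方网站".encode("gbk")
print(decode_html(sample))                       # decoded via the GBK fallback
print(sample.decode("utf-8", errors="replace"))  # lossy: contains U+FFFD marks
```

Note that GBK will decode many byte sequences that are not really GBK without raising, so the order of the candidate list matters and the result should be spot-checked.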

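The 'replace' argument substitutes each undecodable byte with the Unicode replacement character (U+FFFD) instead of raising an error. It is a compromise: the page decodes, but some characters are lost. Now back to the main topic. Suppose we want to scrape the course name from the Massey page mentioned above.

View the page source. I use Google Chrome: right-click the page and choose 'View page source'.

On the source page, press Ctrl+F and search for the text you want to scrape.

That is the source code behind the course name shown on the page (to really understand the source you need to learn some HTML; http://www.w3school.com.cn/html/index.asp is a good site for that).

Next, use a regular expression to pull that string out: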

import re
re.findall('<h1>.*?</h1>', html)
Out[35]: ['<h1>Master of Advanced Leadership Practice (<span>MALP</span>)</h1>']
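The `?` after `.*` makes the match non-greedy, which matters as soon as a page has more than one `<h1>`. A small self-contained illustration, using a made-up HTML string rather than the real page:

```python
import re

# Made-up sample with two headings, for illustration only.
sample_html = "<h1>First</h1><p>text</p><h1>Second</h1>"

# Non-greedy: each match stops at the nearest closing tag.
non_greedy = re.findall(r"<h1>.*?</h1>", sample_html)

# Greedy: one match runs from the first <h1> to the last </h1>.
greedy = re.findall(r"<h1>.*</h1>", sample_html)

print(non_greedy)
print(greedy)
```

Here the non-greedy pattern returns two separate headings, while the greedy one returns a single string spanning everything between the first `<h1>` and the last `</h1>`.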

What remains is to trim the string:

course = re.findall('<h1>.*?</h1>', html)
course = str(course[0])
course = course.replace('<h1>', '')
course = course.replace(' (<span>MALP</span>)</h1>', '')

Result:

course = re.findall('<h1>.*?</h1>', html)

course = str(course[0])

course = course.replace('<h1>', '')

course = course.replace(' (<span>MALP</span>)</h1>', '')

course
Out[40]: 'Master of Advanced Leadership Practice'

Wrap it in a function:

def get_course(url):
    html = urllib.request.urlopen(url)
    html = html.read().decode('utf-8')
    course = re.findall('<h1>.*?</h1>', html)
    course = str(course[0])
    course = course.replace('<h1>', '')
    course = course.replace(' (<span>MALP</span>)</h1>', '')
    return course

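This way, given the URL of another course at the same university, the function should extract that course's name as well (pardon my clumsy wording):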

get_course('http://www.massey.ac.nz/massey/learning/programme-course/programme.cfm?prog_id=93059')
Out[48]: 'Master of Counselling Studies (<span>MCounsStuds</span>)</h1>'

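Well, that is awkward. The cause is the second replace call: its search string is hard-coded to the MALP abbreviation, so it matches nothing on this page. It has to be built with a regular expression instead: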
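The failure can be reproduced without any network access, because `str.replace` silently returns the string unchanged when the search text is not found. The heading below is reconstructed from the output shown above:

```python
# Heading as it appears in the second course page, per the output above.
heading = "<h1>Master of Counselling Studies (<span>MCounsStuds</span>)</h1>"

cleaned = heading.replace("<h1>", "")
# The search text mentions MALP, which is absent here, so nothing is removed.
cleaned = cleaned.replace(" (<span>MALP</span>)</h1>", "")

print(cleaned)
```

The printed string still ends with the `<span>` abbreviation and the closing `</h1>`, exactly as in `Out[48]`.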

def get_course(url):
    html = urllib.request.urlopen(url)
    html = html.read().decode('utf-8')
    course = re.findall('<h1>.*?</h1>', html)
    course = str(course[0])
    course = course.replace('<h1>', '')
    # Find the actual ' (<span>ABBR</span>)</h1>' suffix for this page; raw string keeps \( literal
    repl = re.findall(r' \(<span>.*?</span>\)</h1>', course)[0]
    course = course.replace(repl, '')
    return course

Try again:

get_course('http://www.massey.ac.nz/massey/learning/programme-course/programme.cfm?prog_id=93059')
Out[69]: 'Master of Counselling Studies'

Done!
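The find-then-replace steps can also be folded into a single pattern with a capture group. This is an alternative sketch, not the approach used above: the function name `parse_course_name` is hypothetical, and the pattern assumes the heading looks like the two examples in these notes (a name optionally followed by a parenthesised `<span>` abbreviation).

```python
import re

# Capture the name; the trailing " (<span>ABBR</span>)" part is optional.
H1_PATTERN = re.compile(r"<h1>(.*?)\s*(?:\(<span>.*?</span>\))?</h1>")

def parse_course_name(html):
    """Return the course name from the first <h1>, or None if absent."""
    match = H1_PATTERN.search(html)
    return match.group(1) if match else None

print(parse_course_name(
    "<h1>Master of Counselling Studies (<span>MCounsStuds</span>)</h1>"))
print(parse_course_name("<p>no heading here</p>"))
```

Returning `None` when no heading is found, instead of raising `IndexError` as `course[0]` would, makes it easier to tell a missing page apart from a renamed course.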

In fact, BeautifulSoup can parse the source directly, which makes finding and locating elements faster. I will cover that in the next post.

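This was actually part of my first job in Guangzhou: checking whether each URL still existed and still pointed to the same course. The supervisor wanted it checked by hand... more than 1,000 URLs. He said he had checked them manually himself, haha; I was not willing to do that. At the time I also tried scraping the course names with R and struggled with it for a long time... it was rather cumbersome, and later I learned Python. If I had to do the check now, I reckon the 1,000-plus URLs would take about ten minutes. Just showing off a little; feel free to ignore this.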
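For the batch check described here, a rough sketch might look like the following. The function name `check_courses`, the shape of the `expected` mapping, and the status strings are all assumptions for illustration. The scraping function is passed in as a parameter, so `get_course` from these notes could be supplied as `get_name`.

```python
import urllib.error

def check_courses(expected, get_name):
    """Compare each URL's scraped course name with the expected one.

    expected: dict mapping URL -> expected course name (hypothetical input).
    get_name: callable that takes a URL and returns the scraped name.
    """
    report = {}
    for url, name in expected.items():
        try:
            actual = get_name(url)
        except (urllib.error.URLError, IndexError) as exc:
            # URLError covers HTTP errors; IndexError means no <h1> was found.
            report[url] = "unreachable or no <h1>: " + type(exc).__name__
            continue
        report[url] = "ok" if actual == name else "changed: " + repr(actual)
    return report
```

A call such as `check_courses({url: 'Master of Counselling Studies'}, get_course)` would then return one status string per URL, which is easy to dump to a spreadsheet for review.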
