
Python Crawler (8): Fetching NetEase News Content

This post shows how to fetch the title and body text of a NetEase news article. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 17 15:46:30 2017

@author: Administrator
"""
from bs4 import BeautifulSoup
import urllib.request
import http.cookiejar

#url = 'http://news.163.com/17/0717/10/CPHORRIE0001899O.html'
url = 'http://news.163.com/17/0717/16/CPIES9NG000187V9.html'

'''
1. Save the NetEase news page to a local HTML file
'''
# Set the request headers as a dictionary
headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
           "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
           "Connection": "keep-alive",
           "referer": "http://www.163.com/"}
# Set up cookie handling
cjar = http.cookiejar.CookieJar()
# Route HTTP traffic through a local proxy; this needs a proxy listening on 127.0.0.1:8888
proxy = urllib.request.ProxyHandler({'http': "127.0.0.1:8888"})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler, urllib.request.HTTPCookieProcessor(cjar))
# Build an empty list to hold the headers in the format opener.addheaders expects
headall = []
# Loop over the dictionary to build the list of (key, value) tuples
for key, value in headers.items():
    item = (key, value)
    headall.append(item)
# Attach the formatted headers to the opener
opener.addheaders = headall
# Install the opener globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
# The with statement closes the file even if the write fails
with open("D:/python/data/163/1.html", "wb") as fhandle:
    fhandle.write(data)

'''
2. Extract the news title and body text
'''
# Reuse the bytes downloaded above rather than requesting the page a second time
html1 = data.decode('gbk')
soup1 = BeautifulSoup(html1, 'lxml')
# Extract the news title
result1 = soup1.find_all("h1")
title = result1[0].string
print("News title: {}".format(title))
# Extract the block that contains the body
result2 = soup1.find_all(attrs={"class": "post_text"})
result2 = str(result2)
#print(result2)
# Extract the body paragraphs; the slice range is set by hand for this page
result3 = soup1.find_all("p")
content = result3[5:8]
print("News body:")
for i in content:
    print(i.string)

The result is shown in the figure below:

[Figure: screenshot of the program output]

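This post implements only a simple extraction of NetEase news content. Extracting the body still relies on a manually chosen slice range (result3[5:8] in the code), which is inflexible and leaves room for improvement.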
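One possible way to avoid the hand-set range is to collect only the paragraphs inside the post_text container that the code already locates. The following is a minimal sketch, not a tested replacement for the script: the extract_body helper and the SAMPLE_HTML page are hypothetical, the sample stands in for a real downloaded article, and the sketch assumes the body paragraphs are all p tags inside an element with class post_text. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for a downloaded NetEase article page.
SAMPLE_HTML = """
<html><body>
<p>Site navigation</p>
<h1>Sample headline</h1>
<div class="post_text">
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
  <p>   </p>
</div>
<p>Footer text</p>
</body></html>
"""


def extract_body(html, container_class="post_text"):
    """Return the non-empty paragraph texts inside the first matching container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_=container_class)
    if container is None:
        # No such container on this page: return nothing rather than guess
        return []
    paragraphs = []
    for p in container.find_all("p"):
        # get_text() also handles paragraphs with nested tags, where .string is None
        text = p.get_text().strip()
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    for line in extract_body(SAMPLE_HTML):
        print(line)
```

Because the search is limited to the container, navigation and footer paragraphs are skipped without counting positions. On a real page the class name may differ or change over time, so the returned list should be checked for being empty.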