
Structuring and Saving Data


1. Save the body text of a news article to a text file.

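newscontent = soup.select('.show-content')[0].text
# Write the text and close the file before reading it back.
with open('new.txt', 'w', encoding='utf-8') as f:
    f.write(newscontent)
with open('new.txt', 'r', encoding='utf-8') as f:
    print(f.read())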

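2. Structure the news data as a list of dictionaries:

Details of a single news item --> dictionary news
All news items on one list page --> list, built with newsls.append(news)
All news items from all list pages --> combined list, built with newstotal.extend(newsls)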
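
The three levels above can be sketched without any network access. This is only an illustration: the sample titles and click counts are made up, and only the names news, newsls, and newstotal come from the outline.

```python
# Hypothetical sample data standing in for two scraped list pages.
pages = [
    [("Title A", 120), ("Title B", 45)],
    [("Title C", 300)],
]

newstotal = []                      # all news items from all list pages
for page in pages:
    newsls = []                     # all news items from one list page
    for title, clickcount in page:
        news = {"title": title, "clickcount": clickcount}  # one news item
        newsls.append(news)         # append adds one element (the dictionary)
    newstotal.extend(newsls)        # extend adds each element of newsls

print(len(newstotal))
print(newstotal[0])
```

Using append in the last step instead of extend would produce a list of lists, which pandas.DataFrame would not turn into one row per news item.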

import re

import pandas
import requests
from bs4 import BeautifulSoup

firstpage = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
url = 'http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0411/9205.html'

# Read the total number of news items from the first list page.
res = requests.get(firstpage)
res.encoding = 'utf-8'
soup1 = BeautifulSoup(res.text, 'html.parser')
newscount = int(soup1.select('.a1')[0].text.rstrip('條'))
newcount1 = newscount // 10 + 1  # approximate number of list pages (10 items each)
allnews = []


def getallnews(url, allnews):
    # Append one dictionary per news item found on the list page at url.
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup1 = BeautifulSoup(res.text, 'html.parser')
    news = soup1.select('.news-list')[0].select('li')
    for i in news:
        news1 = i.a.attrs['href']
        title = gettitle(news1)
        datetime = getdatetime(news1)
        source = getsource(news1)
        # The article id is the number right before ".html" in the URL.
        # Capturing only the digits avoids rstrip('.html'), which strips
        # a set of characters rather than a suffix.
        newsurl4 = re.search(r'(\d{2,})\.html', news1).group(1)
        newid = 'http://oa.gzcc.cn/api.php?op=count&id=' + newsurl4 + '&modelid=80'
        clickcount = getclickcount(newid)
        dictionary = {}
        dictionary['clickcount'] = clickcount
        dictionary['title'] = title
        dictionary['datetime'] = datetime
        dictionary['source'] = source
        allnews.append(dictionary)
    return allnews


def getclickcount(newurl):
    # The counter API returns a script snippet; the count follows the last ".html".
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser').text
    click = soup.split('.html')
    res5 = int(click[-1].lstrip("('").rstrip("');"))
    return res5


def gettitle(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    title = soup.select('.show-title')[0].text
    return title


def getdatetime(newurl):
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    t3 = soup.select('.show-info')[0].text
    t4 = t3.split()
    t5 = t4[0].lstrip('發布時間:')  # first field holds the label plus the date
    datetime1 = t5 + ' ' + t4[1]    # second field holds the time
    return datetime1


def getsource(newurl):
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    t3 = soup.select('.show-info')[0].text
    t4 = t3.split()
    t5 = t4[4].lstrip('來源:')  # fifth field holds the label plus the source
    return t5


# Scrape list pages 2 to 5, accumulating every item in allnews.
for i in range(2, 6):
    pageurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    hao = getallnews(pageurl, allnews)

# Build the DataFrame once, after all pages have been collected.
df = pandas.DataFrame(hao)
print(df)
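
In the code above, gettitle, getdatetime, and getsource each request the same article page, so the parsing is hard to check without the network. As a rough sketch of how the text parsing could be tested offline, the helpers below work on plain strings. The sample strings and the names parse_info and parse_clickcount are hypothetical and not part of the original code; the real page text and counter response may look different.

```python
import re

# Hypothetical samples; the real page text and API response may differ.
SAMPLE_INFO = "發布時間:2018-04-11 10:30:00 作者:A 審核:B 來源:學校綜合辦 點擊:0次"
SAMPLE_COUNT = "$('#todaydowns').html('5');$('#hits').html('1234');"


def parse_info(text):
    """Return (datetime string, source) from an info line like SAMPLE_INFO."""
    when = re.search(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})", text)
    source = re.search(r"來源:(\S+)", text)
    return (
        " ".join(when.groups()) if when else None,
        source.group(1) if source else None,
    )


def parse_clickcount(text):
    """Return the number inside the last .html('...') call, or None."""
    found = re.findall(r"\.html\('(\d+)'\)", text)
    return int(found[-1]) if found else None


print(parse_info(SAMPLE_INFO))
print(parse_clickcount(SAMPLE_COUNT))
```

Matching on patterns rather than on field positions means a missing field yields None instead of an IndexError, at the cost of depending on the exact label text.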


3. Install pandas, then use pandas.DataFrame(newstotal) to create a DataFrame object df.

df = pandas.DataFrame(hao)
print(df)
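
As a small sketch of what pandas.DataFrame does with a list of dictionaries: each dictionary becomes a row and each key becomes a column. The rows below are made up and only reuse the keys the scraper produces; pandas itself is typically installed with pip install pandas.

```python
import pandas

# Made-up rows with the same keys the scraper produces.
newstotal = [
    {"clickcount": 3200, "title": "Title A",
     "datetime": "2018-04-11 10:30:00", "source": "學校綜合辦"},
    {"clickcount": 150, "title": "Title B",
     "datetime": "2018-04-10 09:00:00", "source": "國際學院"},
]

df = pandas.DataFrame(newstotal)
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names come from the dictionary keys
```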

4. Use df to save the extracted data to a CSV or Excel file.

df.to_excel('text.xlsx')
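
Writing .xlsx files with to_excel requires an Excel writer package such as openpyxl to be installed. For the CSV option, a minimal sketch with made-up rows follows. The file name news.csv is an assumption, and the utf_8_sig encoding is chosen here so that spreadsheet programs are more likely to detect the encoding of the Chinese text.

```python
import os
import tempfile

import pandas

# Made-up rows standing in for the scraped data.
df = pandas.DataFrame([
    {"clickcount": 3200, "title": "Title A", "source": "學校綜合辦"},
    {"clickcount": 150, "title": "Title B", "source": "國際學院"},
])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "news.csv")  # hypothetical file name
    df.to_csv(path, index=False, encoding="utf_8_sig")  # index=False drops the row labels
    restored = pandas.read_csv(path, encoding="utf_8_sig")

print(restored)
```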

5. Analyze the data with the functions and methods pandas provides:

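Extract the first 6 rows, keeping the click count, title, and source columns:

print(df[['clickcount', 'title', 'source']].head(6))

Extract the news published by '學校綜合辦' (the school general office) with a click count above 3000:

wa = df[(df['clickcount'] > 3000) & (df['source'] == '學校綜合辦')]
print(wa)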
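
Each comparison in such a filter has to be wrapped in parentheses, because & binds more tightly than > and == in Python. A short sketch on made-up rows, with the two conditions named separately to make the masks visible:

```python
import pandas

# Made-up rows; only the column names match the scraper's output.
df = pandas.DataFrame([
    {"clickcount": 3200, "title": "Title A", "source": "學校綜合辦"},
    {"clickcount": 2500, "title": "Title B", "source": "學校綜合辦"},
    {"clickcount": 4100, "title": "Title C", "source": "國際學院"},
])

popular = df["clickcount"] > 3000          # boolean Series, one value per row
from_office = df["source"] == "學校綜合辦"  # second boolean Series
selected = df[popular & from_office]       # keep rows where both are True

print(selected["title"].tolist())
```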

Extract the news published by '國際學院' (the international college) and '學生工作處' (the student affairs office):

sourcelist = ['國際學院', '學生工作處']
specialnews = df[df['source'].isin(sourcelist)]
print(specialnews)
specialnews.to_excel('hello.xlsx')
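
For reference, isin returns a boolean mask that is True where the value appears in the given list, and ~ inverts that mask. A brief sketch with made-up rows:

```python
import pandas

# Made-up rows; only the column names match the scraper's output.
df = pandas.DataFrame([
    {"title": "Title A", "source": "國際學院"},
    {"title": "Title B", "source": "學校綜合辦"},
    {"title": "Title C", "source": "學生工作處"},
])

sources = ["國際學院", "學生工作處"]
mask = df["source"].isin(sources)   # True where source is one of the listed values

print(df[mask]["title"].tolist())   # rows from the listed sources
print(df[~mask]["title"].tolist())  # all remaining rows
```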
