Structuring and Saving Data

阿新 • Published: 2018-04-12


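1. Save the body text of a news article to a text file.

newscontent = soup.select('.show-content')[0].text
# Write the article body; the with-block closes the file before it is read back
with open('new.txt', 'w', encoding='utf-8') as f:
    f.write(newscontent)
with open('new.txt', 'r', encoding='utf-8') as f:
    print(f.read())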
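Reading a file back is only reliable once the write handle has been closed. The following is a minimal sketch of that round trip, assuming the text is UTF-8. The helper names `save_text` and `load_text`, the temporary directory, and the sample string are illustrative and do not come from the original post.

```python
import os
import tempfile


def save_text(path, text):
    # The with-block flushes and closes the file even if write() raises
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def load_text(path):
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


# Hypothetical stand-in for soup.select('.show-content')[0].text
newscontent = 'Sample article body\nsecond line'
path = os.path.join(tempfile.mkdtemp(), 'new.txt')
save_text(path, newscontent)
roundtrip = load_text(path)
print(roundtrip)
```

Passing `encoding='utf-8'` explicitly avoids depending on the platform default, which matters for Chinese text on Windows.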

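2. Structure the news data as a list of dictionaries:

Details of a single news item --> dictionary news
All news items on one list page --> list, built with newsls.append(news)
All news items across all list pages --> combined list, built with newstotal.extend(newsls)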
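The difference between `append` and `extend` is what keeps the final result a flat list of dictionaries. A small sketch with made-up titles in place of scraped pages (the function `build_page` and the `pages` mapping are hypothetical):

```python
def build_page(page_no, titles):
    # One dictionary per news item, collected into a per-page list
    newsls = []
    for title in titles:
        news = {'title': title, 'page': page_no}
        newsls.append(news)  # append adds the dict as a single element
    return newsls


newstotal = []
# Hypothetical titles standing in for two scraped list pages
pages = {1: ['A', 'B'], 2: ['C']}
for page_no, titles in pages.items():
    # extend adds each dict from the page list, so newstotal stays flat
    newstotal.extend(build_page(page_no, titles))

print(len(newstotal))
print(newstotal[2])
```

Using `newstotal.append(newsls)` instead would produce a list of lists, which `pandas.DataFrame` would not turn into one row per news item.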

import re

import pandas
import requests
from bs4 import BeautifulSoup

firstpage = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
url = 'http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0411/9205.html'

# Read the total number of news items from the first list page
res = requests.get(firstpage)
res.encoding = 'utf-8'
soup1 = BeautifulSoup(res.text, 'html.parser')
newscount = int(soup1.select('.a1')[0].text.rstrip('條'))
# Number of list pages, at 10 items per page
newcount1 = newscount // 10 + 1
allnews = []


def getallnews(url, allnews):
    # Append one dictionary per news item found on the list page at url
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup1 = BeautifulSoup(res.text, 'html.parser')
    news = soup1.select('.news-list')[0].select('li')
    for i in news:
        news1 = i.a.attrs['href']
        title = gettitle(news1)
        datetime = getdatetime(news1)
        source = getsource(news1)
        # The news id is the run of digits just before ".html";
        # capturing it directly avoids rstrip('.html'), which strips characters, not a suffix
        newsurl4 = re.search(r'(\d{2,})\.html', news1).group(1)
        newid = 'http://oa.gzcc.cn/api.php?op=count&id=' + newsurl4 + '&modelid=80'
        clickcount = getclickcount(newid)
        dictionary = {}
        dictionary['clickcount'] = clickcount
        dictionary['title'] = title
        dictionary['datetime'] = datetime
        dictionary['source'] = source
        allnews.append(dictionary)
    return allnews


def getclickcount(newurl):
    # The click count is whatever follows the last ".html" in the response text
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser').text
    click = soup.split('.html')
    res5 = int(click[-1].lstrip("('").rstrip("');"))
    return res5


def gettitle(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    title = soup.select('.show-title')[0].text
    return title


def getdatetime(newurl):
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    t3 = soup.select('.show-info')[0].text
    t4 = t3.split()
    # lstrip removes leading label characters, leaving the date
    t5 = t4[0].lstrip('發布時間:')
    datetime1 = t5 + ' ' + t4[1]
    return datetime1


def getsource(newurl):
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    t3 = soup.select('.show-info')[0].text
    t4 = t3.split()
    # The fifth whitespace-separated field holds the source label and name
    t5 = t4[4].lstrip('來源：')
    return t5


# Scrape list pages 2 through 5, accumulating every item in allnews
for i in range(2, 6):
    pageurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    hao = getallnews(pageurl, allnews)
df = pandas.DataFrame(hao)
print(df)
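The two string-parsing steps in the scraper (pulling the news id out of an article URL and the click count out of the count API response) can be checked without any network access. The sketch below uses regular expressions for both. The function names are hypothetical, and the sample response string is an assumption about the API's output shape, inferred only from how the code above splits it.

```python
import re


def extract_news_id(news_url):
    # Capture only the digits immediately before ".html"
    match = re.search(r'(\d{2,})\.html', news_url)
    return match.group(1) if match else None


def parse_click_count(api_text):
    # Assumed response shape: ...$('#hits').html('1234');
    # The pattern is anchored at the end so the last .html('N'); wins
    match = re.search(r"\.html\('(\d+)'\);\s*$", api_text)
    return int(match.group(1)) if match else None


news_id = extract_news_id('http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0411/9205.html')
clicks = parse_click_count("$('#todaydowns').html('5');$('#hits').html('1234');")
print(news_id, clicks)
```

Returning `None` on a failed match, rather than letting `.group(1)` raise, makes it easier to skip list items whose links do not follow the expected pattern.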

3. Install pandas, then call pandas.DataFrame(newstotal) to create a DataFrame object df.

df = pandas.DataFrame(hao)
print(df)
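pandas can be installed with `pip install pandas`. A list of dictionaries maps onto a DataFrame with one row per dictionary and one column per key. A minimal sketch with two made-up records that use the same keys as the scraper (the values are placeholders, not scraped data):

```python
import pandas

# Hypothetical records with the same keys the scraper produces
newstotal = [
    {'clickcount': 3200, 'title': 'News A',
     'datetime': '2018-04-11 10:00:00', 'source': 'Office X'},
    {'clickcount': 150, 'title': 'News B',
     'datetime': '2018-04-10 09:30:00', 'source': 'Office Y'},
]
df = pandas.DataFrame(newstotal)
print(df.shape)
print(sorted(df.columns))
```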

4. Use df to save the extracted data to a csv or excel file.

df.to_excel('text.xlsx')
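Writing `.xlsx` files with `to_excel` requires an Excel writer engine such as openpyxl to be installed. For the csv route, here is a sketch that writes to a temporary path and reads the file back. The sample rows and the file name `news.csv` are placeholders.

```python
import os
import tempfile

import pandas

df = pandas.DataFrame([
    {'clickcount': 3200, 'title': '新聞A', 'source': '學校綜合辦'},
    {'clickcount': 150, 'title': '新聞B', 'source': '國際學院'},
])
path = os.path.join(tempfile.mkdtemp(), 'news.csv')
# index=False omits the row index column;
# utf-8-sig writes a BOM, which helps Excel detect the encoding of Chinese text
df.to_csv(path, index=False, encoding='utf-8-sig')
restored = pandas.read_csv(path, encoding='utf-8-sig')
print(restored)
```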

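5. Analyze the data with the functions and methods pandas provides:

Extract the first 6 rows, with the click count, title, and source columns.

print(df[['clickcount', 'title', 'source']].head(6))

Extract the news published by '學校綜合辦' whose click count exceeds 3000.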

wa = df[(df['clickcount'] > 3000) & (df['source'] == '學校綜合辦')]
print(wa)
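Combining two conditions in pandas uses element-wise `&` on boolean Series rather than Python's `and`. A small sketch on made-up rows, using the 3000 threshold from the task description:

```python
import pandas

df = pandas.DataFrame({
    'clickcount': [3500, 2500, 4000],
    'title': ['News A', 'News B', 'News C'],
    'source': ['學校綜合辦', '學校綜合辦', '國際學院'],
})
# Each comparison needs its own parentheses because & binds tighter than > and ==
mask = (df['clickcount'] > 3000) & (df['source'] == '學校綜合辦')
wa = df[mask]
print(wa)
```

Only the first row passes both conditions: the second fails the click threshold and the third has a different source.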

Extract the news published by '國際學院' and '學生工作處'.

sourcelist = ['國際學院', '學生工作處']
specialnews = df[df['source'].isin(sourcelist)]
print(specialnews)
specialnews.to_excel('hello.xlsx')
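`Series.isin` tests membership against a list of values, which reads better than chaining `==` comparisons with `|` as the list grows. A sketch on made-up rows showing that the two forms select the same records:

```python
import pandas

df = pandas.DataFrame({
    'title': ['News A', 'News B', 'News C'],
    'source': ['國際學院', '學生工作處', '學校綜合辦'],
})
sourcelist = ['國際學院', '學生工作處']
specialnews = df[df['source'].isin(sourcelist)]

# Equivalent selection written as explicit comparisons joined with |
same = df[(df['source'] == '國際學院') | (df['source'] == '學生工作處')]
print(specialnews)
```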
