1. 程式人生 > python:爬取新浪新聞的內容

python:爬取新浪新聞的內容


import requests
import json
from bs4 import BeautifulSoup
import re
import pandas
import sqlite3


# Template URL for Sina's comment-count JSONP API; the `{}` placeholder is
# filled with the news id extracted from an article URL.
# NOTE(review): the hard-coded callback name jsonp_1543748934208 must match
# the wrapper stripped from the response in getCommentCounts — confirm the
# API still honours an arbitrary callback value.
commenturl='https://comment.sina.com.cn/page/info?version=1&format=json' \
           '&channel=gn&newsid=comos-{}&group=undefined&compress=0&' \
           'ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread' \
           '=1&callback=jsonp_1543748934208'
# Fetch the comment count of one article
def getCommentCounts(newsurl):
    """Return the total comment count for one Sina news article.

    newsurl: article URL containing a ``doc-i<newsid>.shtml`` segment.
    Returns 0 when no news id can be extracted, instead of crashing
    the whole crawl on an unexpected URL shape.
    """
    # Extract the per-article news id (raw string; the '.' before 'shtml'
    # is escaped — the original pattern matched any character there).
    m = re.search(r'doc-i(.*)\.shtml', newsurl)
    if m is None:
        return 0
    newsid = m.group(1)
    # Fill the {} placeholder in the comment-API URL template.
    comments = requests.get(commenturl.format(newsid))
    # The response is JSONP: jsonp_1543748934208({...}).  Pull the JSON
    # payload out from between the parentheses — str.strip() with a
    # character set (as the original code used) can eat characters that
    # belong to the payload itself.
    payload = re.search(r'\((.*)\)', comments.text, re.S)
    jd = json.loads(payload.group(1))
    return jd['result']['count']['total']

# 提取每則新聞的內文
def getNewsDetail(newsurl):
    """Fetch one Sina article page and return its fields as a dict.

    newsurl: full URL of the article page.
    Returns a dict with keys: title, time, source, article, editor, comment.
    Raises IndexError if the page does not contain the expected elements.
    """
    result = {}
    rsp = requests.get(newsurl)
    rsp.encoding = 'utf-8'
    soup = BeautifulSoup(rsp.text, 'html.parser')
    # Headline
    result['title'] = soup.select('.main-title')[0].text
    # Publication date
    result['time'] = soup.select('.date')[0].text
    # Source / publisher
    result['source'] = soup.select('.source')[0].text
    # Body text: join every paragraph except the last (editor byline).
    result['article'] = ' '.join(p.text.strip() for p in soup.select('#article p')[:-1])
    # Editor: remove the literal "責任編輯:" prefix.  The original code used
    # lstrip('責任編輯:'), which strips *any* of those characters and could
    # eat the start of the editor's actual name.
    editor = soup.select('.show_author')[0].text
    prefix = '責任編輯:'
    if editor.startswith(prefix):
        editor = editor[len(prefix):]
    result['editor'] = editor
    # Comment count via the JSONP comment API.
    result['comment'] = getCommentCounts(newsurl)
    return result

# 獲取分頁連結
def parseListLinks(url):
    """Fetch one page of the Sina roll-news feed and scrape every article.

    url: feed URL whose response has the JSONP form
         ``try{feedCardJsonpCallback({...});}catch(e){};``
    Returns a list of per-article dicts (see getNewsDetail).
    Raises ValueError if the response is not in the expected JSONP shape.
    """
    rsp = requests.get(url)
    # Extract the JSON payload from the JSONP wrapper with a regex.  The
    # original lstrip/rstrip approach strips *character sets*, which chewed
    # off the payload's own leading '{' and trailing '}}' and then had to
    # patch them back by hand ('{' + ... + '}}').
    m = re.search(r'feedCardJsonpCallback\((.*)\)\s*;?\s*\}\s*catch', rsp.text, re.S)
    if m is None:
        raise ValueError('unexpected JSONP response from feed URL')
    jd = json.loads(m.group(1))
    # Scrape every article linked from this feed page.
    newsdetails = []
    for ent in jd['result']['data']:
        newsdetails.append(getNewsDetail(ent['url']))
    return newsdetails

# Feed-list URL template; {} is replaced with the page number below.
url='https://feed.sina.com.cn/api/roll/' \
    'get?pageid=121&lid=1356&num=20&versionNumber=1.2.4' \
    '&page={}&encode=utf-8&callback=feedCardJsonpCallback&_'
# Crawl feed pages 1..2 and accumulate every scraped article.
# Widen the range to fetch more pages.
news_total = []
for page_no in range(1, 3):
    news_total.extend(parseListLinks(url.format(page_no)))
# Tabulate the scraped records with pandas and save an Excel workbook.
df = pandas.DataFrame(news_total)
df.to_excel('news.xlsx')