Scraping with Python

阿新 • Published: 2017-06-12

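I want to scrape Obama's weekly addresses from http://www.putclub.com/html/radio/VOA/presidentspeech/index.html

Extracting them by hand means opening each page one by one, then copying and saving the text, which is very tedious.

Is there a way to do it all in one go? With a language as capable as Python, it can be done quickly.

First, let's look at the source of this page.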

[Image: page source]

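You can see that the information we want sits in short URL snippets like this one: [image].

More specifically, we need to visit every URL of the form http://www.putclub.com/html/radio/VOA/presidentspeech/2014/0928/91326.html, and those URLs have to be extracted from the index page above.

Right, let's start writing code.

First, open the index page and store it in content.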

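import urllib

# Index page that lists the weekly addresses.
url = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"
wp = urllib.urlopen(url)  # Python 2 API; Python 3 moved this to urllib.request.urlopen
print "start download..."
content = wp.read()  # raw HTML of the index page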
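The script in this post is Python 2. For readers on Python 3, here is a minimal sketch of the same download step. The `fetch` helper, its `opener` parameter, and the UTF-8 default are my own assumptions for illustration, not part of the original script; check the page's real encoding before relying on it.

```python
import urllib.request

# Index page from the article.
INDEX_URL = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"


def fetch(url, opener=urllib.request.urlopen, encoding="utf-8"):
    """Download `url` and return its body as text.

    `opener` is a hypothetical hook so the function can be tried
    without network access; `encoding` is an assumption about the site.
    """
    response = opener(url)
    try:
        raw = response.read()  # bytes in Python 3
    finally:
        response.close()
    # Replace undecodable bytes instead of raising, since the
    # real encoding of the page is not confirmed here.
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    print("start download...")
    content = fetch(INDEX_URL)
    print(len(content))
```

Passing the opener as a parameter is only a convenience for testing; calling `fetch(INDEX_URL)` behaves like the two-line version above.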

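Next, extract the content of each speech.

The idea is to search for "center_box" and then take whatever lies between each "href=" and "target" that follows it. To see why these two markers work, look at the page source.

The result is each article's URL path; prefix it with www.putclub.com and you have the article's full address.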

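# Sanity check: how often does the marker occur?
print content.count("center_box")
# Skip ahead to the list of speeches.
content = content[content.find("center_box")+1:]
# Take the text between 'href="/' and '" target': the path of the first article.
content = content[content.find("href=")+7:content.find("target")-2]
filename = content
url = "http://www.putclub.com/" + content
print content
print url

# Download the article page itself.
wp = urllib.urlopen(url)
print "start download..."
content = wp.read()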
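The code above picks out only the first link after "center_box". To walk through every article, the same search can be repeated in a loop. Here is a rough Python 3 sketch. The names `extract_article_paths` and `to_absolute` and the `must_contain` filter are hypothetical, and instead of cutting at "target" it cuts at the closing quote of the href, which is a small variation on the approach described above.

```python
from urllib.parse import urljoin


def extract_article_paths(index_html, start_marker="center_box",
                          must_contain="presidentspeech/"):
    """Collect href values that appear after `start_marker`.

    Only paths containing `must_contain` are kept (an assumed filter
    to skip unrelated links); duplicates are dropped, order is kept.
    """
    paths = []
    pos = index_html.find(start_marker)
    if pos == -1:
        return paths  # marker not found: nothing to extract
    prefix = 'href="'
    while True:
        start = index_html.find(prefix, pos)
        if start == -1:
            break
        start += len(prefix)
        end = index_html.find('"', start)  # closing quote of the href
        if end == -1:
            break
        path = index_html[start:end]
        if must_contain in path and path not in paths:
            paths.append(path)
        pos = end
    return paths


def to_absolute(path, base="http://www.putclub.com/"):
    """Turn a site-relative path into a full URL."""
    return urljoin(base, path)
```

Each returned path can then be passed through `to_absolute` and downloaded in turn. On the real page the filter may need adjusting, since I have not checked which other links follow the marker.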

With the article's URL in hand, filter its content the same way.

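#print content
# Sanity check: how many times does the article container occur?
print content.count("<div class=\"content\"")
# Keep everything from the article container onward
# (searching for an empty string would return 0 and keep the whole page).
content = content[content.find("<div class=\"content\""):]
# Cut off everything from the pagination block onward.
content = content[:content.find("<div class=\"dede_pages\"")-1]
# Keep only the part of the path after "presidentspeech/", e.g. 2014/0928/91326.html.
filename = filename[filename.find("presidentspeech")+len("presidentspeech/"):]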
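One thing to watch with slicing on `find()`: when a marker is missing, `find()` returns -1 and the slice silently cuts the wrong place. Here is a small Python 3 sketch of a more defensive version of the same idea; `extract_between` is a hypothetical helper name, and the two markers are the ones used above.

```python
def extract_between(html, start_marker, end_marker):
    """Return the text from `start_marker` up to, but not including, `end_marker`.

    Returns "" if `start_marker` is absent; if `end_marker` is absent,
    everything from `start_marker` to the end is returned.
    """
    start = html.find(start_marker)
    if start == -1:
        return ""
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)
    return html[start:end]


# Markers taken from the article's code.
START = '<div class="content"'
END = '<div class="dede_pages"'
```

Usage would be `body = extract_between(content, START, END)`. The result still contains HTML tags, just like the original script's output.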

Finally, save it and print it.

# Turn the remaining slashes into dashes so the name is a plain file name, not a path.
filename = filename.replace('/', "-")
fp = open(filename, "w")
fp.write(content)
fp.close()
print content
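For reference, the file-naming and saving steps could look like this in Python 3. This is a sketch only: `filename_from_path` and `save_text` are names I made up, and writing UTF-8 is an assumption.

```python
import os


def filename_from_path(path, marker="presidentspeech/"):
    """Derive a flat file name from an article path.

    Keeps the part after `marker` and replaces slashes with dashes;
    if `marker` is absent, the whole path is used.
    """
    idx = path.find(marker)
    tail = path[idx + len(marker):] if idx != -1 else path.lstrip("/")
    return tail.replace("/", "-")


def save_text(directory, filename, text, encoding="utf-8"):
    """Write `text` to `directory/filename` and return the full path."""
    full_path = os.path.join(directory, filename)
    # The with-block closes the file even if the write fails.
    with open(full_path, "w", encoding=encoding) as fp:
        fp.write(text)
    return full_path
```

For the example URL earlier in the post, `filename_from_path("html/radio/VOA/presidentspeech/2014/0928/91326.html")` gives `2014-0928-91326.html`.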

OK, all done! Save the script as a .pyw file, and from then on a double-click is all it takes to save Obama's weekly address.
