
Python web crawler: scraping Sina Tech articles (BeautifulSoup + MySQL)

The past few days of work have paid off: the crawler for Sina Tech is finished. It will keep crawling Sina Tech articles until there is nothing new left to fetch.

If you want to know more, check out my GitHub: https://github.com/libp/WebSpider

If you want the database table structure, leave your email address~
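If you just need something to run the code against, the sketch below is one possible definition of tbl_peng_article, inferred only from the columns used in the INSERT statement in spiderSinaTech further down; the column types and lengths are guesses, not the table I actually use, so adjust them to taste. createTime holds the article's published time and getTime the time it was scraped.

# A possible schema for tbl_peng_article, inferred from the INSERT in spiderSinaTech.
# Column types and lengths here are assumptions; adjust them to your own needs.
import MySQLdb

DDL = """
CREATE TABLE IF NOT EXISTS tbl_peng_article (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    author     VARCHAR(128),
    content    MEDIUMTEXT,
    createTime DATETIME,
    getTime    DATETIME,
    url        VARCHAR(512) NOT NULL,
    webname    VARCHAR(64)
) DEFAULT CHARSET=utf8
"""

conn = MySQLdb.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='nichuiniu', charset='utf8')
cur = conn.cursor()
cur.execute(DDL)
conn.commit()
cur.close()
conn.close()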

# -*- coding: utf-8 -*-


__author__ = 'Peng'
from bs4 import BeautifulSoup,Comment
import urllib2
from urllib2 import urlopen,HTTPError
import MySQLdb
import json
import datetime
import logging
import sys
import re
import time

# send log output to the console (stdout)
logging.basicConfig(level=logging.DEBUG,
                format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                datefmt='%a, %d %b %Y %H:%M:%S',
                stream=sys.stdout)


def spiderSinaTech(url, webname):
    data = getSinaArticle(url, webname)
    if data is None:
        # the target page could not be parsed
        return -1

    conn = getConn()
    cur = conn.cursor()
    result = 0
    try:
        sqlInsertArticle = ("insert into tbl_peng_article "
                            "(title,author,content,createTime,getTime,url,webname) "
                            "values (%s,%s,%s,%s,%s,%s,%s)")
        result = cur.execute(sqlInsertArticle, (data['title'], data['author'], data['article'],
                                                data['published_time'], data['getTime'],
                                                data['url'], data['webname']))
    except MySQLdb.Error as e:
        print "Mysql Error %d: %s" % (e.args[0], e.args[1])
    conn.commit()
    cur.close()
    conn.close()
    return result


def getSinaArticle(url, webname):
    # dict holding the scraped values that will be returned for storage
    dict = {'url': url, 'title': '', 'published_time': '', 'getTime': '', 'author': '', 'article': '', 'webname': webname}

    # request headers
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36",
               "Accept": "*/*"}

    # fetch the page
    try:
        dict['url'] = url
        request = urllib2.Request(url, headers=headers)
        html = urlopen(request)
    except HTTPError as e:
        print(e)
        return None
    # read the page and parse it into a document tree
    soup = BeautifulSoup(html.read(), "lxml")

    # strip HTML comments
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()

    # strip <script> tags
    [s.extract() for s in soup('script')]

    try:
        # title
        title = soup.find(id="main_title").get_text()
        # print(title)
        dict['title'] = title
    except AttributeError:
        # no main_title element: not an article page this parser understands
        return None

    # published time
    published_time = soup.find(property="article:published_time")['content']
    # e.g. 2017-06-03T11:31:53+08:00 -- ISO 8601 with a fixed +08:00 offset;
    # Python 2's strptime has no %z, so the offset is matched literally
    # print(published_time)
    UTC_FORMAT = "%Y-%m-%dT%H:%M:%S+08:00"
    dict['published_time'] = datetime.datetime.strptime(published_time, UTC_FORMAT)

    # author
    author = soup.find(property="article:author")['content']
    # print(author)
    dict['author'] = author

    # article body
    content = soup.find(id="artibody")
    img = content.find_all(class_="img_wrapper")
    # remove image wrapper tags from the document tree
    for del_img in img:
        del_img.decompose()

    # child nodes (paragraphs) of the article body
    paragraph = content.contents

    # the article HTML that will be stored in the database
    article = ""
    for child in paragraph:
        article += str(child)
    # print(article)
    dict['article'] = article
    # print json.dumps(dict)
    # datetime objects are not JSON-serializable, so dumping the dict
    # directly would require a custom date encoder
    # return json.dumps(dict)

    # time the article was scraped
    dict['getTime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return dict

def getConn():
    conn = MySQLdb.connect(
        host='localhost',
        port=3306,
        user='root',
        passwd='root',
        db='nichuiniu',
        charset='utf8',
    )
    return conn

def GOSina(url, webname):
    # request headers
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36",
               "Accept": "*/*"}

    # fetch the listing page
    try:
        request = urllib2.Request(url, headers=headers)
        html = urlopen(request)
    except HTTPError as e:
        print(e)
        return 0
    # read the page and parse it into a document tree
    soup = BeautifulSoup(html.read(), "lxml")
    conn = getConn()
    cur = conn.cursor()
    # list of article URLs inserted into the database during this pass
    L = []
    for link in soup.findAll("a", href=re.compile(r'(.*?)(tech)(.*?)(\d{4}-\d{2}-\d{2})(/doc-ify)')):
        if 'href' in link.attrs:
            # extract the article URL from href and strip any pagination parameters after .shtml
            xurl = re.compile(r'(.*?shtml)').search(link.attrs['href']).group(1)
            sqlQueryUrl = "select * from tbl_peng_article where url=%s"
            # print link.attrs['href']
            result = cur.execute(sqlQueryUrl, (xurl,))
            conn.commit()
            if result == 0:
                rs = spiderSinaTech(xurl, webname)
                if rs > 0:
                    logging.info("----URL has been inserted into database: %s" % xurl)
                    L.append(xurl)
                    time.sleep(2)
                elif rs == -1:
                    logging.info("****URL content cannot be parsed: %s" % xurl)
            else:
                logging.info("&&&&URL already in database: %s" % xurl)
    cur.close()
    conn.close()
    # return the last inserted URL so the next pass can start from it;
    # return 0 when nothing new was found, which stops the crawl
    if L:
        return L[-1]
    else:
        return 0

logging.info("begin spider sina tech")
url="http://tech.sina.com.cn/it/2017-06-07/doc-ifyfuzny3756083.shtml"
webname="sina"
x = GOSina(url,webname)
if x!= 0:
    GOSina(x,webname)

logging.info("end spider sina tech")