[Crawler] A simple Java crawler for the popular articles on the Sogou WeChat home page

阿新 • Published: 2019-01-19

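At work I needed to display the popular articles from Sogou WeChat. After some research I could not find a good free API for this, so I wrote a small crawler myself. It is very simple.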

The class that stores a Sogou popular article:

/**
 * @author TangLei
 */
public class Article{

    // Avatar image
    private String headImg;
    // Title
    private String topic;
    // Article link
    private String articleUrl;
    // Article image
    private String articleImg;
    // Read count
    private String readCount;
    // Timestamp
    private String time;

    public String getHeadImg() {
        return headImg;
    }

    public void setHeadImg(String headImg) {
        this.headImg = headImg;
    }

    public String getTopic() {
        return topic;
    }

    public void setTopic(String topic) {
        this.topic = topic;
    }

    public String getArticleUrl() {
        return articleUrl;
    }

    public void setArticleUrl(String articleUrl) {
        this.articleUrl = articleUrl;
    }

    public String getArticleImg() {
        return articleImg;
    }

    public void setArticleImg(String articleImg) {
        this.articleImg = articleImg;
    }

    public String getReadCount() {
        return readCount;
    }

    public void setReadCount(String readCount) {
        this.readCount = readCount;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }
}

Crawler implementation:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.alibaba.fastjson.JSON;

/**
 * Web crawler that fetches the "Sogou WeChat Search: subscription accounts and article content" page
 * @author TangLei
 */
public class Crawler {

    public String doCrawler(String url) {
        return dealPage(getPage(url));
    }

    /**
     * Reads the whole web page.
     * 
     * @param url the address of the web page
     * @return the full content of the web page
     */
    private String getPage(String url) {
        // StringBuilder is enough here: the buffer never leaves this method
        StringBuilder buffer = new StringBuilder();
        try {
            URL realUrl = new URL(url);
            URLConnection conn = realUrl.openConnection();
            conn.connect();
            // try-with-resources closes the reader even if reading fails, and
            // avoids a NullPointerException when the connection was never opened
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    buffer.append(line).append("\r\n");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("read web page error : " + e.getMessage());
        }
        return buffer.toString();
    }

    /**
     * Processes the fetched page and extracts the useful information.
     * 
     * @param page
     *            the content of the web page
     * @return the extracted information
     */
    private String dealPage(String page) {

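        // All regular expressions
        Pattern[] patterns = new Pattern[4];
        // [a-zA-Z] rather than [a-zA-z]: the latter also matches [ \ ] ^ _ and `
        patterns[0] = Pattern.compile("<p><img src=\"[a-zA-Z]+://[^\\s]*");
        patterns[1] = Pattern.compile("<h4><a uigs=.*</a></h4>");
        patterns[2] = Pattern.compile("<img src=.*\" onload=\"vrImgLoad\\(this,'fit',110,110\\)\"");
        // This literal must match the text on the fetched page character for character
        patterns[3] = Pattern.compile("閱讀&nbsp;.*</bb>");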

        // First, count how many articles there are
        int articleCount = 0;
        Matcher matcher = patterns[0].matcher(page);
        while (matcher.find()) {
            articleCount++;
        }

        // Create an array whose length matches the number of articles
        Article[] articles = new Article[articleCount];
        for (int i = 0; i < articleCount; i++) {
            articles[i] = new Article();
        }

        // Start processing
        int index = 0;
        for (int i = 0; i < patterns.length; i++) {

            index = 0;
            matcher = patterns[i].matcher(page);
            while (matcher.find()) {
                String find = matcher.group();
                if(i==0){
                    String[] splits = find.split("\"");
                    if(index<articleCount){
                        articles[index].setHeadImg(splits[1]);
                    }
                }else if (i == 1){
                    String[] splits = find.split("\"");
                    if(index<articleCount){
                        articles[index].setArticleUrl(splits[3]);
                        articles[index].setTopic(splits[6].substring(1,splits[6].length()-9));
                    }
                }else if (i == 2){
                    String[] splits = find.split("\"");
                    if(index<articleCount){
                        articles[index].setArticleImg(splits[1]);
                    }
                }else if (i == 3){
                    String[] splits = find.split("&nbsp;");
                    if(index<articleCount){
                        articles[index].setReadCount(splits[1]);
                        articles[index].setTime(splits[4].substring(13, 23));
                    }
                }
                index++;
            }
        }
        return JSON.toJSONString(articles);
    }

    public static void main(String[] args) throws Exception {
        String url = "http://weixin.sogou.com";
        Crawler crawler = new Crawler();
        System.out.println(crawler.doCrawler(url));
    }
}
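
To see how the first pattern and the split on double quotes work together, here is a small sketch run against made-up markup. The class name HeadImgExtractDemo, the method extractHeadImgs, and the sample HTML are assumptions for illustration only; the real Sogou page markup is not reproduced here and may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadImgExtractDemo {

    // Same shape as patterns[0] in dealPage, with the character class written as [a-zA-Z].
    private static final Pattern HEAD_IMG =
            Pattern.compile("<p><img src=\"[a-zA-Z]+://[^\\s]*");

    /** Returns the src value of every match, in page order. */
    static List<String> extractHeadImgs(String page) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HEAD_IMG.matcher(page);
        while (matcher.find()) {
            // A match runs from <p><img src=" up to the next whitespace.
            // Splitting it on the double quote leaves the URL at index 1.
            String[] splits = matcher.group().split("\"");
            if (splits.length > 1) {
                urls.add(splits[1]);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup, made up for this example.
        String page = "<p><img src=\"http://img.example.com/a.png\"></p>\r\n"
                + "<p><img src=\"https://img.example.com/b.png\"></p>\r\n";
        // prints [http://img.example.com/a.png, https://img.example.com/b.png]
        System.out.println(extractHeadImgs(page));
    }
}
```

The length check before reading index 1 is the main difference from the crawler above, which indexes into the split result directly and would throw an ArrayIndexOutOfBoundsException if the markup changed.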

This is a very simple crawler. Feel free to use it as a reference; it also works well as a first introduction to writing crawlers.
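
One thing the crawler leaves at its defaults is the connection itself: a URLConnection has no connect or read timeout unless you set one, and Java sends a "Java/<version>" User-Agent by default. The sketch below shows one possible way to configure these before connecting. The class name ConnectionSetupDemo, the timeout values, and the User-Agent string are illustrative assumptions, not settings known to be required by Sogou.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

class ConnectionSetupDemo {

    /** Creates a configured URLConnection. No network traffic happens until connect() or getInputStream(). */
    static URLConnection open(String url) {
        try {
            URLConnection conn = new URL(url).openConnection();
            conn.setConnectTimeout(5_000);   // milliseconds; illustrative value
            conn.setReadTimeout(10_000);     // milliseconds; illustrative value
            // Hypothetical User-Agent string, chosen only for this example
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; DemoCrawler/0.1)");
            return conn;
        } catch (IOException e) {
            // Wrap the checked exception so callers are not forced to declare it
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection conn = open("http://weixin.sogou.com");
        System.out.println(conn.getConnectTimeout());
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

In getPage, the returned connection could take the place of the plain realUrl.openConnection() call; the rest of the reading loop would stay the same.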
