Web Data Scraping: Dianping (大眾點評) Data

阿新 • Published: 2019-02-02

package com.atman.baiye.store.utils;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.atman.baiye.store.domain.AiCommonInfo;

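/**
 * Utility methods that extract store listings from the HTML of a
 * Dianping keyword-search results page using plain string matching.
 * [email protected]
 * Created: 2017-01-18 16:49:56
 */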
public class GoodsDianPingUtils {
    
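    // Base URL of the keyword search; the keyword itself is passed separately to WebHttpClient.
    public static final String DIANPING_URL = "http://www.dianping.com/search/keyword/1/0_";
    // City found by the most recent getLocationCity call (shared static state, not thread-safe).
    public static String location_city = "";

    /**
     * Extracts the currently selected city from the page header.
     * Leaves location_city unchanged if the expected markers are missing.
     */
    public static String getLocationCity(String data){
        String cityMarker = "J-city\">";
        int beginIndex = data.indexOf("city " + cityMarker);
        int endIndex = data.lastIndexOf("city-list J-city-list Hide");
        if(beginIndex < 0 || endIndex < beginIndex){
            // Page layout changed or the header is absent: keep the previous value.
            return location_city;
        }
        String str = data.substring(beginIndex, endIndex);
        System.out.println("str=="+str);
        int nameStart = str.indexOf(cityMarker) + cityMarker.length();
        int nameEnd = str.indexOf("<i class=\"icon i-arrow\">");
        if(nameEnd >= nameStart){
            location_city = str.substring(nameStart, nameEnd);
        }
        System.out.println("location_city:"+location_city);
        return location_city;
    }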
    
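    /**
     * Splits the search-result HTML into per-store fragments (at most 10).
     * Each fragment runs from a listing's picture block to its "附近</a>" link.
     */
    public static List<String> getItems(String data){
        List<String> list = new ArrayList<String>();
        String delimiter = "附近</a>";
        int startIndex = data.indexOf("<div class=\"pic\" >");
        int endIndex = data.lastIndexOf(delimiter);
        if(startIndex < 0 || endIndex < startIndex){
            // No listings found (empty result or changed layout).
            return list;
        }
        String dataStr = data.substring(startIndex, endIndex);
        String[] arrayitem = dataStr.split(delimiter);
        // Keep only the first 10 listings.
        for (int i = 0; i < arrayitem.length && i < 10; i++) {
            // split() drops the delimiter, so append it again to keep each fragment complete.
            String item = arrayitem[i] + delimiter;
            System.out.println("item:"+item);
            list.add(item);
        }
        return list;
    }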
    
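    /**
     * Pulls the store name, store URL, picture URL and price out of one listing fragment.
     * A field whose markers are missing falls back to a default instead of throwing.
     */
    public static Map<String, String> getElementVal(String item){
        Map<String, String> map = new HashMap<String, String>();

        // Store name: the text between <h4> and </h4>.
        String store_name = "";
        int beginIndex = item.indexOf("<h4>");
        int endIndex = item.indexOf("</h4>");
        if(beginIndex >= 0 && endIndex > beginIndex){
            store_name = item.substring(beginIndex+"<h4>".length(), endIndex);
            map.put("store_name", store_name);
            map.put("title", store_name);
        }
        System.out.println("store_name:"+store_name);

        // Store URL: the last href before the "#comment" anchor, appended to the site root.
        String store_url = "http://www.dianping.com/";
        int commentIndex = item.indexOf("#comment");
        if(commentIndex >= 0){
            String storeurlStr = item.substring(0, commentIndex);
            beginIndex = storeurlStr.lastIndexOf("<a href=\"");
            if(beginIndex >= 0){
                store_url += storeurlStr.substring(beginIndex+"<a href=\"".length());
            }
        }
        map.put("store_url", store_url);
        System.out.println("store_url:"+store_url);

        // Picture URL: the value of the data-src attribute that precedes the text block.
        String pic_url = "";
        beginIndex = item.indexOf("data-src=\"");
        endIndex = item.indexOf("<div class=\"txt\">");
        if(beginIndex >= 0 && endIndex > beginIndex){
            String pic_urlstr = item.substring(beginIndex+"data-src=\"".length(), endIndex);
            int closeIndex = pic_urlstr.indexOf("\"/>");
            if(closeIndex >= 0){
                pic_url = pic_urlstr.substring(0, closeIndex);
            }
        }
        map.put("pic_url", pic_url);
        // As in the original code, detail_url reuses the picture URL.
        map.put("detail_url", pic_url);
        System.out.println("pic_url:"+pic_url);

        // Price: the number after "<b>￥", before the address block; "0" if absent.
        String price = "0";
        beginIndex = item.indexOf("<b>￥");
        endIndex = item.indexOf("<div class=\"tag-addr\">");
        if(beginIndex >= 0 && endIndex > beginIndex){
            String pricestr = item.substring(beginIndex+"<b>￥".length(), endIndex);
            int closeIndex = pricestr.indexOf("</b>");
            if(closeIndex >= 0){
                price = pricestr.substring(0, closeIndex);
            }
        }
        map.put("price", price);
        System.out.println("price:"+price);
        // The key spelling "seserev_price" is kept because getGoodsInfoList reads it under this name.
        map.put("seserev_price", "0");
        return map;
    }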
    
    /**
     * Converts a search-result page into AiCommonInfo records.
     * The first argument is the page HTML (the parameter name jsonInfo is historical).
     * location_city must already have been set by getLocationCity.
     */
    public static List<AiCommonInfo> getGoodsInfoList(String jsonInfo, String keyword) {
        List<AiCommonInfo> aiCommonInfoList = new ArrayList<AiCommonInfo>();

        List<String> datalist = getItems(jsonInfo);
        for (String dataitem : datalist) {
            Map<String, String> map = getElementVal(dataitem);
            AiCommonInfo aiCommonInfo = new AiCommonInfo();
            aiCommonInfo.setTitle(map.get("title"));
            aiCommonInfo.setPicUrl(map.get("pic_url"));
            aiCommonInfo.setDetailUrl(map.get("detail_url"));
            aiCommonInfo.setKeyword(keyword);
            // The original set the type to 1002 and then overwrote it with 1006; only 1006 took effect.
            aiCommonInfo.setType(1006);
            aiCommonInfo.setSource(3);
            aiCommonInfo.setPrice(parseDoubleOrZero(map.get("price")));
            aiCommonInfo.setReservePrice(parseDoubleOrZero(map.get("seserev_price")));
            aiCommonInfo.setStoreName(map.get("store_name"));
            aiCommonInfo.setStoreUrl(map.get("store_url"));
            aiCommonInfo.setLocation(location_city);
            aiCommonInfoList.add(aiCommonInfo);
        }
        return aiCommonInfoList;
    }

    // Parses a price string, returning 0 when the value is missing or not numeric.
    private static double parseDoubleOrZero(String value) {
        if (value == null) {
            return 0;
        }
        try {
            return Double.parseDouble(value.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }
    
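    // Manual smoke test: fetch the results page for one keyword and print the parsed fields.
    // WebHttpClient is a helper from the same project and is not shown in this post.
    public static void main(String[] args) {
        // "火鍋" means hot pot.
        String data = WebHttpClient.getBebContentByURL(DIANPING_URL,"火鍋", true, "");
        getLocationCity(data);
        List<String> datalist = getItems(data);
        for (String item : datalist) {
            getElementVal(item);
        }
    }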

}
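
The class above extracts every field the same way: find a start marker, find an end marker, and take the substring in between. The sketch below isolates that pattern so it can be tried without a network call. The class name MarkerExtract, its three helper methods, and the sample HTML are all hypothetical and are not part of the original project; it is only one way the guards against missing markers could look.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating marker-based extraction; not from the original project.
final class MarkerExtract {

    // Returns the text between the first `start` marker and the next `end` marker after it,
    // or null when either marker is missing.
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = text.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return text.substring(from, to);
    }

    // Cuts `text` into chunks that each end with `delimiter`, keeping at most `limit` chunks.
    // Trailing text that is not followed by the delimiter is dropped.
    static List<String> splitKeeping(String text, String delimiter, int limit) {
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (chunks.size() < limit) {
            int hit = text.indexOf(delimiter, pos);
            if (hit < 0) {
                break;
            }
            int stop = hit + delimiter.length();
            chunks.add(text.substring(pos, stop));
            pos = stop;
        }
        return chunks;
    }

    // Parses a price, returning `fallback` for null or non-numeric input.
    static double parsePrice(String raw, double fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Made-up fragment that only imitates the markers used above (\uFFE5 is the full-width yen sign).
        String html = "<h4>Demo Hotpot</h4><b>\uFFE588</b>nearby</a>"
                    + "<h4>Second Store</h4>nearby</a>";
        for (String item : splitKeeping(html, "nearby</a>", 10)) {
            String name = between(item, "<h4>", "</h4>");
            double price = parsePrice(between(item, "<b>\uFFE5", "</b>"), 0.0);
            System.out.println(name + " -> " + price);
        }
    }
}
```

Returning null (or a fallback number) for a missing marker lets the caller decide what to do with an incomplete listing, instead of failing on a negative index passed to substring.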
