Java Data Scraping: Scraping Ctrip Hotel Data (Part 2)

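1. First, consider how to fetch hotel information by region. Let's see how Ctrip itself does it.
Open http://hotels.ctrip.com/domestic-city-hotel.html again, click into any region (I use Macau as the example here), then click through to the second page of results. Happily, this turns up the following: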

[Screenshot]

2. Simulate the request and fetch the data

[Screenshot]

From this we can see that it is a POST request, and that the request body carries a large number of parameters.

[Screenshot]
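For readers who have not built a POST body by hand, here is a minimal sketch of how a map of parameters can be turned into an `application/x-www-form-urlencoded` body. The class name `FormBodySketch` and the sample values are my own illustration; this is not the `HttpUtil` helper used in this post, whose internals are not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not the post's HttpUtil).
class FormBodySketch {

    /** Encodes key/value pairs as key1=value1&key2=value2 with percent-encoding. */
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, which makes the output predictable.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("cityName", "Macau"); // sample value, assumed
        params.put("hidTestLat", "0|0");
        params.put("page", "2");
        System.out.println(encode(params));
    }
}
```

Note that this helper would encode an already percent-encoded value such as `0%7C0` a second time (to `0%257C0`), so whether a value should be passed raw or pre-encoded depends on what the real HTTP helper does with it.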

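So the first step is to wrap up the parameters. Testing showed that some of them do not need to be sent at all. Here is a method that simulates the request: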

public String getHotelListString(HotelCity city, String page) {
    HashMap<String, String> params = new HashMap<String, String>();
    params.put("__VIEWSTATEGENERATOR", "DB1FBB6D");
    params.put("cityName", city.getCityName());
    // params.put("StartTime", "2016-11-24");
    // params.put("DepTime", "2016-11-25");
    params.put("txtkeyword", "");
    params.put("Resource", "");
    params.put("Room", "");
    params.put("Paymentterm", "");
    params.put("BRev", "");
    params.put("Minstate", "");
    params.put("PromoteType", "");
    params.put("PromoteDate", "");
    params.put("operationtype", "NEWHOTELORDER");
    params.put("PromoteStartDate", "");
    params.put("PromoteEndDate", "");
    params.put("OrderID", "");
    params.put("RoomNum", "");
    params.put("IsOnlyAirHotel", "F");
    params.put("cityId", city.getCityId());
    params.put("cityPY", city.getPinyin());
    // params.put("cityCode", "1853");
    // params.put("cityLat", "22.1946");
    // params.put("cityLng", "113.549");
    params.put("positionArea", "");
    params.put("positionId", "");
    params.put("keyword", "");
    params.put("hotelId", "");
    params.put("htlPageView", "0");
    params.put("hotelType", "F");
    params.put("hasPKGHotel", "F");
    params.put("requestTravelMoney", "F");
    params.put("isusergiftcard", "F");
    params.put("useFG", "F");
    params.put("HotelEquipment", "");
    params.put("priceRange", "-2");
    params.put("hotelBrandId", "");
    params.put("promotion", "F");
    params.put("prepay", "F");
    params.put("IsCanReserve", "F");
    params.put("OrderBy", "99");
    params.put("OrderType", "");
    params.put("k1", "");
    params.put("k2", "");
    params.put("CorpPayType", "");
    params.put("viewType", "");
    // params.put("checkIn", "2016-11-24");
    // params.put("checkOut", "2016-11-25");
    params.put("DealSale", "");
    params.put("ulogin", "");
    params.put("hidTestLat", "0%7C0");
    // params.put("AllHotelIds", "436450%2C371379%2C396332%2C419374%2C345805%2C436553%2C425997%2C436486%2C436478%2C344977%2C5605870%2C344983%2C371396%2C344979%2C2572033%2C699384%2C425795%2C419823%2C2010726%2C5772619%2C1181591%2C2005951%2C345811%2C371381%2C371377");// TODO
    params.put("psid", "");
    params.put("HideIsNoneLogin", "T");
    params.put("isfromlist", "T");
    params.put("ubt_price_key", "htl_search_result_promotion");
    params.put("showwindow", "");
    params.put("defaultcoupon", "");
    params.put("isHuaZhu", "False");
    params.put("hotelPriceLow", "");
    params.put("htlFrom", "hotellist");
    params.put("unBookHotelTraceCode", "");
    params.put("showTipFlg", "");
    // params.put("hotelIds", "436450_1_1,371379_2_1,396332_3_1,419374_4_1,345805_5_1,436553_6_1,425997_7_1,436486_8_1,436478_9_1,344977_10_1,5605870_11_1,344983_12_1,371396_13_1,344979_14_1,2572033_15_1,699384_16_1,425795_17_1,419823_18_1,2010726_19_1,5772619_20_1,1181591_21_1,2005951_22_1,345811_23_1,371381_24_1,371377_25_1");// TODO
    params.put("markType", "1");
    params.put("zone", "");
    params.put("location", "");
    params.put("type", "");
    params.put("brand", "");
    params.put("group", "");
    params.put("feature", "");
    params.put("equip", "");
    params.put("star", "");
    params.put("sl", "");
    params.put("s", "");
    params.put("l", "");
    params.put("price", "");
    params.put("a", "0");
    params.put("keywordLat", "");
    params.put("keywordLon", "");
    params.put("contrast", "0");
    params.put("page", page);
    params.put("contyped", "0");
    params.put("productcode", "");
    String result = HttpUtil.getInstance().httpPost(hotelUrl, params);
    // The raw response contains escape characters, so converting it straight to JSON fails.
    // Instead, slice out just the pieces we need and reassemble them into a JSON string.
    // Start one character early to include the opening quote of the key.
    String tempHotel = result.substring(result.indexOf("hotelPositionJSON") - 1);
    // Cut just before "biRecord"; subtracting 2 drops the ," so the slice ends at the closing ].
    String hotelArray = tempHotel.substring(0, tempHotel.indexOf("biRecord") - 2);
    String tempTotalCount = result.substring(result.indexOf("hotelAmount") - 1);
    String totalCount = tempTotalCount.substring(0, tempTotalCount.indexOf(","));
    StringBuilder sb = new StringBuilder();
    sb.append("{");
    sb.append(totalCount);
    sb.append(",");
    sb.append(hotelArray);
    sb.append("}");
    // Strip the backslashes that broke the JSON conversion.
    return sb.toString().replace("\\", "");
}

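Note: getHotelListString post-processes the response. It pulls out only the hotelPositionJSON and hotelAmount content, because converting the raw response directly to JSON failed on the escape characters in the data. After processing, the data looks like this: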

{
    "hotelAmount": 11265,
    "hotelPositionJSON": [
        {
            "id": "6297824",
            "name": "北京浣川招待所",
            "lat": "39.918086",
            "lon": "116.427508",
            "url": "/hotel/6297824.html?isFull=F#ctm_ref=hod_sr_map_dl_txt_1",
            "img": "http://pic.c-ctrip.com/hotels110127/hotel_example.jpg",
            "address": "東城區東城區外交部街46號-1。 ( 北京站、建國門地區)",
            "score": "0.0",
            "dpscore": "0",
            "dpcount": "0",
            "star": "hotel_diamond01",
            "stardesc": "攜程使用者評定為1鑽",
            "shortName": "",
            "isSingleRec": "false"
        },
        {
            "id": "6298279",
            "name": "北京懷柔喇叭溝門海燕農家院",
            "lat": "40.956471",
            "lon": "116.513108",
            "url": "/hotel/6298279.html?isFull=F#ctm_ref=hod_sr_map_dl_txt_2",
            "img": "http://pic.c-ctrip.com/hotels110127/hotel_example.jpg",
            "address": "懷柔區黃甸子村。 ( 懷柔風景區)",
            "score": "0.0",
            "dpscore": "0",
            "dpcount": "0",
            "star": "hotel_diamond01",
            "stardesc": "攜程使用者評定為1鑽",
            "shortName": "",
            "isSingleRec": "false"
        }
    ]
}
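The substring slicing described above assumes that the markers `hotelAmount`, `hotelPositionJSON` and `biRecord` are always present; if Ctrip returns an error page, `indexOf` returns -1 and the method fails with a confusing index exception. Below is a sketch of a more defensive variant of the same slicing. The class name `HotelJsonSlicer` and the sample response in `main` are made up for illustration; the real response is much larger and its exact layout is not reproduced here.

```java
// Hypothetical, more defensive variant of the slicing logic (illustration only).
class HotelJsonSlicer {

    /** Rebuilds {"hotelAmount":N,"hotelPositionJSON":[...]} from a raw response. */
    static String slice(String raw) {
        int amountKey = raw.indexOf("hotelAmount");
        int listKey = raw.indexOf("hotelPositionJSON");
        // Search for the end marker only after the list key.
        int endKey = raw.indexOf("biRecord", Math.max(listKey, 0));
        if (amountKey < 1 || listKey < 1 || endKey < 2) {
            throw new IllegalArgumentException("Response does not contain the expected markers");
        }
        int amountEnd = raw.indexOf(',', amountKey);
        if (amountEnd < 0) {
            throw new IllegalArgumentException("hotelAmount value is not terminated by a comma");
        }
        // Start one character early to include the opening quote of each key.
        String amount = raw.substring(amountKey - 1, amountEnd);
        // Stop two characters before biRecord to drop the ," that precedes it.
        String list = raw.substring(listKey - 1, endKey - 2);
        // Remove backslashes, as the original method does.
        return ("{" + amount + "," + list + "}").replace("\\", "");
    }

    public static void main(String[] args) {
        // Made-up miniature response, for illustration only.
        String raw = "{\"hotelAmount\":2,\"hotelPositionJSON\":"
                + "[{\"id\":\"1\",\"url\":\"\\/hotel\\/1.html\"}],\"biRecord\":{}}";
        System.out.println(slice(raw));
    }
}
```

Throwing a clear exception here makes it easier to tell a changed page layout apart from a genuine parsing bug when a crawl runs over many cities.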

3. Loop over the city data and query the hotels for each city

long startTime = System.currentTimeMillis();
HotelCitySpider citySpider = new HotelCitySpider();
HotelSpider spider = new HotelSpider();
List<HotelCity> cities = citySpider.getHotelCities();
long getCityTime = System.currentTimeMillis();
System.out.println("Time to fetch cities (ms): " + (getCityTime - startTime));
spider.createTable();
// Fetch and store the hotels of each city in turn.
for (HotelCity hotelCity : cities) {
    spider.saveHotels(hotelCity, spider.getHotelList(hotelCity));
}
long saveHotelTime = System.currentTimeMillis();
// Measure from the end of the city fetch so the two timings do not overlap.
System.out.println("Time to fetch and store hotels (ms): " + (saveHotelTime - getCityTime));

As before, I store the data in a database.
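The getHotelList method is not shown in this post. Since getHotelListString takes a page parameter and the response carries hotelAmount, one plausible approach is to work out the number of pages from the total and request each page in turn. The sketch below shows only that arithmetic. The page size of 25 is an assumption (the commented-out hotelIds sample in the request lists 25 IDs), and `PagingSketch` is a hypothetical name.

```java
// Hypothetical helper for working out how many result pages to request.
class PagingSketch {

    /** Ceiling division: pages needed to cover totalCount items at pageSize per page. */
    static int pageCount(int totalCount, int pageSize) {
        if (totalCount < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalCount must be >= 0 and pageSize > 0");
        }
        return (totalCount + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        int assumedPageSize = 25; // assumption, see the note above
        // 11265 is the hotelAmount from the sample data earlier in this post.
        System.out.println(pageCount(11265, assumedPageSize)); // prints 451
    }
}
```

Each page number from 1 to the computed count could then be passed, as a string, to getHotelListString.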

4. Batch-inserting the data

    /**
     * Saves the hotel list of one city, one INSERT per hotel.
     * @param city   the city the hotels belong to
     * @param hotels the hotels to save
     */
    public void saveHotels(HotelCity city, List<Hotel> hotels) {
        for (Hotel hotel : hotels) {
            // Note: values are concatenated straight into the SQL text, so a single
            // quote in a name or address makes the INSERT for that row fail.
            StringBuilder insertSql = new StringBuilder();
            insertSql.append("insert into ctrip_hotel "
                    + "(hotel_id, city_id, city_name, name, lat, lon, url, img, address, score, dpscore, dpcount, star, stardesc, shortName, isSingleRec) values (");
            insertSql.append("'" + hotel.getId() + "'");
            insertSql.append(", " + city.getCityId());
            insertSql.append(", '" + city.getCityName() + "'");
            insertSql.append(", '" + hotel.getName() + "'");
            insertSql.append(", " + hotel.getLat());
            insertSql.append(", " + hotel.getLon());
            insertSql.append(", '" + hotel.getUrl() + "'");
            insertSql.append(", '" + hotel.getImg() + "'");
            insertSql.append(", '" + hotel.getAddress() + "'");
            insertSql.append(", " + hotel.getScore());
            insertSql.append(", " + hotel.getDpscore());
            insertSql.append(", " + hotel.getDpcount());
            insertSql.append(", '" + hotel.getStar() + "'");
            insertSql.append(", '" + hotel.getStardesc() + "'");
            insertSql.append(", '" + hotel.getShortName() + "'");
            insertSql.append(", " + hotel.getIsSingleRec() + ")");
            // try-with-resources closes the statement after each insert.
            try (PreparedStatement statement = conn.prepareStatement(insertSql.toString())) {
                statement.execute();
            } catch (Exception e) {
                // A failed row (for example a duplicate hotel_id) is reported and skipped.
                System.err.println("Insert failed for hotel " + hotel.getId() + ": " + e.getMessage());
            }
        }
    }

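I originally wanted to insert in batches, but the data contains duplicate hotel_id values, which makes preparedStatement.executeBatch() throw an error, so I still insert the rows one at a time.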
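One way the batch insert might still be made to work is to drop duplicate hotel_id values before calling executeBatch(), and to bind the values as parameters instead of concatenating them into the SQL. The sketch below assumes a cut-down, hypothetical `HotelRow` class with only two columns; it is not the Hotel class from the repository, and I have not run it against the real table.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: de-duplicate by id, then insert in one JDBC batch.
class HotelBatchSketch {

    /** Minimal stand-in for a hotel record (two fields only, for illustration). */
    static final class HotelRow {
        final String id;
        final String name;

        HotelRow(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Keeps the first row seen for each id, preserving the original order. */
    static List<HotelRow> dedupeById(List<HotelRow> rows) {
        Map<String, HotelRow> byId = new LinkedHashMap<>();
        for (HotelRow row : rows) {
            byId.putIfAbsent(row.id, row);
        }
        return new ArrayList<>(byId.values());
    }

    /** Inserts the de-duplicated rows in a single batch using bound parameters. */
    static int[] insertBatch(Connection conn, List<HotelRow> rows) throws SQLException {
        // Only two of the table's columns are used here to keep the sketch short.
        String sql = "insert into ctrip_hotel (hotel_id, name) values (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (HotelRow row : dedupeById(rows)) {
                ps.setString(1, row.id);
                ps.setString(2, row.name);
                ps.addBatch();
            }
            return ps.executeBatch();
        }
    }
}
```

This only removes duplicates within one list. A hotel_id that is already in the table from an earlier batch would still cause a conflict, which would need a database-specific statement to handle (on MySQL, for example, `INSERT IGNORE`).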
Source code on GitHub: https://github.com/jianiuqi/CTripSpider