
On Java crawlers and Python crawlers

Preface

Many people say that if you want to learn data mining, you should start with web crawling. After working on projects large and small, I have found that acquiring the data is a crucial step before any modeling can begin. Here I summarize the basic crawler workflow, in both a Python version and a Java version.

URL requests

The Java code is as follows:

public String call(String url) {
    String content = "";
    BufferedReader in = null;
    try {
        URL realUrl = new URL(url);
        URLConnection connection = realUrl.openConnection();
        connection.connect();
        // the pages crawled here are GBK-encoded; adjust the charset as needed
        in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "gbk"));
        String line;
        while ((line = in.readLine()) != null) {
            content += line + "\n";
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null) {
                in.close();
            }
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    return content;
}

The Python code is as follows:

# coding=utf-8
# Python 2 code: urllib2 and the print statement do not exist in Python 3
import chardet
import urllib2

url = "http://www.baidu.com"
data = urllib2.urlopen(url).read()
charset = chardet.detect(data)   # guess the page encoding from the raw bytes
code = charset['encoding']
content = data.decode(code, 'ignore').encode('utf8')
print content
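The snippet above is Python 2 only. On Python 3, `urllib2` was split into `urllib.request` and `print` became a function; a minimal sketch of the same flow, reading the charset from the HTTP headers instead of `chardet` (the URL is only an example), might look like:

```python
# Python 3 sketch of the same request flow: fetch the raw bytes,
# determine the encoding, then decode to a str.  Falls back to
# UTF-8 when the server does not declare a charset.
import urllib.request

url = "http://www.baidu.com"
with urllib.request.urlopen(url) as resp:
    encoding = resp.headers.get_content_charset() or "utf-8"
    content = resp.read().decode(encoding, errors="ignore")
print(content)
```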

Regular expressions

The Java code is as follows:

public String call(String content) throws Exception {
    // match every content":"..." fragment in the page source
    Pattern p = Pattern.compile("content\":\".*?\"");
    Matcher match = p.matcher(content);
    StringBuilder sb = new StringBuilder();
    String tmp;
    while (match.find()) {
        tmp = match.group();
        tmp = tmp.replaceAll("\"", "");     // drop the quotes
        tmp = tmp.replace("content:", "");  // drop the key name
        tmp = tmp.replaceAll("<.*>", "");   // drop embedded HTML tags
        sb.append(tmp + "\n");
    }
    return sb.toString();
}

The Python code is as follows:

import re
pattern = re.compile(regex)      # `regex` is your pattern string
group = pattern.findall(string)  # `string` is the text to search
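To make this concrete, the Java extraction above can be reproduced with those two calls; a small self-contained sketch (the JSON-like `page` string is invented for illustration):

```python
# Python counterpart of the Java extraction: pull every content":"..."
# fragment out of a JSON-like page, then strip the quotes, the key
# name, and any embedded tags.  Unlike the Java version's greedy
# "<.*>", the non-greedy "<.*?>" keeps the text between tags.
import re

page = '{"content":"first comment","content":"second <b>bold</b> comment"}'

pattern = re.compile(r'content":".*?"')
comments = []
for fragment in pattern.findall(page):
    tmp = fragment.replace('"', "")     # drop the quotes
    tmp = tmp.replace("content:", "")   # drop the key name
    tmp = re.sub(r"<.*?>", "", tmp)     # drop embedded tags
    comments.append(tmp)

print(comments)  # → ['first comment', 'second bold comment']
```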