Java Crawler Tutorial (Part 1: Crawling Page URLs)
阿新 • Published: 2018-11-02
Import the required library. If you are using Maven, add the dependency (it is only needed if you enable the commented-out proxy configuration below, which uses Commons HttpClient; the rest of the code relies solely on `java.net`):
<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>
Java code:
package com.ai.rai.group.system;

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @version 1.0
 * @ClassName RetrivePage
 * @Description Downloads a page and extracts the URLs it contains
 * @Author 74981
 * @Date 2018/10/19 14:32
 */
public class RetrivePage {

    // To go through a proxy server, configure it before opening the connection:
    // static {
    //     // proxy server IP address and port
    //     httpClient.getHostConfiguration().setProxy("10.21.67.39", 8088);
    // }

    public static void downloadPage(String path) {
        URL url;
        URLConnection urlconn;
        BufferedReader br = null;
        PrintWriter pw = null;
        // URL matching pattern
        String regex = "https://[\\w+\\.?/?]+\\.[A-Za-z]+";
        Pattern p = Pattern.compile(regex);
        try {
            url = new URL(path); // the page to crawl
            urlconn = url.openConnection();
            // write the extracted links to the file D:/SiteURL.txt
            pw = new PrintWriter(new FileWriter("D:/SiteURL.txt"), true);
            br = new BufferedReader(new InputStreamReader(urlconn.getInputStream()));
            String buf;
            while ((buf = br.readLine()) != null) {
                Matcher buf_m = p.matcher(buf);
                while (buf_m.find()) {
                    pw.println(buf_m.group());
                }
            }
            System.out.println("Crawl finished ^_^");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // guard against NullPointerException if the connection failed
            // before the streams were opened
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (pw != null) {
                pw.close();
            }
        }
    }

    /**
     * Test entry point: crawls this blog's home page and writes out the URLs.
     */
    public static void main(String[] args) {
        try {
            RetrivePage.downloadPage("https://blog.csdn.net/SELECT_BIN");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
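It helps to see what the URL pattern above actually matches in isolation. The following is a minimal sketch (the class name `RegexDemo`, the `extract` helper, and the sample HTML line are my own illustrations, not part of the original post); note that the pattern's trailing `\.[A-Za-z]+` requirement tends to cut a match off at the last dot-plus-letters segment, so path suffixes after the domain may be dropped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    // Same pattern as in RetrivePage above
    static final String REGEX = "https://[\\w+\\.?/?]+\\.[A-Za-z]+";

    // Collect every match of the pattern found in one line of text
    static List<String> extract(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(line);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://blog.csdn.net/SELECT_BIN\">blog</a>";
        // The match stops at ".net" because the regex must end
        // with a dot followed by letters
        System.out.println(extract(html)); // prints [https://blog.csdn.net]
    }
}
```

A stricter pattern (or an HTML parser such as jsoup) would be needed to capture full URLs including their paths; the sketch only demonstrates the behavior of the pattern as written.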
Console output:

Output file: