Crawling Images from Web Pages with a Java Program

Axin • Published: 2018-12-18

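I recently needed to collect some images from the web, so I wrote a program to crawl them. I am a beginner, so if anything here falls short, pointers from more experienced readers are welcome.

The source code is as follows: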

package com.sysh.ssm.service;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author:
 * @Date:2018/5/22
 */
public class SpiderPicturesFromBaiduByWord {


    public static void main(String[] args) throws Exception {
        String downloadPath = "C:\\sunhoo\\image";

        System.out.println("Enter search keywords (separate multiple keywords with a space, ',' or '、'):");
        Scanner scanner = new Scanner(System.in);
        String word = scanner.nextLine();
        System.out.println("Enter the number of pages to download (1 means one page; a page holds 30 images):");
        int pageSize = scanner.nextInt();
        List<String> list = nameList(word);
        getPictures(list, pageSize, downloadPath); // pageSize = 1 downloads one page, usually 30 images
    }


    // keywordList holds the search keywords; max is the number of pages to crawl
    public static void getPictures(List<String> keywordList, int max, String downloadPath) throws Exception {
        String finalURL = "";
        String tempPath = "";
        // Images per page; kept equal to the rn=30 parameter in the request URL so pages do not overlap
        int pagenumber = 30;
        for(String keyword : keywordList){
            tempPath = downloadPath;
            if(!tempPath.endsWith("\\")){
                tempPath = downloadPath+"\\";
            }
            tempPath = tempPath+keyword+"\\";
            File f = new File(tempPath);
            if(!f.exists()){
                f.mkdirs();
            }
            int picCount = 1;
            for (int page = 0; page < max; page++) { // pages are 0-based, so stop before max
                sop("Downloading page " + (page + 1));
                Document document = null;
                try {
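                    // URL-encode the keyword so spaces and non-ASCII characters survive in the query string
                    String url = "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=" + java.net.URLEncoder.encode(keyword, "UTF-8") + "&cg=star&pn=" + page * pagenumber + "&rn=30&itg=0&z=0&fr=&width=&height=&lm=-1&ic=0&s=0&st=-1&gsm=" + Integer.toHexString(page * pagenumber);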
                    //String url ="https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1540974009530_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%8D%8E%E5%B1%B1"+Integer.toHexString(page*30);
                    sop(url);
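                    document = Jsoup.connect(url).data("query", "Java") // extra request parameter
                            .ignoreContentType(true) // accept the response even if it is served as JSON rather than HTML
                            .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)") // set the User-Agent header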
                            .timeout(5000)
                            .get();
                    String xmlSource = document.toString();
                    xmlSource = StringEscapeUtils.unescapeHtml3(xmlSource);
                    sop("Page length: " + xmlSource.length());
                    // Match every objURL entry that points at a .jpg image
                    String reg = "objURL\":\"http://.+?\\.jpg";
                    Pattern pattern = Pattern.compile(reg);
                    Matcher m = pattern.matcher(xmlSource);
                    while (m.find()) {
                        finalURL = m.group().substring(9); // strip the 9-character objURL":" prefix
                        sop(keyword + (picCount++) + ":" + finalURL);
                        download(finalURL, tempPath);
                        sop("Download finished");
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        sop("All downloads complete");
        delMultyFile(downloadPath);
        sop("All empty image files have been deleted");
    }
    // Recursively delete zero-length files left behind by failed downloads
    public static void delMultyFile(String path) {
        File file = new File(path);
        if (!file.exists()) {
            throw new RuntimeException("File \"" + path + "\" not found when executing delMultyFile()");
        }
        File[] fileList = file.listFiles();
        if (fileList == null) { // path is not a directory, or an I/O error occurred
            return;
        }
        for (File f : fileList) {
            if (f.isDirectory()) {
                delMultyFile(f.getAbsolutePath());
            } else if (f.length() == 0) {
                sop(f.delete() + "---" + f.getName());
            }
        }
    }
    // Split the input on ASCII or full-width commas, '、', or whitespace; blank entries are dropped
    public static List<String> nameList(String nameList) {
        List<String> arr = new ArrayList<String>();
        for (String s : nameList.trim().split("[,，、\\s]+")) {
            if (!s.isEmpty()) {
                arr.add(s);
            }
        }
        return arr;
    }
    public static void sop(Object obj){
        System.out.println(obj);
    }
    // Download the image at the given network address into the directory given by path
    public static void download(String url, String path) {
        if (!url.startsWith("http")) {
            return;
        }
        File dirFile = new File(path);
        if (!dirFile.exists() && !dirFile.mkdirs()) {
            sop("Could not create directory \"" + path + "\"");
            return;
        }
        String downloadName = url.substring(url.lastIndexOf("/") + 1);
        HttpURLConnection httpCon = null;
        try {
            httpCon = (HttpURLConnection) new URL(url).openConnection();
            // Open the network stream first so a failed request does not leave an empty file behind;
            // try-with-resources closes both streams even when the copy fails
            try (InputStream in = httpCon.getInputStream();
                 FileOutputStream fos = new FileOutputStream(new File(dirFile, downloadName))) {
                byte[] buffer = new byte[1024];
                int num;
                while ((num = in.read(buffer)) != -1) {
                    fos.write(buffer, 0, num); // write the whole chunk at once
                }
            }
        } catch (FileNotFoundException notFoundE) {
            sop("Image not found: " + url);
        } catch (IOException ioE) {
            sop("I/O error while downloading " + url);
        } finally {
            if (httpCon != null) {
                httpCon.disconnect();
            }
        }
    }

}
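
The URL-extraction step inside getPictures can be tried on its own, without hitting the network. The following is only a sketch under assumptions: the class name ObjUrlExtractor, the method extract, and the sample response string are made up for illustration and are not part of the original program. It uses a capturing group instead of substring(9), and it also accepts https links and .jpeg/.png files, which the original pattern skips.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program
public class ObjUrlExtractor {

    // Group 1 captures the image URL that follows "objURL":" up to the closing quote
    private static final Pattern OBJ_URL =
            Pattern.compile("\"objURL\":\"(https?://[^\"]+?\\.(?:jpg|jpeg|png))\"");

    // Return every image URL found in the given response text, in order of appearance
    public static List<String> extract(String responseText) {
        List<String> urls = new ArrayList<String>();
        Matcher m = OBJ_URL.matcher(responseText);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up fragment that imitates the shape of the search response
        String sample = "{\"imgs\":[{\"objURL\":\"http://example.com/a.jpg\"},"
                + "{\"objURL\":\"https://example.com/b.png\"}]}";
        for (String url : extract(sample)) {
            System.out.println(url);
        }
    }
}
```

Anchoring the pattern on the closing quote keeps the match from running past the end of the URL, at the cost of skipping links that carry a query string after the file extension.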
