Java crawler: scraping Kugou singer data

阿新 • Published: 2019-01-17

Notes written down so I don't forget.

Libraries:

jsoup-1.4.1 (HTML parsing)

httpcore-4.0.1_1

httpclient-4.0.1

Code:

Queue of already-visited URLs

import java.util.HashSet;

// Set of links that have already been visited
public class VisitedUrlQueue {
    public static HashSet<String> visitedUrlQueue = new HashSet<String>();

    // Record a URL as visited
    public synchronized static void addElem(String url) {
        visitedUrlQueue.add(url);
    }

    // Check whether a URL has already been visited
    public synchronized static boolean isContains(String url) {
        return visitedUrlQueue.contains(url);
    }

    public synchronized static int size() {
        return visitedUrlQueue.size();
    }
}
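
Calling isContains and then addElem as two separate steps leaves a gap in which two threads can both decide that a URL is new. Below is a minimal sketch of an atomic "check and mark" alternative, assuming the crawler is run from several threads (the post does not say whether it is). The names VisitedSetSketch and markVisited are hypothetical and are not part of the original code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical alternative to VisitedUrlQueue: one atomic operation
// both tests for membership and records the URL.
final class VisitedSetSketch {
    // Concurrent set backed by a ConcurrentHashMap
    private static final Set<String> VISITED = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller that marks this URL. */
    static boolean markVisited(String url) {
        return VISITED.add(url);
    }

    static int size() {
        return VISITED.size();
    }

    public static void main(String[] args) {
        System.out.println(markVisited("http://www.kuwo.cn/album/1")); // first call
        System.out.println(markVisited("http://www.kuwo.cn/album/1")); // repeat call
    }
}
```

Set.add returns false when the element is already present, so a caller can skip a page as soon as markVisited returns false.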

Queue of URLs not yet visited

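import java.util.LinkedList;

// Queue of URLs that have not been visited yet
public class UrlQueue {
    /** Queue of hyperlinks waiting to be fetched */
    public static LinkedList<String> urlQueue = new LinkedList<String>();

    /** Maximum number of hyperlinks the queue is meant to hold (not enforced here) */
    public static final int MAX_SIZE = 10000;

    public synchronized static void addElem(String url) {
        urlQueue.add(url);
    }

    // Throws NoSuchElementException when the queue is empty; call isEmpty() first
    public synchronized static String outElem() {
        return urlQueue.removeFirst();
    }

    public synchronized static boolean isEmpty() {
        return urlQueue.isEmpty();
    }

    // synchronized like the other accessors, since LinkedList is not thread-safe
    public synchronized static int size() {
        return urlQueue.size();
    }

    public synchronized static boolean isContains(String url) {
        return urlQueue.contains(url);
    }
}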
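
The post does not show the loop that drives these two queues. Below is a minimal sketch of how a pending queue and a visited set usually work together. It assumes a breadth-first crawl and uses an in-memory map in place of real HTTP requests so that it runs on its own. CrawlLoopSketch, crawl, site and maxPages are hypothetical names, not part of the original code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "site" map stands in for downloading and parsing pages.
final class CrawlLoopSketch {

    /** Visits pages breadth-first from seed and returns them in visit order. */
    static List<String> crawl(String seed, Map<String, List<String>> site, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // plays the role of UrlQueue
        Set<String> visited = new HashSet<>();        // plays the role of VisitedUrlQueue
        List<String> order = new ArrayList<>();

        frontier.add(seed);
        while (!frontier.isEmpty() && order.size() < maxPages) {
            String url = frontier.removeFirst();
            if (!visited.add(url)) {
                continue; // already fetched earlier
            }
            order.add(url);
            // Enqueue only links that are neither visited nor already pending
            for (String link : site.getOrDefault(url, List.of())) {
                if (!visited.contains(link) && !frontier.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/album/1", List.of("/album/2", "/album/3"),
            "/album/2", List.of("/album/1", "/album/3"),
            "/album/3", List.of());
        System.out.println(crawl("/album/1", site, 10));
    }
}
```

In the real crawler the lookup in site would be replaced by downloading the page and extracting its links, and maxPages plays a role similar to the MAX_SIZE cap.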

Fetching a page's HTML from its URL

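import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class DownloadPage {

    // Returns the page body as text, or null if the request failed
    public static String getContentFormUrl(String url) throws Exception {
        HttpClient client = new DefaultHttpClient();
        HttpGet getHttp = new HttpGet(url);

        String content = null;

        HttpResponse response;
        try {
            /* Execute the request and get the response entity */
            response = client.execute(getHttp);
            HttpEntity entity = response.getEntity();

            // Mark this URL as visited
            VisitedUrlQueue.addElem(url);

            if (entity != null) {
                /* Convert the entity to text; fall back to UTF-8 when the response declares no charset */
                content = EntityUtils.toString(entity, "UTF-8");
            }

        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the connection resources held by this client
            client.getConnectionManager().shutdown();
        }

        return content;
    }

}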

Parsing the page

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseOfPage {
    /**
     * Collect the hyperlinks found in the page source and enqueue the new ones.
     * @throws Exception
     */
    public static void getHrefOfContent(String content) throws Exception {
        Document doc = Jsoup.parse(content);
        for (Element e : doc.getElementsByTag("a")) {
            String linkHref = e.attr("href");
            if (linkHref.startsWith("/album")) { // relative album link
                linkHref = "http://www.kuwo.cn" + linkHref; // make it absolute
            }
            if (linkHref.startsWith("http://www.kuwo.cn/album")) { // keep album links only
                // Encode spaces first so the duplicate check sees the same string that is stored
                String urlNew = linkHref.replace(" ", "%20");
                if (!UrlQueue.isContains(urlNew)
                        && !VisitedUrlQueue.isContains(urlNew)) {
                    //System.out.println(urlNew);
                    UrlQueue.addElem(urlNew);
                }
            }
        }
    }

    // Custom parsing of the singer page.
    // SingerPo is the author's own data class and is not shown in this post.
    public static void getDataOfContentForSinger(String content) throws Exception {
        SingerPo po = new SingerPo();

        Document doc = Jsoup.parse(content);
        for (Element e : doc.getElementsByClass("artistTop")) {
            po.setPhotourl(e.childNode(1).attr("data-src")); // set the photo URL
        }
    }
}
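
The link handling in getHrefOfContent builds absolute URLs by string concatenation. As an alternative, here is a minimal sketch that uses the JDK's java.net.URI to resolve relative links. It assumes the same base address and album filter as the code above. LinkNormalizerSketch, normalize, BASE and ALBUM_PREFIX are hypothetical names, and this is a swapped-in technique rather than what the post itself does.

```java
import java.net.URI;

// Hypothetical helper: resolves an href against the site root and
// keeps only album links, mirroring the filter in getHrefOfContent.
final class LinkNormalizerSketch {
    static final URI BASE = URI.create("http://www.kuwo.cn/");
    static final String ALBUM_PREFIX = "http://www.kuwo.cn/album";

    /** Returns the absolute album URL for href, or null if it is filtered out. */
    static String normalize(String href) {
        // Encode spaces first, as the original code does, so the URI parses
        String cleaned = href.trim().replace(" ", "%20");
        URI resolved;
        try {
            resolved = BASE.resolve(cleaned);
        } catch (IllegalArgumentException ex) {
            return null; // not a syntactically valid URI
        }
        String absolute = resolved.toString();
        return absolute.startsWith(ALBUM_PREFIX) ? absolute : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("/album/123"));
        System.out.println(normalize("/artist/5"));
    }
}
```

Resolving against a base URI also covers hrefs that are already absolute, so the two startsWith branches collapse into a single check.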
