
A Small Java Crawler Project Built with WebMagic

1. Environment

Project: Maven project

Database: MySQL

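2. Project overview

The page we want to crawl is https://shimo.im/doc/iKYXMBsZ5x0kui8P

Suppose we need to open this page, scrape every Baidu Cloud movie link it contains, and save the links to a MySQL database.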

3. pom.xml configuration

First create a new Maven project and declare the following dependencies in pom.xml.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.jk</groupId>
<artifactId>shimo</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>

<name>shimo</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>

<properties>
<application.class>com.jk.ShiMoChromeProcessor</application.class>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
</properties>

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>

<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.7.3</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.7.3</version>
</dependency>

<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.0.1</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-chrome-driver</artifactId>
<version>3.0.1</version>
</dependency>

<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-remote-driver</artifactId>
<version>3.0.1</version>
</dependency>

<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version>1.2.1</version>
</dependency>

<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-exec</artifactId>
<version>1.3</version>
</dependency>

<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.6</version>
</dependency>

</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.2</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.jk.ShiMoChromeProcessor</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>

</plugins>
</build>
</project>

 

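4. Download Chrome and ChromeDriver

This project uses Selenium to crawl dynamically rendered pages, which is a common approach today. If you are not familiar with Selenium, it is worth reading up on it first. The line of code below starts a Chrome browser under program control: chromebin is the path to the Chrome executable installed on your machine, chromedriver is the path to the ChromeDriver executable, and userdata is the path to the User Data folder that Chrome creates after installation. Download link: https://pan.baidu.com/s/1NnMdRfEXdwBo-ltpP-J4Sw (extraction code: jqnx)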

WebDriver driver = TestChromeDriver.getChromeDriver(chromebin,chromedriver,userdata);

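ChromeDriver can be placed on any drive, but make a note of its path.

Running the Chrome installer puts the browser on drive C and creates a desktop shortcut; the chromebin and userdata paths can be found from the shortcut's Properties dialog.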
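
A wrong path usually shows up only when the browser fails to start, so it can help to check the three locations up front. The following is a minimal sketch and not part of the original project: the class name `PathCheck`, the method `missing`, and the sample paths are all hypothetical placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not in the original project): reports configured paths that do not exist.
final class PathCheck {

    // Returns every entry of 'paths' that is null or does not point to an existing file or directory.
    static List<String> missing(String... paths) {
        List<String> result = new ArrayList<>();
        for (String p : paths) {
            if (p == null || !Files.exists(Paths.get(p))) {
                result.add(String.valueOf(p));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder values; in the project they would be read from the configuration file.
        String chromebin = "C:\\path\\to\\chrome.exe";
        String chromedriver = "C:\\path\\to\\chromedriver.exe";
        String userdata = "C:\\path\\to\\User Data";

        List<String> bad = missing(chromebin, chromedriver, userdata);
        if (bad.isEmpty()) {
            System.out.println("All configured paths exist");
        } else {
            System.out.println("Missing paths: " + bad);
        }
    }
}
```

Calling such a check before creating the driver turns a vague browser start-up failure into a message that names the offending path.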

5. Put the parameters in the config.properties file

# The three parameters below are used to connect to the database
db_url=jdbc:mysql://localhost:3306/ziyuan?useUnicode=true&characterEncoding=utf-8
db_username=root
db_password=962464

# The three paths described in the previous section
chromebin=C:\\Users\\hasee\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe
chromedriver=G:\\new\\chromedriver\\chromedriver.exe
userdata=C:\\Users\\hasee\\AppData\\Local\\Google\\Chrome\\User Data

# Database table name
db_table=shimo

# URL to crawl
guochan=https://shimo.im/doc/iKYXMBsZ5x0kui8P
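
The doubled backslashes in the Windows paths matter: `java.util.Properties` treats a single backslash as an escape character, so `G:\new` would be read with a line break in place of `\n`. The sketch below illustrates this with an in-memory copy of two of the entries. The class name `ConfigEscapeDemo` is hypothetical, and the fallback value `06` for `runTime` is an assumption based on the 6 a.m. schedule mentioned in the code comments; the file above does not define that key.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class (not in the original project).
final class ConfigEscapeDemo {

    // Parses .properties text with the same rules Properties.load applies to a file.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("cannot parse properties text", e);
        }
        return p;
    }

    public static void main(String[] args) {
        // Java source doubles each backslash again: "\\\\" here is the two characters \\ in the file.
        String fileText = "chromedriver=G:\\\\new\\\\chromedriver\\\\chromedriver.exe\n"
                        + "db_table=shimo\n";
        Properties p = parse(fileText);

        // Prints G:\new\chromedriver\chromedriver.exe
        System.out.println(p.getProperty("chromedriver"));
        // The key is absent, so the assumed fallback 06 is printed
        System.out.println(p.getProperty("runTime", "06"));
    }
}
```

Supplying a default through the two-argument `getProperty` is one way to avoid a `null` value for keys that the code reads but the file does not define.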

6. JavaBean for the database connection settings

public class DataSourceModel {
private String url;
private String username;
private String password;

DataSourceModel(){

}

public String getUrl() {
return url;
}

public void setUrl(String url) {
this.url = url;
}

public String getUsername() {
return username;
}

public void setUsername(String username) {
this.username = username;
}

public String getPassword() {
return password;
}

public void setPassword(String password) {
this.password = password;
}
}

7. JavaBean for the records saved to the database

Create a table in the MySQL database whose columns match the fields of this bean.

public class Shimo {
private String name;
private String url;
private String createtime;
private String updatetime;
private String path;
private String rengong;
private String type;

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getName() {
return name;
}

public void setName(String name) {
this.name = name;
}

public String getUrl() {
return url;
}

public void setUrl(String url) {
this.url = url;
}

public String getCreatetime() {
return createtime;
}

public void setCreatetime(String createtime) {
this.createtime = createtime;
}

public String getUpdatetime() {
return updatetime;
}

public void setUpdatetime(String updatetime) {
this.updatetime = updatetime;
}

public String getPath() {
return path;
}

public void setPath(String path) {
this.path = path;
}

public String getRengong() {
return rengong;
}

public void setRengong(String rengong) {
this.rengong = rengong;
}
}
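
The article does not show the table definition. As a rough sketch, the statement below creates a table with one column per bean field. The column names come from the bean and from the SQL used later in the project; every type, length, default, and the character set are assumptions to adjust for your data, and the class name `ShimoTableDdl` is hypothetical.

```java
// Hypothetical helper (not in the original project): builds a CREATE TABLE statement for the Shimo bean.
final class ShimoTableDdl {

    // Column names mirror the bean fields; all types and lengths are guesses.
    static String createTableSql(String table) {
        return "CREATE TABLE IF NOT EXISTS " + table + " ("
             + "name VARCHAR(255) NOT NULL, "
             + "url VARCHAR(1024), "
             + "createtime VARCHAR(10), "   // the project stores dates as yyyy-MM-dd text
             + "updatetime VARCHAR(10), "
             + "path VARCHAR(255), "
             + "rengong VARCHAR(1) DEFAULT '0', "
             + "type VARCHAR(255)"
             + ") DEFAULT CHARSET=utf8";
    }

    public static void main(String[] args) {
        // Print the statement so it can be pasted into a MySQL client.
        System.out.println(createTableSql("shimo"));
    }
}
```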

8. The Processor class
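import java.sql.Connection;
import java.sql.SQLException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;

public class ShiMo2ChromeProcessor implements PageProcessor {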

static Properties properties;
static DataSourceModel dataSourceModel;
static String chromebin;
static String chromedriver;
static String userdata;
static String table;
static String runTime;

static String quanji;
static String guochan;
static String oumei;
static String yingdan;
static String dongmanbl;
static String taiguoyuenanyindu;
static String hanguo;
static String riben;

static{
properties=Utils.loadConfig("/config.properties");
dataSourceModel=new DataSourceModel();
dataSourceModel.setUrl(properties.getProperty("db_url"));
dataSourceModel.setUsername(properties.getProperty("db_username"));
dataSourceModel.setPassword(properties.getProperty("db_password"));

chromebin=properties.getProperty("chromebin");
chromedriver=properties.getProperty("chromedriver");
userdata=properties.getProperty("userdata");
table=properties.getProperty("db_table");
runTime=properties.getProperty("runTime");


quanji=properties.getProperty("quanji");
guochan=properties.getProperty("guochan");
oumei=properties.getProperty("oumei");
yingdan=properties.getProperty("yingdan");
dongmanbl=properties.getProperty("dongmanbl");
taiguoyuenanyindu=properties.getProperty("taiguoyuenanyindu");
hanguo=properties.getProperty("hanguo");
riben=properties.getProperty("riben");
}

private String keyWord;



private Site site = Site
.me()
.setCharset("UTF-8")
.setCycleRetryTimes(3)
.setSleepTime(3 * 1000)
.addHeader("Connection", "keep-alive")
.addHeader("Cache-Control", "max-age=0")
.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0");


public ShiMo2ChromeProcessor() {
}

@Override
public Site getSite() {
return site;
}


@Override
public void process(Page page){
WebDriver driver = TestChromeDriver.getChromeDriver(chromebin,chromedriver,userdata);
try {
driver.manage().window().maximize();// maximize the window
driver.get(page.getRequest().getUrl());
Thread.sleep(10000);// wait 10 s for the dynamic content to render
// extract the links from the rendered page
ananyDetail(driver);
} catch (Exception e) {
e.printStackTrace();
} finally {
driver.quit();// always close the browser, whether or not parsing succeeded
}
}




public static void ananyDetail(WebDriver driver) throws Exception{
// the page title is used as the record type
String type=driver.getTitle();

List<WebElement> list=driver.findElements(By.className("gutter-author-6748903"));

for(WebElement webElement:list){
try {
List<WebElement> font=webElement.findElements(By.tagName("font"));
if(font.isEmpty()){
continue;
}
String font1=font.get(0).getText().trim();
// only process rows whose first <font> text starts with "點" but not with "點選"
if(font1.startsWith("點")&&!font1.startsWith("點選")){
// parse the row text
String text= "";
String name= "";
String pwd= "";

try {
text = webElement.getText().replace("☞","").replace("點","");
// if(text!=null){
// text=text.replace(" ","|");
// }
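if(text.contains("密碼")){
// split the text into the name and the password part ("密碼" means "password")
String[] nameAndPwd=text.split("密碼");
name=nameAndPwd[0];
pwd="密碼"+nameAndPwd[nameAndPwd.length-1];

}else{
// no password marker: the whole text is the name
name=text;
pwd="";
}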

} catch (Exception e) {
e.printStackTrace();
}

WebElement aTag= null;
try {
aTag = webElement.findElement(By.tagName("a"));
} catch (Exception e) {
e.printStackTrace();
}

// extract the URL from the <a> tag
String url="";
try {
if(aTag!=null){
url=aTag.getAttribute("href");
}
} catch (Exception e) {
e.printStackTrace();
}
Shimo shimo=new Shimo();
shimo.setPath(driver.getCurrentUrl());
shimo.setName(name.trim());
String prefix="";
if(url.contains("pan.baidu")){
prefix="百度網盤:";
}else{
prefix="連結:";
}

shimo.setUrl(prefix+url.trim()+" "+pwd.trim());
shimo.setType(type);
saveDb(shimo);

}

} catch (Exception e) {
e.printStackTrace();
continue;
}
}
}

public static void saveDb(Shimo shimo){
Connection connection=null;
try {
// save the record to the database
connection=Utils.getConnection(dataSourceModel);
// first check whether a row with this name already exists
SimpleDateFormat sdf=new SimpleDateFormat("yyyy-MM-dd");
String querySql="select count(1) as totalnum from "+table+" where name='#name'";
querySql=querySql.replace("#name",shimo.getName());
int count=Utils.excuteCountQuery(connection,querySql);
if(count<=0){
// no such row yet: insert
String sql="insert into "+table+" (name,url,createtime,path,rengong,type) values ('#name','#url','#createtime','#path','#rengong','#type')";

sql=sql.replace("#name",shimo.getName())
.replace("#url",shimo.getUrl())
.replace("#createtime",sdf.format(new Date()))
.replace("#path",shimo.getPath())
.replace("#type",shimo.getType())
.replace("#rengong","0");
Utils.saveDb(connection,sql);
}else{
// row exists: update it (only rows not marked as manually edited, rengong='0')
String updateSql="update "+table+" set url='#url',updatetime='#updatetime',path='#path',type='#type' where name='#name' and rengong='0'";

updateSql=updateSql.replace("#name",shimo.getName())
.replace("#url",shimo.getUrl())
.replace("#updatetime",sdf.format(new Date()))
.replace("#type",shimo.getType())
.replace("#path",shimo.getPath());
Utils.saveDb(connection,updateSql);
}
} catch (Exception e) {
System.out.println("Failed to save to the database");
e.printStackTrace();
}finally {
if(connection!=null){
try {
connection.close();
} catch (SQLException e) {
e.printStackTrace();
}
}

}
}



public static void main(String[] args){

System.out.println("++++++++System starting...");
Map<String,Boolean> map=new HashMap<>();
while(true){
System.out.println("++++++++System running...");
SimpleDateFormat simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd");
String today=simpleDateFormat.format(new Date());// today's date

SimpleDateFormat sdf=new SimpleDateFormat("HH");
String nowTime=sdf.format(new Date());

// Intended schedule: run when today's crawl has not happened yet and the hour equals runTime (06),
// then set the flag for today to true.
// That check is commented out, so the crawl currently runs on every loop iteration.
//if((map.get(today)==null||map.get(today)==false)&&runTime.equals(nowTime)){
if(true){
map.put(today,Boolean.TRUE);
System.out.println("++++++++Fetching data...");
// start the crawl
Spider spider1=Spider.create(new ShiMo2ChromeProcessor());
spider1.addUrl(guochan)
.setDownloader(new HttpClientDownloader())
.thread(1)
.run();

}
try {
Thread.sleep(600000);// repeat every 10 minutes
} catch (InterruptedException e) {
e.printStackTrace();
}
}

}
}
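
The name and password handling inside ananyDetail can be tried out without a browser. The sketch below is a standalone variant and not the author's code: it uses `indexOf` and `lastIndexOf` instead of `split`, so a text that ends with the marker still yields a sensible password part. The class name `LinkTextParser` and the sample strings are hypothetical.

```java
// Hypothetical standalone variant of the text handling in ananyDetail.
final class LinkTextParser {

    // Returns {name, pwd}; pwd is "" when the text has no "密碼" (password) marker.
    static String[] splitNameAndPassword(String rawText) {
        // Same clean-up as the original code: drop the pointer symbol and every "點" character.
        String text = rawText.replace("☞", "").replace("點", "");
        int first = text.indexOf("密碼");
        if (first < 0) {
            return new String[] { text.trim(), "" };
        }
        // The name is everything before the first marker; the password starts at the last marker.
        String name = text.substring(0, first).trim();
        String pwd = text.substring(text.lastIndexOf("密碼")).trim();
        return new String[] { name, pwd };
    }

    public static void main(String[] args) {
        // Made-up sample rows, shaped like the ones the crawler expects.
        String[] withPwd = splitNameAndPassword("☞點電影A 密碼:ab12");
        System.out.println(withPwd[0] + " | " + withPwd[1]);

        String[] withoutPwd = splitNameAndPassword("點電影B");
        System.out.println(withoutPwd[0] + " | " + withoutPwd[1]);
    }
}
```

Keeping this logic in a small pure function makes it easy to unit-test against sample rows copied from the page.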

9. TestChromeDriver
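import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class TestChromeDriver {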


public static WebDriver getChromeDriver(String chromebin,String chromedriver,String userdata ) {


/* Path to the Chrome executable; if not set, Chrome is looked up in its default install location */
System.setProperty("webdriver.chrome.bin", chromebin);
/* Path to the ChromeDriver executable; if not set, it is looked up on the PATH */
System.setProperty("webdriver.chrome.driver", chromedriver);

ChromeOptions chromeOption=new ChromeOptions();
// also pass the Chrome location to ChromeDriver explicitly through the options
chromeOption.setBinary(chromebin);
chromeOption.addArguments("--user-data-dir="+userdata);
// chromeOption.addArguments("--headless");
chromeOption.addArguments("--no-sandbox");
WebDriver driver = new ChromeDriver(chromeOption);
return driver;
}

}

10. Utility class
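import java.io.IOException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class Utils {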

public static Properties loadConfig(String configFile) {
InputStream input = null;
Properties properties = new Properties();
try {
input = Utils.class.getResourceAsStream(configFile);
properties.load(input);
} catch (Exception e) {
System.out.println("Failed to load the configuration file");
} finally {
if(input != null) {
try {
input.close();
} catch (IOException e) {
e.printStackTrace();
}
}

}
return properties;
}

public static Connection getConnection(DataSourceModel dataSourceModel){
Connection conn=null;
try {
Class.forName("com.mysql.jdbc.Driver");
conn= DriverManager.getConnection(dataSourceModel.getUrl(), dataSourceModel.getUsername(), dataSourceModel.getPassword());
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
return conn;
}


public static void saveDb(Connection connection,String sql){
Statement statement=null;
try {
if(connection!=null){
statement=connection.createStatement();
statement.executeUpdate(sql);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if(statement!=null)
statement.close();
} catch (SQLException e) {
e.printStackTrace();
}
}
}

public static int excuteCountQuery(Connection connection,String sql){
int rowCount=0;
Statement statement=null;
ResultSet resultSet=null;
try {
statement=connection.createStatement();
resultSet=statement.executeQuery(sql);
while(resultSet.next()){
rowCount = resultSet.getInt("totalnum");
}

} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if(resultSet!=null)
resultSet.close();
if(statement!=null)
statement.close();
} catch (SQLException e) {
e.printStackTrace();
}

}
return rowCount;

}

}
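
The project builds its SQL by replacing placeholders such as #name inside a string, so a movie name containing a single quote breaks the statement and the input is open to SQL injection. A common alternative is a JDBC `PreparedStatement`. The sketch below shows the insert case only and is an assumption-laden illustration, not the author's code: the class name `ShimoSql` and its methods are hypothetical, while the column list is taken from the insert statement in the Processor class.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical alternative to the string-replacement SQL used in the project.
final class ShimoSql {

    // Table names cannot be bound as parameters, so only plain identifiers are accepted.
    static String insertSql(String table) {
        if (!table.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        return "insert into " + table
             + " (name,url,createtime,path,rengong,type) values (?,?,?,?,?,?)";
    }

    // Inserts one row; the driver escapes each bound value.
    static int insert(Connection conn, String table, String name, String url,
                      String createtime, String path, String type) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(insertSql(table))) {
            ps.setString(1, name);
            ps.setString(2, url);
            ps.setString(3, createtime);
            ps.setString(4, path);
            ps.setString(5, "0"); // rengong: "0" is the value the project writes on insert
            ps.setString(6, type);
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // Only prints the statement text; running insert(...) needs a live database connection.
        System.out.println(insertSql("shimo"));
    }
}
```

The count query and the update statement could be converted the same way, binding the name in the where clause instead of concatenating it.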