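Java: Using a Regular Expression to Extract the Protocol, Domain Name, and Image Extension from the src Attribute of img Tags
阿新 • Published: 2019-02-18
Regular expression (written as a Java string literal): src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)
Example code: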
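    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ImgSrcDemo { // class name is illustrative
        public static void main(String[] args) {
            // Sample input (URL taken from the results further down); replace with the HTML text to scan
            String url = "<img src=\"http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg\"/>";
            Pattern pattern = Pattern.compile("src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");
            Matcher matcher = pattern.matcher(url);
            while (matcher.find()) {
                System.out.println("-------------------");
                String host = matcher.group(4); // $4: domain name when the src has a protocol or leading //
                // Reassemble the image URL from $2..$7 (everything between the delimiters $1 and $8)
                String imgUrl = matcher.group(2) + matcher.group(3) + matcher.group(4) + matcher.group(5) + matcher.group(6) + "." + matcher.group(7);
                System.out.println(host);
                System.out.println(imgUrl);
            }
        }
    }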
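Breaking down the regular expression: src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)
- “src=”: matches the literal text src= that starts the attribute.
- $1 “(\"|'| |)”: matches what follows src=: a double quote, a single quote, a space, or nothing. For example: src='https://*****.png'; src="https://*****.png"; src=https://*****.png; src= https://*****.png
- $2 “([\\S]{1,}?|[/]{1,})”: [\\S]{1,}? matches the protocol, that is, one or more non-whitespace characters in non-greedy mode; [/]{1,} matches one or more leading slashes. Image URLs are often written in the protocol-relative form //img.xxx.com/imgs/test.png so that they follow the protocol of the source site. This part of the pattern handles http:, https:, ftp:, or / (for a leading // this group captures only a single slash, because the non-greedy first alternative already matches one / and stops there).
- $3 “([/]{1,})”: matches the slashes after the protocol, for example the // in https:// and http://, or the second slash of a leading //.
- $4+$5 “(.+?)([/]{1,})”: for https://img.xxx.com/imgs/test.png, (.+?) matches non-greedily from the slashes after the protocol up to the next /; the text in between is the domain name. $4=img.xxx.com; $5=/
- $6+$7 “(.+?)\\.(png|jpg|jpeg)”: for https://img.xxx.com/imgs/test.png, $6=imgs/test; $7=png. \\. matches a literal dot; \\ is the escape sequence.
- $8 “(\"|'| |/>)”: matches the end of the src attribute. It plays the same role as $1 and accepts a double quote, a single quote, a space, or /> as the terminator.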
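The breakdown above can be checked by printing every capture group for a given tag. The following is a minimal sketch, not the author's original code: the class name SrcGroupDemo, the helper groups, and the two sample tags are assumptions made for illustration (the URLs reuse the img.xxx.com placeholder from the list).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcGroupDemo {
    // Same pattern as in the article, written as a Java string literal.
    static final Pattern IMG_SRC = Pattern.compile(
        "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Hypothetical helper: returns groups $1..$8 of the first match,
    // or an empty list when the pattern does not match.
    static List<String> groups(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative inputs: one absolute URL, one protocol-relative URL.
        String[] samples = {
            "<img src=\"https://img.xxx.com/imgs/test.png\"/>",
            "<img src='//img.xxx.com/imgs/test.png'/>"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + groups(s));
        }
    }
}
```

For the first sample, $2 should come out as https:, $3 as //, and $4 as img.xxx.com; for the second, $2 and $3 each hold a single slash, which matches the note on $2 above.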
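[Screenshot of the example code]
Output (each block appears to list group(4), the reassembled URL, and then the individual capture groups in order):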
    -------------------
    path1
    host1/path1/name1.jpg
    '
    host1
    /
    path1
    /
    name1
    jpg
    -------------------
    -------------------
    paht2
    host2/paht2/name2.png
    '
    host2
    /
    paht2
    /
    name2
    png
    -------------------
    -------------------
    path3
    host3/path3/name3.png
    host3
    /
    path3
    /
    name3
    png
    -------------------
    -------------------
    imgsa.baidu.com
    //imgsa.baidu.com/exp/w=480/sign=306a19aebb3533faf5b6922698d2fdca/1ad5ad6eddc451daa8799c4bbcfd5266d1163286.jpg
    "
    /
    /
    imgsa.baidu.com
    /
    exp/w=480/sign=306a19aebb3533faf5b6922698d2fdca/1ad5ad6eddc451daa8799c4bbcfd5266d1163286
    jpg
    -------------------
    -------------------
    static.228.cn
    http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg
    "
    http:
    //
    static.228.cn
    /
    upload/Image/201705/1496220590906_8212_x
    jpg
    -------------------
    -------------------
    static.228.cn
    //static.228.cn/upload/Image/201705/1496220556164_5314_x.jpg
    "
    /
    /
    static.228.cn
    /
    upload/Image/201705/1496220556164_5314_x
    jpg
    -------------------
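In the first three results the src has no protocol and no leading slashes, so the host (host1, host2, host3) ends up in $2 and group(4) holds the first path segment instead. One possible way to pick the host in both situations is sketched below. It is an illustration rather than part of the original code: the class SrcHostDemo and the helper hostOf are hypothetical, and the sketch assumes every src is either an absolute URL, a protocol-relative URL, or of the form host/path/name.ext as in the results above (a site-relative path such as /imgs/a/test.png would be misread).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcHostDemo {
    // Same pattern as in the article, written as a Java string literal.
    static final Pattern IMG_SRC = Pattern.compile(
        "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Hypothetical helper: returns the host of the first matching src,
    // or null when the pattern does not match.
    static String hostOf(String html) {
        Matcher m = IMG_SRC.matcher(html);
        if (!m.find()) {
            return null;
        }
        String g2 = m.group(2);
        // $2 is a scheme such as "http:" or the first slash of "//": the host is in $4.
        boolean hasPrefix = g2.endsWith(":") || g2.equals("/");
        // Otherwise $2 itself is treated as the host (an assumption, see the text above).
        return hasPrefix ? m.group(4) : g2;
    }

    public static void main(String[] args) {
        // Inputs modelled on the results listed above.
        String[] samples = {
            "<img src='host1/path1/name1.jpg'>",
            "<img src=\"//static.228.cn/upload/Image/201705/1496220556164_5314_x.jpg\">",
            "<img src=\"http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg\">"
        };
        for (String s : samples) {
            System.out.println(hostOf(s));
        }
    }
}
```

The check on $2 relies on the behaviour visible in the results: with a scheme, $2 ends in a colon; with a leading //, $2 is a single slash.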