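JAVA - Using a regular expression to extract the request protocol, domain name, and image extension from the src attribute of img tags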

Regular expression (shown with Java string-literal escaping): src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)

Sample code:

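    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative wrapper: the original snippet showed only the method body.
    public class ImgSrcExtractor {
        // url holds the text to scan, for example HTML that contains img tags.
        static void printImageUrls(String url) {
            Pattern pattern = Pattern.compile("src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");
            Matcher matcher = pattern.matcher(url);
            while (matcher.find()) {
                System.out.println("-------------------");
                // Group 4 is the domain when the URL has a protocol or starts with //.
                String host = matcher.group(4);
                // Reassemble the URL from groups 2 to 7, leaving out the surrounding quotes.
                String imgUrl = matcher.group(2) + matcher.group(3) + matcher.group(4) + matcher.group(5) + matcher.group(6) + "." + matcher.group(7);
                System.out.println(host);
                System.out.println(imgUrl);
            }
        }
    }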

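Regular expression breakdown: src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)

  1. “src=”: matches the literal text src= at the start of the attribute.
  2. $1 “(\"|'| |)”: matches a double quote, a single quote, a space, or nothing after src=. Examples: src='https://*****.png'; src="https://*****.png"; src=https://*****.png; src= https://*****.png
  3. $2 “([\\S]{1,}?|[/]{1,})”: [\\S]{1,}? matches the protocol: one or more non-whitespace characters, non-greedy. [/]{1,} is meant to match a URL that starts with one or more slashes. To follow the protocol of the origin site, image URLs often use the protocol-relative form //img.xxx.com/imgs/test.png. This part of the pattern handles http:, https:, ftp:, or / (for a URL that starts with slashes, $2 captures only a single slash and the rest go to $3). Because \\S also matches /, the first alternative already covers the slash case.
  4. $3 “([/]{1,})”: matches the slash or slashes after the protocol, as in https://, http://, and //.
  5. $4+$5 “(.+?)([/]{1,})”: for https://img.xxx.com/imgs/test.png, (.+?) matches from the end of the protocol slashes up to the next /. The text in between is the domain, so $4=img.xxx.com and $5=/
  6. $6+$7 “(.+?)\\.(png|jpg|jpeg)”: for https://img.xxx.com/imgs/test.png, $6=imgs/test and $7=png. \\. matches a literal dot; the doubled backslash is how a Java string literal writes the single backslash of the regex escape.
  7. $8 “(\"|'| |/>)”: matches the end of the src attribute and mirrors $1: a ", a ', a space, or />.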
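
The breakdown above can be checked against one concrete tag. The following is a minimal sketch rather than the author's code: the class name `ImgSrcGroupsDemo` and the helper `groupsOf` are hypothetical, and the sample tag is an assumption that reuses one of the URLs appearing in this post's output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcGroupsDemo {
    // Same pattern as in the post, written as a Java string literal.
    private static final Pattern IMG_SRC = Pattern.compile(
            "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Hypothetical helper: returns groups 1..8 of the first match,
    // or an empty list when the pattern does not match.
    public static List<String> groupsOf(String html) {
        List<String> groups = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed sample input; the surrounding img tag is illustrative.
        String tag = "<img src=\"http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg\" />";
        List<String> g = groupsOf(tag);
        for (int i = 0; i < g.size(); i++) {
            System.out.println("$" + (i + 1) + " = " + g.get(i));
        }
    }
}
```

For this input, $2 is `http:`, $3 is `//`, $4 is `static.228.cn`, and $7 is `jpg`, which lines up with the numbered explanation.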
     

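Output (this run appears to print groups 1 to 7 individually after host and imgUrl, followed by a closing separator):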

    -------------------
    path1
    host1/path1/name1.jpg
    '
    host1
    /
    path1
    /
    name1
    jpg
    -------------------
    -------------------
    paht2
    host2/paht2/name2.png
    '
    host2
    /
    paht2
    /
    name2
    png
    -------------------
    -------------------
    path3
    host3/path3/name3.png
    host3
    /
    path3
    /
    name3
    png
    -------------------
    -------------------
    imgsa.baidu.com
    //imgsa.baidu.com/exp/w=480/sign=306a19aebb3533faf5b6922698d2fdca/1ad5ad6eddc451daa8799c4bbcfd5266d1163286.jpg
    "
    /
    /
    imgsa.baidu.com
    /
    exp/w=480/sign=306a19aebb3533faf5b6922698d2fdca/1ad5ad6eddc451daa8799c4bbcfd5266d1163286
    jpg
    -------------------
    -------------------
    static.228.cn
    http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg
    "
    http:
    //
    static.228.cn
    /
    upload/Image/201705/1496220590906_8212_x
    jpg
    -------------------
    -------------------
    static.228.cn
    //static.228.cn/upload/Image/201705/1496220556164_5314_x.jpg
    "
    /
    /
    static.228.cn
    /
    upload/Image/201705/1496220556164_5314_x
    jpg
    -------------------
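
The output above has more lines per match than the sample code prints. Here is a sketch of a loop that would produce lines of this shape, assuming the run printed host, imgUrl, and then groups 1 through 7 between two separators. It is a reconstruction, not the author's code: the class `GroupDump`, the method `dump`, and the input tag are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDump {
    private static final String SEPARATOR = "-------------------";

    // Same pattern as in the post, written as a Java string literal.
    private static final Pattern IMG_SRC = Pattern.compile(
            "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Hypothetical helper: collects, for every match, the lines that the
    // output in the post seems to contain.
    public static List<String> dump(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            lines.add(SEPARATOR);
            lines.add(m.group(4)); // "host" as computed by the sample code
            lines.add(m.group(2) + m.group(3) + m.group(4) + m.group(5)
                    + m.group(6) + "." + m.group(7)); // imgUrl
            for (int i = 1; i <= 7; i++) {
                lines.add(m.group(i)); // each capture group on its own line
            }
            lines.add(SEPARATOR);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input: a src value without a protocol.
        dump("<img src='host1/path1/name1.jpg'/>").forEach(System.out::println);
    }
}
```

With a src value that has no protocol, such as the assumed `host1/path1/name1.jpg`, group 2 takes `host1` and group 4 takes `path1`, so the printed "host" is really the first path segment. The domain lands in group 4 only when the URL starts with a protocol or with `//`.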