
Scraping movie links from 電影天堂 (dytt8.net) with a crawler

 

I'm fond of Python, and lately I've been writing Java web code in Eclipse, so I set up the environment with Eclipse + PyDev and gave web scraping a quick try.

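Watching a movie usually means hunting for resources all over the place, so I decided to scrape the movie links myself. After that, a Ctrl+C / Ctrl+V into Xunlei (迅雷) is enough to start watching while the download runs.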

This is for learning and fun only.

On to the main topic:

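Opening 電影天堂 in a browser shows a page that is very easy to parse: its element structure is simple and easy to follow, and scraping it does not require much in the way of verification handling. First download the page source and save it to a txt file, then read that file with Python and extract the links. (The complete, condensed code is at the end of the post.)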
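The download-and-save step is described above but not shown in code. A minimal sketch of it follows, assuming the page is served in a GB-family encoding (which is why the parsing script opens the saved file as gb18030). The helper names `decode_page`, `fetch_page_source`, and `save_page` are hypothetical and not part of the original script.

```python
import urllib.request


def decode_page(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw page bytes, dropping any bytes that do not decode."""
    return raw.decode(encoding, errors="ignore")


def fetch_page_source(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded text (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return decode_page(resp.read())


def save_page(text: str, path: str, encoding: str = "gb18030") -> None:
    """Write the page text to disk in the encoding the parsing script expects."""
    with open(path, "w", encoding=encoding, errors="ignore") as f:
        f.write(text)
```

With these helpers, something like `save_page(fetch_page_source('http://www.dytt8.net/'), r'e:\tiantang.txt')` would produce the file that the script reads. Decoding the same bytes as UTF-8 instead would typically fail or produce garbage, which is the reason for naming the encoding explicitly.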

  

# -*- coding: utf-8 -*-
import io
import sys
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

# Re-wrap stdout as UTF-8 so the Chinese titles print correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Page source saved to disk beforehand; read it as gb18030 and skip undecodable bytes
url = r'e:\tiantang.txt'
with open(url, encoding='gb18030', errors='ignore') as resp:
    s = BeautifulSoup(resp, "lxml")

# Collect the text and href of every <a> tag
mingzitag = s.find_all('a')
# print(mingzitag)
href = []
mingzis = []
for tag in mingzitag:
    mingzis.append(str(tag.text))
    href.append(tag.get('href', ''))  # '' when an <a> tag has no href
# print(href)

# Put names and URLs side by side in one DataFrame
view = pd.DataFrame(mingzis, columns=['NAME'])
view2 = pd.DataFrame(href, columns=['URL'])
view3 = view.join(view2)
print(view3)

Result:

                              NAME                                              URL

0                                                                       /index.html

1                             最新影片   http://www.ygdy8.net/html/gndy/dyzz/index.html

2                             經典影片        http://www.ygdy8.net/html/gndy/index.html

3                             國內電影  http://www.ygdy8.net/html/gndy/china/index.html

4                             歐美電影  

16                            本站首頁                            http://www.dytt8.net/

17                              電影                            /html/gndy/index.html

18                            最新電影                       /html/gndy/dyzz/index.html

19                            日韓電影                      /html/gndy/rihan/index.html

20                            歐美電影                      /html/gndy/oumei/index.html

21                            國內電影                      /html/gndy/china/index.html

22                            綜合電影                       /html/gndy/jddy/index.html

23                  2017年科幻動作《全球風暴              /html/gndy/dyzz/20180103/55959.html

24                  2017年高分恐怖《小丑回魂              /html/gndy/dyzz/20171230/55920.html

25                  2017年驚悚懸疑《雪人/雪              /html/gndy/jddy/20171224/55874.html

26                 2017年喜劇《性別之戰》BD              /html/gndy/jddy/20171222/55861.html

73      2018年劇情動作《反貪風暴3/L風暴》HD國語中字              /html/gndy/dyzz/20181026/57680.html

74    2018年8.0分動畫喜劇《超人總動員2》HD中英雙字幕              /html/gndy/dyzz/20181025/57675.html

75                             [2]                                   list_23_2.html

76                             [3]                                   list_23_3.html

77                             [4]                                   list_23_4.html

85                           電影APP                   https://www.dytt8.net/app.html

86                            下載宣告                                      /index.html

87                            網站地圖                               /plus/sitemap.html

.....

Remove the irrelevant entries:

# Drop the rows that are not movie links
view3 = view3.drop(index=range(1, 23))   # rows 1-22: navigation links at the top
view3 = view3.drop(index=range(75, 88))  # rows 75-87: pagination and footer links
view3.reset_index(drop=True, inplace=True)
#print(view3)

                              NAME                                      URL

0                   2017年科幻動作《全球風暴      /html/gndy/dyzz/20180103/55959.html

1                   2017年高分恐怖《小丑回魂      /html/gndy/dyzz/20171230/55920.html

2                   2017年驚悚懸疑《雪人/雪      /html/gndy/jddy/20171224/55874.html

3                  2017年喜劇《性別之戰》BD      /html/gndy/jddy/20171222/55861.html

4                   2017年甄子丹劉德華動作《      /html/gndy/dyzz/20171218/55832.html

5                   2017年詹妮弗·勞倫斯懸疑      /html/gndy/dyzz/20171208/55749.html

6                   2017年動作喜劇《寶貝特攻      /html/gndy/jddy/20171206/55738.html

7                  2017年8.3高分戰爭《敦刻      /html/gndy/dyzz/20171205/55728.html

8                   2017年高分獲獎劇情《相愛      /html/gndy/dyzz/20171205/55726.html

9                   2017年高分劇情《天才槍手      /html/gndy/dyzz/20171202/55684.html

10                  2017年愛情喜劇《胡楊的夏      /html/gndy/jddy/20171202/55682.html

11                  2017年動作喜劇《王牌特工      /html/gndy/dyzz/20171130/55665.html

12                  2017年動作《英倫對決》國      /html/gndy/dyzz/20171130/55664.html

13                  2017年動作《美國刺客/美      /html/gndy/dyzz/20171123/55603.html

14                  2017年懸疑動作《極寒之城      /html/gndy/dyzz/20171105/55468.html

15                  2011臺灣最新偶像劇《旋風        /html/tv/hytv/20110620/32833.html

16                  2010最新臺灣偶像劇《愛似  /html/tv/gangtai/tw/20101227/30016.html

17                  2010潘瑋柏最新偶像劇《愛  /html/tv/gangtai/tw/20100823/27737.html

18                  2010臺灣熱播偶像劇《鍾無     /html/tv/gangtai/20100726/27236.html

19                  2010熱播偶像劇《呼叫大明  /html/tv/gangtai/tw/20100517/26035.html

20                  2010臺灣偶像劇《就想賴著  /html/tv/gangtai/tw/20100118/24051.html

21                  2009熱播偶像劇《下一站幸  /html/tv/gangtai/tw/20091005/22040.html

22                  2009臺灣八大劇《桃花小妹  /html/tv/gangtai/tw/20091015/22259.html

23         2018年劇情歷史《7月22日》BD中英雙字幕      /html/gndy/dyzz/20181108/57761.html

24         2018年動作喜劇《歐洲攻略》BD國粵雙語中字      /html/gndy/dyzz/20181108/57760.html

25     2018年高分動作《碟中諜6:全面瓦解》HD中英雙字幕      /html/gndy/dyzz/20181107/57755.html

26          2018年劇情戰爭《颶風行動》BD中英雙字幕      /html/gndy/dyzz/20181106/57753.html

27         2018年劇情犯罪《你給的仇恨》HD中英雙字幕      /html/gndy/dyzz/20181106/57748.html

28    2018年高分動畫喜劇《超人總動員2》BD英國粵三語雙字      /html/gndy/dyzz/20181104/57728.html

29           2018年奇幻動畫《朝花夕誓》BD日語中字      /html/gndy/dyzz/20181104/57726.html

30     2018年科幻動作《巨齒鯊/極悍巨鯊》HD國英雙語雙字      /html/gndy/dyzz/20181104/57722.html

31  2018年動畫喜劇《精靈旅社3:瘋狂假期》BD英國粵三語雙字      /html/gndy/dyzz/20181103/57720.html

32     2018年高分奇幻《與神同行:罪與罰》BD韓粵雙語中字      /html/gndy/dyzz/20181102/57714.html

33    2018年高分奇幻《與神同行2:因與緣》BD韓語中英雙字      /html/gndy/dyzz/20181102/57713.html

34       2018年懸疑恐怖《修女/招魂外傳》BD中英雙字幕      /html/gndy/dyzz/20181102/57712.html

35          2018年喜劇《西虹市首富》HD國語中英雙字      /html/gndy/dyzz/20181101/57703.html

36       2018年劇情《巴比龍/逃離惡魔島》BD中英雙字幕      /html/gndy/dyzz/20181101/57702.html

37    2018年動作驚悚《伸冤人2/私刑教育2》BD中英雙字幕      /html/gndy/dyzz/20181101/57700.html

38            2018年愛情喜劇《牽線》BD中英雙字幕      /html/gndy/dyzz/20181031/57699.html

39           2018年劇情《三角草的春天》BD日語中字      /html/gndy/dyzz/20181031/57698.html

40            2018年驚悚恐怖《幼兒怨》BD粵語中字      /html/gndy/dyzz/20181029/57694.html

41      2018年高分喜劇《克里斯托弗·羅賓》BD中英雙字幕      /html/gndy/dyzz/20181028/57688.html

42         2018年動作歷史戰爭《大轟炸》HD中英雙字幕      /html/gndy/dyzz/20181028/57686.html

43  2018年高分劇情愛情《冷戰/沒有煙硝的愛情》BD中英雙字幕      /html/gndy/dyzz/20181027/57685.html

44      2018年冒險劇情《阿爾法:狼伴歸途》BD中英雙字幕      /html/gndy/dyzz/20181026/57682.html

45        2018年高分歷史戰爭《冒牌上尉》BD中英雙字幕      /html/gndy/dyzz/20181026/57681.html

46      2018年劇情動作《反貪風暴3/L風暴》HD國語中字      /html/gndy/dyzz/20181026/57680.html

47    2018年8.0分動畫喜劇《超人總動員2》HD中英雙字幕      /html/gndy/dyzz/20181025/57675.html
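Dropping rows by hard-coded index only works for this one snapshot of the page; the indices shift as soon as the site adds or removes a link. A hedged alternative is to keep rows by URL shape instead. The sketch below assumes, based on the output above, that detail pages look like `/html/<section>/.../<YYYYMMDD>/<id>.html`; the names `DETAIL_LINK`, `is_detail_link`, and `keep_detail_rows` are hypothetical.

```python
import re

# Assumed pattern: /html/<one or more lowercase sections>/<8-digit date>/<numeric id>.html
DETAIL_LINK = re.compile(r"^/html/(?:[a-z]+/)+\d{8}/\d+\.html$")


def is_detail_link(href: str) -> bool:
    """Return True for links that match the assumed detail-page pattern."""
    return bool(DETAIL_LINK.match(href))


def keep_detail_rows(names, hrefs):
    """Keep only the (name, href) pairs whose href is a detail page."""
    return [(n, h) for n, h in zip(names, hrefs) if is_detail_link(h)]
```

On the DataFrame itself, the same predicate could be applied with something like `view3[view3['URL'].map(is_detail_link)]` followed by `reset_index(drop=True)`.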

 

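These links contain only the path portion of the URL; the scheme and host are missing, so the site's base URL has to be prepended: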

wanzheng=[]
t=view3['URL']
for row in t:
    src='https://www.dytt8.net'+row
    wanzheng.append(src)
#print(wanzheng)

['https://www.dytt8.net/html/gndy/dyzz/20180103/55959.html', 'https://www.dytt8.net/html/gndy/dyzz/20171230/55920.html', 'https://www.dytt8.net/html/gndy/jddy/20171224/55874.html', 'https://www.dytt8.net/html/gndy/jddy/20171222/55861.html', 'https://www.dytt8.net/html/gndy/dyzz/20171218/55832.html', 'https://www.dytt8.net/html/gndy/dyzz/20171208/55749.html', 'https://www.dytt8.net/html/gndy/jddy/20171206/55738.html', 'https://www.dytt8.net/html/gndy/dyzz/20171205/55728.html', 'https://www.dytt8.net/html/gndy/dyzz/20171205/55726.html', 'https://www.dytt8.net/html/gndy/dyzz/20171202/55684.html', 'https://www.dytt8.net/html/gndy/jddy/20171202/55682.html', 'https://www.dytt8.net/html/gndy/dyzz/20171130/55665.html', 'https://www.dytt8.net/html/gndy/dyzz/20171130/55664.html', 'https://www.dytt8.net/html/gndy/dyzz/20171123/55603.html', 'https://www.dytt8.net/html/gndy/dyzz/20171105/55468.html', 'https://www.dytt8.net/html/tv/hytv/20110620/32833.html', 'https://www.dytt8.net/html/tv/gangtai/tw/20101227/30016.html', 'https://www.dytt8.net/html/tv/gangtai/tw/20100823/27737.html', 'https://www.dytt8.net/html/tv/gangtai/20100726/27236.html', 'https://www.dytt8.net/html/tv/gangtai/tw/20100517/26035.html', 'https://www.dytt8.net/html/tv/gangtai/tw/20100118/24051.html', 'https://www.dytt8.net/html/tv/gangtai/tw/20091005/22040.html', 'https://www.dytt8.net/html/tv/gangtai/tw/20091015/22259.html', 'https://www.dytt8.net/html/gndy/dyzz/20181108/57761.html', 'https://www.dytt8.net/html/gndy/dyzz/20181108/57760.html', 'https://www.dytt8.net/html/gndy/dyzz/20181107/57755.html', 'https://www.dytt8.net/html/gndy/dyzz/20181106/57753.html', 'https://www.dytt8.net/html/gndy/dyzz/20181106/57748.html', 'https://www.dytt8.net/html/gndy/dyzz/20181104/57728.html', 'https://www.dytt8.net/html/gndy/dyzz/20181104/57726.html', 'https://www.dytt8.net/html/gndy/dyzz/20181104/57722.html', 'https://www.dytt8.net/html/gndy/dyzz/20181103/57720.html', 
'https://www.dytt8.net/html/gndy/dyzz/20181102/57714.html', 'https://www.dytt8.net/html/gndy/dyzz/20181102/57713.html', 'https://www.dytt8.net/html/gndy/dyzz/20181102/57712.html', 'https://www.dytt8.net/html/gndy/dyzz/20181101/57703.html', 'https://www.dytt8.net/html/gndy/dyzz/20181101/57702.html', 'https://www.dytt8.net/html/gndy/dyzz/20181101/57700.html', 'https://www.dytt8.net/html/gndy/dyzz/20181031/57699.html', 'https://www.dytt8.net/html/gndy/dyzz/20181031/57698.html', 'https://www.dytt8.net/html/gndy/dyzz/20181029/57694.html', 'https://www.dytt8.net/html/gndy/dyzz/20181028/57688.html', 'https://www.dytt8.net/html/gndy/dyzz/20181028/57686.html', 'https://www.dytt8.net/html/gndy/dyzz/20181027/57685.html', 'https://www.dytt8.net/html/gndy/dyzz/20181026/57682.html', 'https://www.dytt8.net/html/gndy/dyzz/20181026/57681.html', 'https://www.dytt8.net/html/gndy/dyzz/20181026/57680.html', 'https://www.dytt8.net/html/gndy/dyzz/20181025/57675.html']
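String concatenation works here because every remaining link starts with `/`. A slightly more general sketch uses `urllib.parse.urljoin` from the standard library, which also handles page-relative links such as `list_23_2.html` and leaves absolute URLs untouched. The base URL below is an assumption (the list page path that appears in the scraped links), and `to_absolute` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumed base: the list page the links were scraped from
BASE_URL = "https://www.dytt8.net/html/gndy/dyzz/index.html"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the base page URL."""
    return [urljoin(base, href) for href in hrefs]
```

In the script, `wanzheng = to_absolute(view3['URL'])` would then take the place of the loop.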

These are the complete links. A bare list of links carries no movie names, though, so each link needs its movie name attached:

view4=pd.DataFrame(wanzheng,columns=['URL'])
view5=view3[['NAME']]
view6=view5.join(view4)
# Complete movie names paired with the links to their detail pages
# (the download links for a movie are only found on its detail page)

Sample output:

                             NAME                                                URL

0                   2017年科幻動作《全球風暴  https://www.dytt8.net/html/gndy/dyzz/20180103/...

1                   2017年高分恐怖《小丑回魂  https://www.dytt8.net/html/gndy/dyzz/20171230/...

2                   2017年驚悚懸疑《雪人/雪  https://www.dytt8.net/html/gndy/jddy/20171224/...

3                  2017年喜劇《性別之戰》BD  https://www.dytt8.net/html/gndy/jddy/20171222/...

 

So we need to go into each detail page to get the actual movie resource links. (The previous step looks like it was somewhat wasted effort.)

 

ZJURL = []
for row in view6['URL']:
    resp = urllib.request.urlopen(row)
    s = BeautifulSoup(resp, "lxml")
    # Walk down the tag hierarchy level by level for a precise match
    ftptag = s.find_all('table')
    for tag in ftptag:
        for tag2 in tag.find_all('tbody'):
            for tag3 in tag2.find_all('tr'):
                # href=True skips <a> tags that have no href attribute
                for tag4 in tag3.find_all('a', href=True):
                    ZJURL.append(tag4['href'])

# 'with' closes the file even if a write fails
with open(r'E:\ziyuan22.txt', 'w+', encoding='utf-8') as file:
    for title in ZJURL:
        # Pad with spaces to separate each link from the next
        file.write(title + '                ')
print('over')

The saved txt file ends up as one long line of links, each separated from the next by a run of spaces.

That's it: you now have a simple list of movie resources!
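The output file holds every href found in the detail-page tables, with no movie names attached. A hedged refinement is to keep only the hrefs that look like download links and to write one `name<TAB>link` line per link. The scheme list below is an assumption (the variable name `ftptag` suggests FTP links; the other prefixes are guesses to adjust against real pages), and `pick_download_links` and `format_lines` are hypothetical helpers.

```python
# Assumed schemes for download links; adjust to what the detail pages actually contain
DOWNLOAD_PREFIXES = ("ftp://", "magnet:", "thunder://")


def pick_download_links(hrefs):
    """Keep hrefs that look like download links, in order and without duplicates."""
    seen = set()
    picked = []
    for href in hrefs:
        if href.lower().startswith(DOWNLOAD_PREFIXES) and href not in seen:
            seen.add(href)
            picked.append(href)
    return picked


def format_lines(name, hrefs):
    """Return one 'name<TAB>link' line per download link."""
    return [f"{name}\t{link}\n" for link in pick_download_links(hrefs)]
```

Inside the crawl loop, the hrefs gathered for one detail page could be passed together with that movie's name to `format_lines`, and the result written with `file.writelines(...)`, so each line of the txt file can be pasted into the download tool on its own.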

 

Here is the complete, condensed code:

# -*- coding: utf-8 -*-
import io
import sys
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Assumed entry page: the site's home page. The drop indices further down
# depend on which page is fetched and on its current content.
url = 'https://www.dytt8.net/'
page = urllib.request.Request(url)
# Decoded as gb18030, like the saved copy of the page used earlier
page_info = urllib.request.urlopen(page).read().decode('gb18030', errors='ignore')

s = BeautifulSoup(page_info, "lxml")
# Collect the links from the <a> tags
mingzitag = s.find_all('a')
href = []
mingzis = []
for tag in mingzitag:
    mingzis.append(str(tag.text))
    href.append(tag.get('href', ''))  # '' when an <a> tag has no href
# The <a> tags include many useless entries, so build a DataFrame first to see
# the indices of the irrelevant rows and drop them
view = pd.DataFrame(mingzis, columns=['NAME'])
view2 = pd.DataFrame(href, columns=['URL'])
view3 = view.join(view2)
view3 = view3.drop(index=range(1, 23))   # rows 1-22
view3 = view3.drop(index=range(75, 88))  # rows 75-87
view3.reset_index(drop=True, inplace=True)
# The URLs are site-relative, so add the prefix to get the full detail-page URLs
wanzheng = []
t = view3['URL']
for row in t:
    src = 'https://www.dytt8.net' + row
    wanzheng.append(src)

# Open each full URL and collect the resource links (ZJURL)
view4 = pd.DataFrame(wanzheng, columns=['URL'])
ZJURL = []
for row in view4['URL']:
    resp = urllib.request.urlopen(row)
    s = BeautifulSoup(resp, "lxml")
    ftptag = s.find_all('table')
    for tag in ftptag:
        for tag2 in tag.find_all('tbody'):
            for tag3 in tag2.find_all('tr'):
                for tag4 in tag3.find_all('a', href=True):
                    ZJURL.append(tag4['href'])

# 'with' closes the file even if a write fails
with open(r'E:\最終電影資源.txt', 'w+', encoding='utf-8') as file:
    for title in ZJURL:
        file.write(title + '                ')
print('job Done!')
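The listing above requests every detail page back to back and stops at the first network error. A minimal sketch of a gentler loop follows, assuming a short pause between requests is acceptable. `fetch_bytes` and `fetch_all` are hypothetical helpers, and the `fetch` parameter exists only so the loop can be exercised without touching the network.

```python
import time
import urllib.request


def fetch_bytes(url, timeout=10.0):
    """Download one page and return its raw bytes (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch=fetch_bytes, delay=1.0):
    """Fetch each URL in turn; return (pages, failed) without aborting on errors."""
    pages, failed = {}, []
    for i, url in enumerate(urls):
        if i and delay:
            time.sleep(delay)  # pause between requests to avoid hammering the site
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as exc:
            # URLError, HTTPError and socket timeouts are OSError subclasses;
            # urlopen raises ValueError for a malformed URL
            failed.append((url, exc))
    return pages, failed
```

Each value in `pages` can be handed to `BeautifulSoup(..., "lxml")` as before, and `failed` lists the URLs worth retrying later.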