
Python crawler for fetching resources from a file-listing website (based on Python 3.6)

The script below crawls the NCEP/NOAA nosofs.v3.0.4 directory listing three levels deep, recreating the directory tree under D:\test\Index and downloading each file it finds, with simple progress output for larger downloads.


import urllib.request
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from Cat.findLinks import get_link   # Cat.findLinks and Cat.Load are the author's own
from Cat.Load import Schedule        # helper modules; their code is reproduced below
import os
import time
import errno

------- code of the other imported packages (Cat.findLinks / Cat.Load) ----------------
def get_link(page):  # collect the href targets from the listing table
    linkData = []
    for cell in page.find_all('td'):
        links = cell.select("a")
        for each in links:
            # if str(each.get('href'))[:1] == '/':  optional filter for absolute-path links
            data = each.get('href')
            linkData.append(data)
    return linkData
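As a quick illustration of what get_link returns, here is a minimal fragment in the style of an Apache directory listing (the HTML below is made up for the example, not taken from the NOAA page):

sample = '''
<table>
<tr><td><a href="sorc/">sorc/</a></td><td>-</td></tr>
<tr><td><a href="fix/">fix/</a></td><td>-</td></tr>
<tr><td><a href="README">README</a></td><td>2.1K</td></tr>
</table>
'''
soup = BeautifulSoup(sample, 'lxml')
print(get_link(soup))  # ['sorc/', 'fix/', 'README']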

def Schedule(a, b, c):  # progress hook, shows download progress for large files
    '''
    a: number of data blocks downloaded so far
    b: size of one data block
    c: total size of the remote file
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)
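A quick sanity check of the hook with made-up numbers (5 blocks of 8192 bytes out of a 1,000,000-byte file):

Schedule(5, 8192, 1000000)   # prints 4.10%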
----------end-------------------


def mkdir_p(path):  # recursively create nested directories (like mkdir -p)
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5 (use `except OSError, exc:` for Python <2.5)
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise
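Since the post targets Python 3.6, the same effect is available directly from the standard library; a minimal alternative sketch (the name mkdir_p_alt is introduced here):

def mkdir_p_alt(path):
    # os.makedirs grew an exist_ok flag in Python 3.2; it suppresses the EEXIST error
    os.makedirs(path, exist_ok=True)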

def file_Down(connet, file):  # download one file, reporting progress via Schedule
    urllib.request.urlretrieve(connet, file, Schedule)
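For a single file, file_Down can also be called on its own; the file name and local path below are hypothetical and shown only to illustrate the call:

file_Down('http://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nosofs.v3.0.4/some_file.txt',
          'D:/test/Index/some_file.txt')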

def decice(data):  # return 1 when the href contains '/', i.e. it points to a sub-directory
    a = '/'
    if a in data:
        return 1
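decice simply treats any href containing '/' as a directory entry; on Apache-style listings like this one, sub-directory links end with '/' while plain file names do not. For example:

print(decice('sorc/'))    # 1 -> create a matching local directory
print(decice('README'))   # None -> treat it as a file and download it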



def findAll():  # main routine
    url = 'http://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nosofs.v3.0.4/'
    page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(page, 'lxml')  # parse the top-level listing with BeautifulSoup
    links = get_link(soup)
    # print(links)

    for childLink in range(1, len(links)):  # skip the first entry (typically the parent-directory link)
        connet = urljoin(url, links[childLink])  # build the absolute URL
        page_next = urllib.request.urlopen(connet).read()
        soup_next = BeautifulSoup(page_next, 'lxml')
        link_next = get_link(soup_next)  # <a href=...> entries on the second-level page
        file = os.path.join('D:\\test\\Index' + "\\" + links[childLink])
        # decice(links[childLink])
        # file_cre = os.path.join('D:\\test\\Index', links[childLink])
        if decice(links[childLink]):
            mkdir_p(file)
        else:
            file_Down(connet, file)

        print(connet)
        for child_next in range(1, len(link_next)):
            connet_next = urljoin(connet, link_next[child_next])
            page_next = urllib.request.urlopen(connet_next).read()
            soup_nextF = BeautifulSoup(page_next, 'lxml')
            link_nextF = get_link(soup_nextF)  # <a href=...> entries on the third-level page
            fileF = os.path.join('D:/test/Index' + "/", links[childLink] + link_next[child_next])
            if decice(link_next[child_next]):
                mkdir_p(fileF)
            else:
                file_Down(connet_next, fileF)
            print("Start : %s" % time.ctime())
            time.sleep(4)  # pause between requests
            print("End : %s" % time.ctime())
            print(connet_next)
            for child_nextT in range(1, len(link_nextF)):
                connet_nextT = urljoin(connet_next, link_nextF[child_nextT])
                fileT = os.path.join('D:/test/Index' + "/", links[childLink] + link_next[child_next] + link_nextF[child_nextT])
                if decice(link_nextF[child_nextT]) == 1:
                    mkdir_p(fileT)
                else:
                    file_Down(connet_nextT, fileT)
                print(connet_nextT)


if __name__ == '__main__':
    findAll()
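The three nested loops in findAll walk the listing exactly three levels deep. The same idea can also be expressed as one recursive walker; below is a minimal sketch under the same assumptions (crawl_dir, DEST_ROOT and the depth limit are names introduced here, and it reuses get_link, mkdir_p and file_Down from above, so this is an illustration rather than the author's original code):

DEST_ROOT = 'D:/test/Index'   # local mirror root, same target directory as above

def crawl_dir(url, rel_path='', depth=3):
    # recursively mirror a directory listing, at most `depth` levels deep
    if depth == 0:
        return
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')
    for href in get_link(soup):
        if not href or href.startswith('/') or href.startswith('?'):
            continue                      # skip parent-directory and sort links
        target = urljoin(url, href)
        local = os.path.join(DEST_ROOT, rel_path, href.rstrip('/'))
        if href.endswith('/'):            # sub-directory: create it locally and recurse
            mkdir_p(local)
            crawl_dir(target, os.path.join(rel_path, href), depth - 1)
        else:                             # plain file: download it with progress output
            file_Down(target, local)
        print(target)

crawl_dir('http://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nosofs.v3.0.4/')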

