Crawling Administrative Division Codes for All of China (Excluding Hong Kong, Macao, and Taiwan)
Overview
A web crawler's core job is to send a request to a given URL, receive the response, and parse it: on one hand extracting the data being sought, and on the other extracting new URL paths to follow next.
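As a minimal sketch of that cycle (the URL and the td/a selectors here are illustrative placeholders, not the actual target pages), one round of request, parse, and link extraction might look like this:

import requests
from urllib import parse
from bs4 import BeautifulSoup

def crawl_once(url):
    # Send the request and receive the response.
    resp = requests.get(url, timeout=10)
    # Parse the response.
    soup = BeautifulSoup(resp.text, 'lxml')
    # Extract the data of interest (here: all table-cell text, as a placeholder).
    data = [td.get_text() for td in soup.find_all('td')]
    # Extract new URL paths to crawl next, resolved against the current page.
    links = [parse.urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
    return data, links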
Crawl Target
A while back, when validating whether ID-card numbers are well formed, one of the checks was whether the first six digits are an actually existing division code, so I went looking for data at the National Bureau of Statistics: http://www.stats.gov.cn/. The most recent dataset was released on January 31, 2019: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/index.html. The overall approach is to fetch province, city, county/district, town, and village information in turn, using the links captured at each level to fetch the data for the level below.
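For context, once the codes are collected, that six-digit check boils down to a set lookup. A minimal sketch, assuming region_codes has been built from the crawled data (check_id_region and the sample values are hypothetical, not part of the crawler below):

def check_id_region(id_number, region_codes):
    # The first six digits of an 18-digit ID number are the division code of
    # the holder's registered region; a valid number must use a real code.
    return len(id_number) == 18 and id_number[:6] in region_codes

# region_codes would be built from the crawled CSVs, e.g.:
region_codes = {"110101", "110102", "130102"}            # hypothetical sample
print(check_id_region("110101199001011234", region_codes))  # True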
Crawl Steps
- 1. Page analysis: open the main page and examine it with the browser's developer tools (F12 brings them up directly).
- 2. Analyze the HTML of each page:
a. Province information sits in a tr with class="provincetr"; the fields to extract are the name and the link;
b. The city, county/district, and town pages are nearly identical, differing only in class: their rows are tr elements with class="citytr", "countytr", and "towntr" respectively, and the fields to extract are the name, the link, and the division code. Note that some divisions, such as municipal districts (市轄區), have no level below them and need a special check. Most divisions follow the order city, county/district, town, but a few skip the county/district level entirely (apparently just Zhongshan and Dongguan in Guangdong Province and Danzhou in Hainan Province) and go straight from city to town, which also needs special handling;
c. Village information sits in a tr with class="villagetr"; the fields to extract are the name, the urban-rural classification code, and the division code. A parsing sketch for these row layouts follows this list.
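As a concrete illustration of the row layout in 2(b) (the HTML snippet below is hand-written to mimic the real pages; the href, code, and name values are made up):

from bs4 import BeautifulSoup

# A hand-written stand-in for one row of a city page.
sample = ('<table><tr class="citytr">'
          '<td><a href="13/1301.html">130100000000</a></td>'
          '<td><a href="13/1301.html">石家庄市</a></td>'
          '</tr></table>')
row = BeautifulSoup(sample, 'lxml').find('tr', class_='citytr')
first_td = row.find('td')
code = first_td.a.get_text()               # division code: '130100000000'
name = first_td.next_sibling.a.get_text()  # name: '石家庄市'
link = first_td.a['href']                  # relative link to the level below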
Each level's page is reached through the link extracted at the level above, so the plan is: write one level's data to CSV first, then use that data to fetch the next level down. Provinces, cities, and counties/districts are few enough to fetch page by page directly; towns and villages are far more numerous, so for those I added proxies, paused for a random interval between requests, and crawled in segments. The following code fetches the province, city, and county/district information:
import urllib.request
import requests
import csv
import random
from urllib import parse
from bs4 import BeautifulSoup
import time

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

head = {'User-Agent': random.choice(user_agent_list)}

def get_province(url, file_str):
    # Parse the province index page and write rows like
    # ("北京市", "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/11.html") to CSV.
    province_list = []
    my_html = requests.get(url, headers=head)
    my_html.encoding = 'GB2312'
    my_soup = BeautifulSoup(my_html.text, 'lxml')
    my_tr = my_soup.find_all("tr", class_="provincetr")
    for my_td in my_tr:
        my_a = my_td.find_all("td")
        for my_href in my_a:
            my_url = parse.urljoin(url, my_href.a["href"])
            province_list.append((my_href.a.get_text(), my_url))
    with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
        write = csv.writer(f)
        for province_item in province_list:
            write.writerow([1, province_item[0], province_item[1]])

def get_info(url, class_str, file_str, *upper_name_list):
    # Fetch city, county/district, or town info. url is the page to fetch,
    # class_str is the class of the rows to extract, and upper_name_list is
    # the variable-length chain of parent division names.
    if url == "":
        return
    info_list = []
    head = {'User-Agent': random.choice(user_agent_list)}
    my_html = requests.get(url, headers=head)
    time.sleep(random.random())  # pause for a random 0~1 s
    my_html.encoding = 'GB2312'
    my_soup = BeautifulSoup(my_html.text, 'lxml')
    my_tr = my_soup.find_all("tr", class_=class_str)  # rows of class class_str
    for my_td in my_tr:
        if my_td.find("td").a:  # some rows carry links, some do not
            my_href = my_td.find("td").a["href"]
            my_href = parse.urljoin(url, my_href)
            my_code = my_td.find("td").a.get_text()
            my_name = my_td.find("td").next_sibling.a.get_text()
            info_list.append((my_name, my_code, my_href))
        else:
            my_href = ""
            my_code = my_td.find("td").get_text()
            my_name = my_td.find("td").next_sibling.get_text()
            info_list.append((my_name, my_code, my_href))
    with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
        write = csv.writer(f)
        for info_item in info_list:
            write.writerow([len(upper_name_list) + 1] + list(upper_name_list) + [info_item[0], info_item[1], info_item[2]])
    return 1

if __name__ == '__main__':
    base_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/index.html"  # entry page
    get_province(base_url, "province_code.csv")  # fetch provinces

    province_list = []  # fetch cities from the province links
    with open('province_code.csv', 'r') as f:
        read = csv.reader(f)
        for province in read:
            province_list.append(province)
    for province in province_list:
        print(province)
        city = get_info(province[2], "citytr", "city_code.csv", province[1])

    city_list = []  # fetch counties/districts from the city links
    with open('city_code.csv', 'r') as f:
        read = csv.reader(f)
        for city in read:
            city_list.append(city)
    for city in city_list:
        print(city)
        town = get_info(city[4], "countytr", "county_code.csv", city[1], city[2])
The province, city, and county/district information obtained this way ends up in province_code.csv, city_code.csv, and county_code.csv.
While fetching the county/district information, some requests failed partway through, for assorted reasons: "Max retries exceeded with url", the server refusing the connection, and my own network occasionally dropping out. After adding IP proxies and switching to requests.session, a few test runs showed the county/district failures becoming less frequent, but failures still occurred. Villages outnumber counties/districts by a factor of several dozen, so I eventually settled on a simpler scheme: whenever a fetch fails, pause for a while and fetch again, repeating until it succeeds. The code looks roughly like this:
try:  # fetch the page; if it fails, wait 10 s and fetch again
    my_html = s.get(url, headers=head, proxies=proxy)
except:
    print("#############################################################")
    time.sleep(10)
    a = get_info(url, class_str, file_str, *upper_name_list)
else:
    ...
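One caveat about this pattern: the recursive call's return value is discarded, and a long outage nests the calls ever deeper. A loop-based equivalent (a sketch; fetch_with_retry is my name for it, not from the code below) behaves the same without those issues:

import time
import requests

def fetch_with_retry(s, url, head, proxy, pause=10):
    # Keep retrying the GET, pausing between attempts, until it succeeds;
    # this mirrors the pause-and-retry-until-success behaviour described above.
    while True:
        try:
            return s.get(url, headers=head, proxies=proxy)
        except requests.RequestException:
            print("request failed, retrying in %d s" % pause)
            time.sleep(pause)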
The final code is as follows:
import urllib.request
import requests
import csv
import random
from urllib import parse
from bs4 import BeautifulSoup
import time

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

user_ip_list = [
    "http://171.41.80.142:9999",
    "http://171.41.80.231:9999",
    "http://112.85.172.58:9999",
    "http://111.79.199.161:9999",
    "http://110.52.235.184:9999",
    "http://110.52.235.198:9999",
    "http://122.193.244.244:9999",
    "http://223.241.78.26:8010",
    "http://110.52.235.54:9999",
    "http://116.209.53.214:9999",
    "http://112.85.130.221:9999",
    "http://60.190.250.120:8080",
    "http://183.148.151.218:9999",
    "http://183.63.101.62:53281",
    "http://112.85.164.249:9999",
]

requests.adapters.DEFAULT_RETRIES = 5  # number of reconnection attempts
head = {'User-Agent': random.choice(user_agent_list)}
proxy = {'http': random.choice(user_ip_list)}  # requests maps URL scheme -> proxy address

def get_province(url, file_str):
    # Parse the province index page and write rows like
    # ("北京市", "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/11.html") to CSV.
    province_list = []
    my_html = requests.get(url, headers=head)
    my_html.encoding = 'GB2312'
    my_soup = BeautifulSoup(my_html.text, 'lxml')
    my_tr = my_soup.find_all("tr", class_="provincetr")
    for my_td in my_tr:
        my_a = my_td.find_all("td")
        for my_href in my_a:
            my_url = parse.urljoin(url, my_href.a["href"])
            province_list.append((my_href.a.get_text(), my_url))
    with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
        write = csv.writer(f)
        for province_item in province_list:
            write.writerow([1, province_item[0], province_item[1]])

def get_info(url, class_str, file_str, *upper_name_list):
    # Fetch city, county/district, or town info. url is the page to fetch,
    # class_str is the class of the rows to extract, and upper_name_list is
    # the variable-length chain of parent division names.
    if url == "":
        return
    info_list = []
    head = {'User-Agent': random.choice(user_agent_list)}
    proxy = {'http': random.choice(user_ip_list)}
    s = requests.session()
    s.keep_alive = False  # close surplus connections
    try:  # fetch the page; if it fails, wait 10 s and fetch again
        my_html = s.get(url, headers=head, proxies=proxy)
    except:
        print("#############################################################")
        time.sleep(10)
        a = get_info(url, class_str, file_str, *upper_name_list)
    else:
        my_html.encoding = 'GB2312'
        my_soup = BeautifulSoup(my_html.text, 'lxml')
        my_tr = my_soup.find_all("tr", class_=class_str)  # rows of class class_str
        for my_td in my_tr:
            if my_td.find("td").a:  # some rows carry links, some do not
                my_href = my_td.find("td").a["href"]
                my_href = parse.urljoin(url, my_href)
                my_code = my_td.find("td").a.get_text()
                my_name = my_td.find("td").next_sibling.a.get_text()
                info_list.append((my_name, my_code, my_href))
            else:
                my_href = ""
                my_code = my_td.find("td").get_text()
                my_name = my_td.find("td").next_sibling.get_text()
                info_list.append((my_name, my_code, my_href))
        with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
            write = csv.writer(f)
            for info_item in info_list:
                write.writerow([len(upper_name_list) + 1] + list(upper_name_list) + [info_item[0], info_item[1], info_item[2]])
        return 1

def get_village(url, file_str, *upper_name_list):
    # Fetch village info. Village pages differ noticeably from the
    # city/county/town pages, hence the separate function.
    if url == "":
        return
    village_list = []
    head = {'User-Agent': random.choice(user_agent_list)}
    proxy = {'http': random.choice(user_ip_list)}
    s = requests.session()
    s.keep_alive = False  # close surplus connections
    try:  # fetch the page; if it fails, wait 10 s and fetch again
        my_html = s.get(url, headers=head, proxies=proxy)
    except:
        print("#############################################################")
        time.sleep(10)
        a = get_village(url, file_str, *upper_name_list)
    else:
        my_html.encoding = 'GB2312'
        my_soup = BeautifulSoup(my_html.text, 'lxml')
        my_tr = my_soup.find_all("tr", class_="villagetr")
        for my_td in my_tr:
            my_code = my_td.find("td").get_text()
            my_class_code = my_td.find("td").next_sibling.get_text()
            my_name = my_td.find("td").next_sibling.next_sibling.get_text()
            village_list.append((my_name, my_class_code, my_code))
        with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
            write = csv.writer(f)
            for village_item in village_list:
                write.writerow([len(upper_name_list) + 1] + list(upper_name_list) + [village_item[0], village_item[1], village_item[2]])
        return 1

if __name__ == '__main__':
    base_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/index.html"  # entry page
    get_province(base_url, "province_code.csv")  # fetch provinces

    province_list = []  # fetch cities from the province links
    with open('province_code.csv', 'r') as f:
        read = csv.reader(f)
        for province in read:
            province_list.append(province)
    for province in province_list:
        print(province)
        city = get_info(province[2], "citytr", "city_code.csv", province[1])

    city_list = []  # fetch counties/districts from the city links
    with open('city_code.csv', 'r') as f:
        read = csv.reader(f)
        for city in read:
            city_list.append(city)
    for city in city_list:
        print(city)
        town = get_info(city[4], "countytr", "county_code.csv", city[1], city[2])

    # Fetch towns directly from the city links: most divisions go
    # city -> county/district -> town, but a few (Zhongshan and Dongguan in
    # Guangdong Province, Danzhou in Hainan Province) skip the county/district level.
    city_list = []
    with open('city_code.csv', 'r') as f:
        read = csv.reader(f)
        for city in read:
            city_list.append(city)
    for city in city_list:
        print(city)
        # "" keeps the column count aligned with town_code_2.csv
        town = get_info(city[4], "towntr", "town_code_1.csv", city[1], city[2], "")

    county_list = []  # fetch towns from the county/district links
    with open('county_code.csv', 'r') as f:
        read = csv.reader(f)
        for county in read:
            county_list.append(county)
    for county in county_list:
        print(county)
        town = get_info(county[5], "towntr", "town_code_2.csv", county[1], county[2], county[3])

    town_list = []  # fetch villages from the town links
    with open('town_code_1.csv', 'r') as f:
        read = csv.reader(f)
        for town in read:
            town_list.append(town)
    with open('town_code_2.csv', 'r') as f:
        read = csv.reader(f)
        for town in read:
            town_list.append(town)
    for town in town_list:
        print(town)
        # the link is in column 7 (town[6]); town[1]~town[4] are the parent divisions
        village = get_village(town[6], "village_code.csv", town[1], town[2], town[3], town[4])
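To close the loop with the original ID-card motivation, the six-digit prefixes can be pulled from county_code.csv once the crawl finishes. A sketch, assuming the column layout produced by the writerow calls above ([level, province, city, name, code, link], so the 12-digit code sits at row[4]; load_region_codes is my name, not part of the crawler):

import csv

def load_region_codes(path='county_code.csv'):
    # Collect the six-digit prefixes of the 12-digit county-level codes,
    # for use as the region_codes set in the ID-number check shown earlier.
    codes = set()
    with open(path, 'r', encoding='gb2312', errors='ignore') as f:
        for row in csv.reader(f):
            if len(row) > 4 and row[4][:6].isdigit():
                codes.add(row[4][:6])
    return codes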
In the end, the data for every administrative village was collected, roughly 640,000 records in all.