
Learning Python Web Scraping: Crawling the Postal Codes of Every Province, City, and County in China

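Requirement: Use Python to scrape the postal codes of every province-, city-, and county-level locality in China from http://www.ip138.com/post/ and save them to an Excel file.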

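Environment: Python 3.7
       requests library (third-party, not part of the standard library; install with pip install requests)
       xlwt library (third-party; install with pip install xlwt)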

Target site:

   Step 1: View the page source of http://www.ip138.com/post/ to find the link for each province.
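Step 1 can be sketched with the standard library's HTML parser instead of manual string searching. This is a minimal sketch, not the method used in this article: the sample markup, the `ProvinceLinkParser` class, and the `extract_province_links` function are hypothetical, modeled on the `<map name="map_86">` block that the code in this article searches for, and the real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sample modeled on the <map> block of the index page;
# the attribute values are illustrative and the real markup may differ.
SAMPLE_HTML = '''
<map name="map_86" id="map_86">
<area shape="poly" coords="1,2,3,4" href="/10/" title="北京">
<area shape="poly" coords="5,6,7,8" href="/20/" title="上海">
</map>
'''


class ProvinceLinkParser(HTMLParser):
    """Collect {title: absolute URL} from <area> tags inside a <map>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_map = False
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag == "map":
            self.in_map = True
        elif tag == "area" and self.in_map:
            attr = dict(attrs)
            if attr.get("href") and attr.get("title"):
                # Resolve the relative href against the site root.
                self.links[attr["title"]] = urljoin(self.base_url, attr["href"])

    def handle_endtag(self, tag):
        if tag == "map":
            self.in_map = False


def extract_province_links(html, base_url="http://www.ip138.com/"):
    parser = ProvinceLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_province_links(SAMPLE_HTML)
for name, url in links.items():
    print(name, url)
```

A parser-based approach tolerates changes in attribute order and whitespace, which plain `find()` calls on fixed substrings do not.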

    

     Step 2: Follow a province's link to see the postal codes of the cities in that province.
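The table on a province page can be read row by row. The sketch below is an illustration under assumptions: the sample table, its placeholder values, the `RowCellParser` class, and the `parse_rows` function are all hypothetical, and the live page's markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical fragment of a province page with placeholder values.
SAMPLE_TABLE = '''
<table>
<tr><td><b>City</b></td><td><b>Postal code</b></td><td><b>Area code</b></td></tr>
<tr bgcolor="#ffffff"><td>City A</td><td>100000</td><td>010</td><td>City B</td><td>101100</td><td>010</td></tr>
<tr bgcolor="#ffffff"><td>City C</td><td>102200</td><td>010</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
</table>
'''


class RowCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # &nbsp; arrives as a non-breaking space, which strip() removes.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    parser = RowCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_rows(SAMPLE_TABLE)
for row in rows:
    print(row)
```

In this sample each data row carries two entries side by side, and a row with only one entry is padded with blank cells, which come back as empty strings.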

    

    

Code:

import requests
import xlwt


# Return a dict whose keys are province names and whose values are the
# URLs of the corresponding province pages.
def getProvinceCode(url):
    response = requests.get(url)
    response.encoding = response.apparent_encoding
    content = response.text
    marker = '<map name="map_86" id="map_86">'
    start = content.find(marker) + len(marker) + len("\n")
    end = content.find('</map>')
    mapStr = content[start:end]
    # print(mapStr)
    lines = mapStr.split("\n")
    baseUrl = 'http://www.ip138.com/'
    city_urls = []
    city_name = []
    for line in lines:
        if line:
            # The province code sits between href="/ and /".
            index1 = line.find('href="/') + len('href="/')
            index2 = line.find('/"')
            code = line[index1:index2]
            url = baseUrl + code
            city_urls.append(url)
            # The province name is the value of the title attribute.
            title1 = line.find('title="') + len('title="')
            title2 = line.find('"', title1)
            title = line[title1:title2]
            city_name.append(title)
    dict_prov_url = dict(zip(city_name, city_urls))
    for item in dict_prov_url.items():  # show each province name and its URL
        print(item)
    return dict_prov_url


# Given a province URL, collect the city name, postal code, and long-distance
# area code of each city in the province. Returns a two-dimensional list.
def getPostCode(url):
    response = requests.get(url)
    response.encoding = response.apparent_encoding
    content = response.text
    # This marker must match the table header text on the page exactly
    # (the live page may use the Simplified Chinese form of these characters).
    marker = '長途區號</b></td></tr>'
    start = content.find(marker) + len(marker)
    end = content.find('</table>', start)
    add_post = content[start:end]
    # posts holds the pieces left after splitting on <tr bgcolor="#ffffff">
    posts = add_post.strip().split('<tr bgcolor="#ffffff">')
    code_list = []
    for post in posts:
        if post:
            lines = post.strip().split('<td')
            # lines[4] is read below, so at least five pieces are required.
            if len(lines) >= 5:
                if 'nbsp' in lines[4]:
                    if len(lines) >= 6:
                        if 'nbsp' in lines[5]:
                            # Cells 4 and 5 are blank: take the entry from cells 1-3.
                            city = lines[1][lines[1].find('>') + len('>'):lines[1].find('</')]
                            post_code = lines[2][lines[2].find('">') + len('">'):lines[2].find('</')]
                            area_code = lines[3][lines[3].find('">') + len('">'):lines[3].find('</')]
                            code_list.append([city, post_code, area_code])
                        else:
                            # The city name is wrapped in <b>.
                            city = lines[1][lines[1].find('<b>') + len('<b>'):lines[1].find('</')]
                            post_code = lines[2][lines[2].find('">') + len('">'):lines[2].find('</')]
                            area_code = lines[3][lines[3].find('">') + len('">'):lines[3].find('</')]
                            code_list.append([city, post_code, area_code])
                elif len(lines) >= 7:
                    # The row holds two entries: cells 1-3 and cells 4-6.
                    city = lines[1][lines[1].find('>') + len('>'):lines[1].find('</')]
                    post_code = lines[2][lines[2].find('">') + len('">'):lines[2].find('</')]
                    area_code = lines[3][lines[3].find('">') + len('">'):lines[3].find('</')]
                    code_list.append([city, post_code, area_code])
                    city = lines[4][lines[4].find('>') + len('>'):lines[4].find('</')]
                    post_code = lines[5][lines[5].find('">') + len('">'):lines[5].find('</')]
                    area_code = lines[6][lines[6].find('">') + len('">'):lines[6].find('</')]
                    code_list.append([city, post_code, area_code])
    showPost(code_list)
    return code_list


# Print the two-dimensional list returned by getPostCode(url) to the terminal.
def showPost(code_list):
    for row in code_list:
        print(row)


# Write the results to an Excel file, one sheet per province.
def write_excel(path):
    # Create the workbook.
    workbook = xlwt.Workbook(encoding='utf-8')
    # Create one sheet per province.
    for title, url in getProvinceCode('http://www.ip138.com/post/').items():
        data_sheet = workbook.add_sheet(title)
        row0 = [u'城市名稱', u'郵政編碼', u'長途區號']  # header row of each sheet
        for i in range(len(row0)):
            data_sheet.write(0, i, row0[i])
        code_list = getPostCode(url)
        for i in range(len(code_list)):  # write every postal code record
            for j in range(len(code_list[i])):
                data_sheet.write(i + 1, j, code_list[i][j])
    workbook.save(path)


if __name__ == '__main__':
    path = './postcode.xls'
    write_excel(path)
    print(u'寫入postcode.xls檔案成功')
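The nested branches in getPostCode repeat the same three extractions for rows that hold one or two entries. As a hedged alternative, the sketch below splits a row into records in one helper. It assumes the cell texts of a row have already been extracted into a list of strings; the `split_row` name and the placeholder values are hypothetical and not part of the original code.

```python
def split_row(cells):
    """Split one table row into [city, post_code, area_code] records.

    `cells` is the list of cell texts of a single row. Cells are taken
    three at a time; a group whose city cell is blank (for example the
    padding in a half-empty row) is skipped.
    """
    records = []
    for i in range(0, len(cells) - 2, 3):
        # Treat a literal &nbsp; or a non-breaking space as blank.
        city, post_code, area_code = (
            c.replace("&nbsp;", " ").strip() for c in cells[i:i + 3]
        )
        if city:
            records.append([city, post_code, area_code])
    return records


# Placeholder values for illustration only.
full_row = ["City A", "100000", "010", "City B", "101100", "010"]
half_row = ["City C", "102200", "010", "&nbsp;", "&nbsp;", "&nbsp;"]

print(split_row(full_row))
print(split_row(half_row))
```

Each returned record has the same shape as the inner lists in code_list, so the records could be appended to it directly.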

Results:

  Terminal output:

  

   Excel file: