
Crawler Example 1: Scraping a News List and Publication Times


1. Create a new project

scrapy startproject shop

2. Code for items.py:

import scrapy


class ShopItem(scrapy.Item):
    # Fields filled in by the spider
    title = scrapy.Field()  # news headlines
    time = scrapy.Field()   # publication times

3. Spider code in shopspider.py

# -*- coding: UTF-8 -*-
import scrapy

from shop.items import ShopItem


class shopSpider(scrapy.Spider):
    name = "shop"
    allowed_domains = ["news.xxxxxxx.xx.cn"]
    start_urls = ["http://news.xxxxx.xxx.cn/hunan/"]

    def parse(self, response):
        item = ShopItem()
        # Each field holds a list with one entry per <li> in the news block
        item['title'] = response.xpath("//div[@class='txttotwe2']/ul/li/a/text()").extract()
        item['time'] = response.xpath("//div[@class='txttotwe2']/ul/li/font/text()").extract()
        yield item
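The spider above yields a single item whose title and time fields are parallel lists. If one record per article is more convenient, the two lists can be paired up. A minimal sketch, assuming both lists have the same order; pair_news is a hypothetical helper, not part of Scrapy or the original project, and the sample values are made up:

```python
def pair_news(titles, times):
    """Pair each headline with its publication time (hypothetical helper)."""
    # zip stops at the shorter list, so a missing time never raises IndexError
    return [{"title": t.strip(), "time": d.strip()} for t, d in zip(titles, times)]


# Made-up sample values standing in for the two extract() results
records = pair_news(["Headline A ", "Headline B"], ["(2017-06-16)", "(2017-06-15)"])
for record in records:
    print(record["title"], record["time"])
```

Pairing with zip trades the IndexError risk of index-based loops for silently dropping unmatched entries, so it may be worth comparing the two list lengths first.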

4. Code for pipelines.py (prints the content):

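Note: if you print the content inside shopspider.py, it shows up as unicode escape sequences, whereas printing from pipelines.py shows the readable text. Under Python 2 this is most likely because printing the whole list returned by extract() displays each element's escaped repr, while the pipeline prints the strings one at a time.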

class ShopPipeline(object):
    def process_item(self, item, spider):
        # item['title'] and item['time'] are parallel lists
        count = len(item['title'])
        print('news count: %d' % count)
        for i in range(count):
            print('biaoti: ' + item['title'][i])    # biaoti = title
            print('shijian: ' + item['time'][i])    # shijian = time
        return item
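The note above about unicode output can be reproduced without Scrapy. A small sketch, assuming Python 3, where ascii() gives an escaped form similar to what Python 2 displayed when a whole list of unicode strings was printed; the sample title is made up:

```python
titles = ["新聞標題"]  # made-up sample standing in for an extract() result

# Escaped form of the whole list, similar to Python 2's output for `print titles`
escaped = ascii(titles)
print(escaped)

# Printing one element at a time shows the readable text
for title in titles:
    print(title)
```

This suggests the difference comes from printing a list versus printing its elements, rather than from which file does the printing.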

5. Crawl output:

~/shop# scrapy crawl shop --nolog

news count: 40

biaoti: xxx becomes a National Food Safety Demonstration City

shijian: (2017-06-16)

biaoti: Registration opens for the xxxx exam

……………………

…………………..
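The times in the output above come wrapped in parentheses, for example (2017-06-16). For sorting or filtering by date it may help to convert them. A minimal sketch using only the standard library; parse_pub_date is a hypothetical helper, and the YYYY-MM-DD format is assumed from the single sample shown:

```python
from datetime import datetime


def parse_pub_date(raw):
    """Convert a string such as '(2017-06-16)' to a date (hypothetical helper)."""
    # Drop surrounding whitespace first, then the enclosing parentheses
    cleaned = raw.strip().strip("()")
    return datetime.strptime(cleaned, "%Y-%m-%d").date()


print(parse_pub_date("(2017-06-16)"))  # 2017-06-16
```

strptime raises ValueError on anything that does not match the assumed format, which makes unexpected entries easy to spot.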

