Python Crawler Series: A Tencent Classroom (ke.qq.com) Scrapy Crawler
By 阿新 · Published 2019-01-30
Business requirement:
Crawl every course under the Cloud Computing & Big Data subcategory of Tencent Classroom's IT/Internet category, capturing four fields per course:
course title, price, number of buyers, and institution name.
1. Write the items.py file
Define the data fields to be crawled:
import scrapy

class TxktcrawlerItem(scrapy.Item):
    # define the fields for your item here:
    title = scrapy.Field()   # course title
    users = scrapy.Field()   # number of buyers
    price = scrapy.Field()   # course price
    agency = scrapy.Field()  # institution name
2. Create the table in MySQL
Since the scraped data will be stored in MySQL, first create the table:
use test;
create table txkt(
id int unsigned auto_increment primary key,
title char(50),
users int(10),
price float(10),
agency char(50)
);
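The schema and the insert logic used later in the pipeline can be sanity-checked without a running MySQL server. The sketch below replays an adapted version of the table against an in-memory SQLite database (SQLite uses `integer primary key autoincrement` instead of `int unsigned auto_increment`, and `?` placeholders instead of `%s`); this is an offline illustration, not the production code.

```python
import sqlite3

# In-memory stand-in for the MySQL table, to verify the column layout
# and a parameterized INSERT before wiring up the real pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table txkt(
        id integer primary key autoincrement,
        title text,
        users integer,
        price real,
        agency text
    )
""")
conn.execute(
    "insert into txkt(title,users,price,agency) values(?,?,?,?)",
    ("Example Course", 1234, 99.0, "Example Agency"),  # hypothetical row
)
row = conn.execute("select title, users, price, agency from txkt").fetchone()
print(row)  # -> ('Example Course', 1234, 99.0, 'Example Agency')
```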
3. Write the pipelines.py file
Store the scraped data in MySQL:
import pymysql

class TxktcrawlerPipeline(object):
    def __init__(self):
        # connect to the local MySQL server (adjust credentials as needed)
        self.conn = pymysql.connect(host="127.0.0.1",
                                    user="sunbin",
                                    passwd="100200",
                                    db="test",
                                    charset="utf8")

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        # iterate over every scraped course on the page; note the range
        # starts at 0 (the original started at 1, silently dropping the
        # first course of each page)
        for j in range(len(item["title"])):
            # parameterized query: lets pymysql handle quoting and avoids
            # SQL injection from course titles containing quotes
            sql = "insert into txkt(title,users,price,agency) values(%s,%s,%s,%s)"
            cursor.execute(sql, (item["title"][j],
                                 item["users"][j],
                                 item["price"][j],
                                 item["agency"][j]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
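On the live page, the buyer count and price come back as display strings rather than numbers (formats such as "1234人" or "¥99.00"/"免费" are an assumption here, not confirmed by the source). Since the table declares `users` as an integer and `price` as a float, a small normalizing helper could sit in the pipeline before the insert; this is a hypothetical sketch:

```python
import re

def parse_users(text):
    """Extract the leading integer from a buyer-count string such as '1234人'.
    Returns 0 if no digits are found."""
    m = re.search(r"\d+", text)
    return int(m.group()) if m else 0

def parse_price(text):
    """Extract a float price from strings such as '¥99.00'; free courses
    ('免费' or any string without digits) become 0.0."""
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else 0.0

print(parse_users("1234人"))   # -> 1234
print(parse_price("¥99.00"))  # -> 99.0
print(parse_price("免费"))     # -> 0.0
```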
4. Configure settings.py
Enable the pipeline:
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'txktcrawler.pipelines.TxktcrawlerPipeline': 300,
}
5. Write the spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from txktcrawler.items import TxktcrawlerItem

class TxktSpider(scrapy.Spider):
    name = 'txkt'
    allowed_domains = ['ke.qq.com']
    start_urls = ['https://ke.qq.com/course/list?mt=1001&st=2007']

    def parse(self, response):
        item = TxktcrawlerItem()
        # course titles
        item['title'] = response.xpath('//div[@class="market-bd market-bd-6 course-list course-card-list-multi-wrap"]//h4[@class="item-tt"]/a/@title').extract()
        # number of buyers
        item['users'] = response.xpath('//span[@class="line-cell item-user"]/text()').extract()
        # prices
        item['price'] = response.xpath('//div[@class="item-line item-line--bottom"]/span/text()').extract()
        # institution names
        item['agency'] = response.xpath('//span[@class="item-source"]/a/@title').extract()
        yield item
        # queue all 35 listing pages (range's upper bound is exclusive;
        # Scrapy's duplicate filter skips pages already crawled)
        for i in range(1, 36):
            nexturl = "https://ke.qq.com/course/list?mt=1001&st=2007&task_filter=0000000&page=" + str(i)
            yield Request(nexturl, callback=self.parse)
Note: the for loop queues requests for all 35 pages of results.
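The pagination URLs can be verified offline before running the spider. One subtlety: `range`'s upper bound is exclusive, so covering all 35 pages requires `range(1, 36)`. The base query string below is taken from the spider:

```python
# Rebuild the pagination URLs the spider queues and check the count
# and boundary pages.
base = "https://ke.qq.com/course/list?mt=1001&st=2007&task_filter=0000000&page="
urls = [base + str(i) for i in range(1, 36)]

print(len(urls))   # -> 35
print(urls[0])     # ends with page=1
print(urls[-1])    # ends with page=35
```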
6. Sample crawl results