
Python Crawler Series: A Scrapy Crawler for Tencent Classroom

Business requirement:

Crawl all course data in the Cloud Computing & Big Data subcategory under Tencent Classroom's IT & Internet category:

course title, price, number of buyers, and institution name.


1. Write the items.py file

Define the fields to be crawled:

import scrapy

class TxktcrawlerItem(scrapy.Item):
    # Fields scraped for each course card
    title = scrapy.Field()   # course title
    users = scrapy.Field()   # number of buyers
    price = scrapy.Field()   # price
    agency = scrapy.Field()  # institution name

2. Create the table in MySQL

Since the scraped data will be stored in MySQL, first create the table:

use test;
create table txkt(
    id int unsigned auto_increment primary key,
    title varchar(100),
    users int,
    price float,
    agency varchar(100)
) default charset=utf8;

3. Write the pipelines.py file

Store the scraped data in MySQL:

import pymysql

class TxktcrawlerPipeline(object):
    def __init__(self):
        self.conn=pymysql.connect(host="127.0.0.1",
                                  user="sunbin",
                                  passwd="100200",
                                  db="test",
                                  charset="utf8")

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        # Parameterized query: the driver handles quoting/escaping,
        # avoiding the injection and quoting bugs of string concatenation
        sql = "insert into txkt(title,users,price,agency) values(%s,%s,%s,%s)"
        # Start at index 0 so the first course on each page is not skipped
        for j in range(len(item["title"])):
            cursor.execute(sql, (item["title"][j],
                                 item["users"][j],
                                 item["price"][j],
                                 item["agency"][j]))
        self.conn.commit()
        cursor.close()
        return item

    def close_spider(self,spider):
        self.conn.close()
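The `users` and `price` values come off the page as display strings, while the table columns are numeric. A hypothetical pair of helpers (not in the original pipeline; the exact on-page text formats, such as "1234人" or "¥299", are an assumption) could normalize them before insertion:

```python
import re

def parse_users(text):
    """Extract a purchase count from strings like '1234人' or '1.2万人' (assumed formats)."""
    m = re.search(r'([\d.]+)(万)?', text)
    if not m:
        return 0
    n = float(m.group(1))
    if m.group(2):  # '万' means tens of thousands
        n *= 10000
    return int(round(n))

def parse_price(text):
    """Convert price text like '¥299' to a float; non-numeric text (e.g. '免费') maps to 0.0."""
    m = re.search(r'[\d.]+', text)
    return float(m.group(0)) if m else 0.0
```

With helpers like these, `process_item` could pass `parse_users(item["users"][j])` and `parse_price(item["price"][j])` so the values match the `int` and `float` column types.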

4. Configure settings.py

Enable the pipeline:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'txktcrawler.pipelines.TxktcrawlerPipeline': 300,
}

5. Write the spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from txktcrawler.items import TxktcrawlerItem

class TxktSpider(scrapy.Spider):
    name = 'txkt'
    allowed_domains = ['ke.qq.com']
    start_urls = ['https://ke.qq.com/course/list?mt=1001&st=2007']

    def parse(self, response):
        item=TxktcrawlerItem()
        item['title']=response.xpath('//div[@class="market-bd market-bd-6 course-list course-card-list-multi-wrap"]//h4[@class="item-tt"]/a/@title').extract()
        print(item['title'])
        item['users']=response.xpath('//span[@class="line-cell item-user"]/text()').extract()
        print(item['users'])
        item['price']=response.xpath('//div[@class="item-line item-line--bottom"]/span/text()').extract()
        print(item['price'])
        item['agency']=response.xpath('//span[@class="item-source"]/a/@title').extract()
        print(item['agency'])
        yield item

        # Pages 2-35; page 1 is already covered by start_urls
        for i in range(2, 36):
            nexturl = "https://ke.qq.com/course/list?mt=1001&st=2007&task_filter=0000000&page=" + str(i)
            yield Request(nexturl, callback=self.parse)

Note: the for loop requests the remaining list pages so that all 35 pages are crawled.
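The pagination logic can be sketched as a standalone URL builder (page count and query string taken from the spider above):

```python
def page_urls(pages=35,
              base="https://ke.qq.com/course/list?mt=1001&st=2007&task_filter=0000000"):
    """Build the list-page URLs for pages 1..pages."""
    return [base + "&page=" + str(i) for i in range(1, pages + 1)]

urls = page_urls()
```

Each URL only changes the `page` query parameter, so the same `parse` callback can handle every page.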

6. Example crawl results