
Scraping All imooc (慕課網) Course Data with Scrapy and Storing It in a MySQL Database


Goal: use Scrapy to scrape all of the course data, namely:

1. course name  2. course description  3. course level  4. number of learners

and store it in a MySQL database (target URL: http://www.imooc.com/course/list).


Part 1. Exporting the Data to a Local File

1. Create a new project named imooc:

scrapy startproject imooc
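For reference, scrapy startproject generates roughly the following layout (details vary slightly across Scrapy versions); the files edited below all live under the inner imooc/ package:

imooc/
    scrapy.cfg          # deploy configuration
    imooc/              # project package
        __init__.py
        items.py        # item definitions (edited in step 2)
        pipelines.py    # item pipelines (edited in Part 2)
        settings.py     # project settings (edited in Part 2)
        spiders/        # spider code goes here (step 3)
            __init__.py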

2. Modify items.py and add the item fields:

from scrapy import Item, Field


class ImoocItem(Item):
    Course_name = Field()        # course name
    Course_content = Field()     # course description
    Course_level = Field()       # course level
    Course_attendance = Field()  # number of learners

3. Create the spider under the spiders directory:

vi imooc_spider.py

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from imooc.items import ImoocItem
from scrapy.http import Request


class Imooc(CrawlSpider):
    name = 'imooc'
    allowed_domains = ['imooc.com']
    # build the start URLs for all 30 list pages
    start_urls = []
    for pn in range(1, 31):
        url = 'http://www.imooc.com/course/list?page=%s' % pn
        start_urls.append(url)

    def parse(self, response):
        selector = Selector(response)
        Course = selector.xpath('//a[@class="course-card"]')

        for eachCourse in Course:
            item = ImoocItem()  # a fresh item per course
            Course_name = eachCourse.xpath('div[@class="course-card-content"]/h3[@class="course-card-name"]/text()').extract()[0]
            Course_content = eachCourse.xpath('div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/p[@class="course-card-desc"]/text()').extract()
            Course_level = eachCourse.xpath('div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[0]
            Course_attendance = eachCourse.xpath('div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[1]
            item['Course_name'] = Course_name
            item['Course_content'] = ';'.join(Course_content)  # the description may span several text nodes
            item['Course_level'] = Course_level
            item['Course_attendance'] = Course_attendance
            yield item
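Before running the full crawl, the XPath expressions can be sanity-checked interactively with scrapy shell (a quick sketch, assuming the page markup still matches the classes used above):

scrapy shell "http://www.imooc.com/course/list"
>>> cards = response.xpath('//a[@class="course-card"]')
>>> len(cards)   # should equal the number of courses on page 1
>>> cards[0].xpath('div[@class="course-card-content"]/h3[@class="course-card-name"]/text()').extract()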

4. Now the spider can be run to export the data; here we test with the CSV format:

scrapy crawl imooc -o data.csv -t csv

Check the exported file:

(screenshot: contents of data.csv)
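Besides CSV, the same feed exporter supports other formats; for example, JSON:

scrapy crawl imooc -o data.json -t json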

Part 2. Scraping the Data into a MySQL Database

1. Here MySQL is used to store the data, which requires the MySQLdb package; make sure it is installed.
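If it is missing, it can usually be installed with pip (under Python 2, which MySQLdb targets, the package is named MySQL-python):

pip install MySQL-python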

First, create the database and table:

-- create the database
create database imooc DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- create the table
create table imooc_info2(
    title varchar(255) NOT NULL COMMENT 'course name',
    content varchar(255) NOT NULL COMMENT 'course description',
    level varchar(255) NOT NULL COMMENT 'course level',
    sums int NOT NULL COMMENT 'number of learners'
);
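Before wiring up the pipeline, it can help to verify that MySQLdb can reach the new table. A minimal sketch, reusing the same credentials that appear in pipelines.py below:

# sanity check: connect with MySQLdb and describe the new table
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='hwfx1234',
                       db='imooc', charset='utf8')
cur = conn.cursor()
cur.execute("DESC imooc_info2")
for row in cur.fetchall():
    print(row)
conn.close()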

2. Modify pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


import json
import codecs

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy import log


class ImoocPipeline(object):
    """Write every item to imooc.json, one JSON object per line."""

    def __init__(self):
        self.file = codecs.open('imooc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):  # called automatically when the spider finishes
        self.file.close()


class MySQLPipeline(object):
    """Insert every item into imooc_info2 through a Twisted connection pool."""

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool("MySQLdb",
                                            db="imooc",          # database name
                                            user="root",         # database user
                                            passwd="hwfx1234",   # password
                                            cursorclass=MySQLdb.cursors.DictCursor,
                                            charset="utf8",
                                            use_unicode=True)

    def process_item(self, item, spider):
        # run the insert asynchronously on the pool's thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tb, item):
        tb.execute("""insert into imooc_info2 (title, content, level, sums)
                      values (%s, %s, %s, %s)""",
                   (item['Course_name'], item['Course_content'],
                    item['Course_level'], item['Course_attendance']))
        log.msg("Item data in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)
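A note on the design: twisted.enterprise.adbapi runs each insert on a thread pool, so process_item returns immediately and the crawl is not blocked waiting on MySQL; any failed insert surfaces through the handle_error errback. Also note that the scrapy.log module used here was deprecated in Scrapy 1.0; on current versions, Python's standard logging module (or spider.logger) should be used instead.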

3. Modify settings.py

Add the MySQL configuration and register the two pipeline classes defined in pipelines.py:

# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'imooc'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'hwfx1234'
# end of MySQL database configure setting

ITEM_PIPELINES = {
    'imooc.pipelines.ImoocPipeline': 300,
    'imooc.pipelines.MySQLPipeline': 300,
}
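As written, MySQLPipeline ignores these MYSQL_* settings because the connection parameters are hardcoded in its __init__. A minimal sketch of wiring them together via Scrapy's standard from_crawler hook (same class as above, other methods unchanged):

# sketch: build the connection pool from the MYSQL_* settings
from twisted.enterprise import adbapi
import MySQLdb.cursors

class MySQLPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        dbpool = adbapi.ConnectionPool("MySQLdb",
                                       host=s['MYSQL_HOST'],
                                       db=s['MYSQL_DBNAME'],
                                       user=s['MYSQL_USER'],
                                       passwd=s['MYSQL_PASSWD'],
                                       cursorclass=MySQLdb.cursors.DictCursor,
                                       charset="utf8",
                                       use_unicode=True)
        return cls(dbpool)

    # process_item, _conditional_insert and handle_error as shown above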

4. Start the spider:

scrapy crawl imooc

Check the database table; the data has been inserted.

(screenshot: rows in the imooc_info2 table)
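A couple of quick sanity checks can also be run from the mysql client, for example:

-- row count and a sample of the scraped data
select count(*) from imooc_info2;
select title, level, sums from imooc_info2 limit 5;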

Summary: this is a simple application of Scrapy. It does not yet address anti-scraping measures, distributed crawling, and similar issues; more practice is needed.
