If you have not installed Scrapy yet, see the installation guide first.

Tools and environment:

Language: Python 3.7
IDE: PyCharm
Browser: Chrome
Scrapy: 1.6.0

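Goal: use Scrapy to scrape the Douban Books Top 250 list

Steps:

Create a Scrapy project
Define the Item to extract
Write a spider that crawls the site and extracts Items
Write an Item Pipeline to store the extracted Items (that is, the data)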

First, go to the folder where you want the project to live and run:

scrapy startproject doubanSpider

You should see a newly generated doubanSpider folder.

Item

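An Item is the container that holds the scraped data. Its fields play a role similar to the columns of a database table.

Edit items.py in the doubanSpider directory: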

import scrapy


class DoubanBookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Ranking
    ranking = scrapy.Field()
    # Book title
    book_name = scrapy.Field()
    # Score
    score = scrapy.Field()
    # Number of ratings
    score_num = scrapy.Field()
    # Short description (the one-line quote)
    des = scrapy.Field()

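Writing the first spider

The spider is responsible for scraping the data.

Your spider must subclass scrapy.Spider and define a few attributes and methods:

name: identifies the spider. It must be unique within the project; two different spiders cannot share the same name.

start_urls: the list of URLs the spider starts crawling from. The first pages downloaded come from this list; later URLs are extracted from the data in those pages.

parse(): a method of the spider. It is called with the Response object of each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting data (producing items), and producing Request objects for any URLs that need further processing.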

import scrapy
from doubanSpider.items import DoubanBookItem

class doubanSpider(scrapy.Spider):
    name = "douban_top250_book"
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        item = DoubanBookItem()

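That completes the basic structure. Next we use XPath to extract the information from the page; if you are not familiar with XPath, read an introduction to it first.

Now open https://book.douban.com/top250 and press F12 to inspect the page source.

You will find markup like this: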

<div class="indent">
        <p class="ulfirst"></p>
      <table width="100%">
        <tbody><tr class="item">
          <td width="100" valign="top">
            <a class="nbg" href="https://book.douban.com/subject/1770782/" onclick="moreurl(this,{i:'0'})">
              <img src="https://img3.doubanio.com/view/subject/m/public/s1727290.jpg" width="90">
            </a>
          </td>
          <td valign="top">       
            <div class="pl2">
              <a href="https://book.douban.com/subject/1770782/" onclick="&quot;moreurl(this,{i:'0'})&quot;" title="追风筝的人">
                追风筝的人

                
              </a>
                &nbsp; <img src="https://img3.doubanio.com/pics/read.gif" alt="可试读" title="可试读">
                <br>
                <span style="font-size:12px;">The Kite Runner</span>
            </div>
              <p class="pl">[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元</p>

              <div class="star clearfix">
                  <span class="allstar45"></span>
                  <span class="rating_nums">8.9</span>
                <span class="pl">(
                    420384人评价
                )</span>
              </div>
              <p class="quote" style="margin: 10px 0; color: #666">
                  <span class="inq">为你，千千万万遍</span>
              </p>
          </td>
        </tr>
      </tbody></table>
        <p class="ul"></p>
        ......
        ......

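Tip: press Ctrl+Shift+C and then click an element on the page to jump straight to it in the developer tools.

With that, we can write the XPath expressions that extract the information: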

class doubanSpider(scrapy.Spider):
    name = "douban_top250_book"
    start_urls = ['https://book.douban.com/top250']
    rank = 0

    def parse(self, response):
        books = response.xpath('//div[@class="indent"]//tr[@class="item"]')

        for book in books:
            # Create a fresh item for every book so yielded items never share state
            item = DoubanBookItem()
            self.rank += 1
            # No ranking was found in the page, so count manually
            item['ranking'] = self.rank
            item['book_name'] = book.xpath('.//div[@class="pl2"]/a/@title').extract_first()
            item['score'] = book.xpath('.//span[@class="rating_nums"]/text()').extract_first()
            # Raw string so that \d is passed to the regex engine unchanged
            item['score_num'] = book.xpath('.//div[@class="star clearfix"]/span[@class="pl"]/text()').re_first(r'(\d+)人评价')
            item['des'] = book.xpath('.//p[@class="quote"]/span/text()').extract_first()
            yield item
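
To see what the re_first() call is doing with the rating count, here is a small sketch that applies the same pattern with the standard re module. The input string is the text node of the span with class "pl" from the markup above, whitespace included; this illustrates the regular expression only and is not how Scrapy implements re_first().

```python
import re

# Text of <span class="pl"> as it appears in the page source,
# including the parentheses, newlines, and indentation around the number.
raw_text = "(\n                    420384人评价\n                )"

# Same pattern as in the spider: capture the digits right before "人评价".
match = re.search(r"(\d+)人评价", raw_text)

# Convert to int when found; fall back to None when the text is missing.
score_num = int(match.group(1)) if match else None
print(score_num)
```

Note that re_first() in the spider returns the captured text as a string, so a conversion like the int() call here would be needed if you want a number in the item.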

Tip: in Chrome you can use the XPath Helper extension to check whether an XPath expression is correct, or to get an automatically generated expression for an element.
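
Another way to sanity-check an expression offline is the Python standard library. The sketch below runs similar queries with xml.etree.ElementTree against a trimmed-down, well-formed copy of the table row shown earlier. Treat it as an approximation only: ElementTree supports just a subset of XPath and requires well-formed XML, so it does not replace Scrapy's selectors, and the fragment here is a simplification rather than the real page.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of one <tr class="item"> row (an assumption:
# the real page has more attributes and is not guaranteed to parse as XML).
fragment = """
<table>
  <tr class="item">
    <td>
      <div class="pl2">
        <a href="https://book.douban.com/subject/1770782/" title="追风筝的人">追风筝的人</a>
      </div>
      <div class="star clearfix">
        <span class="rating_nums">8.9</span>
        <span class="pl">( 420384人评价 )</span>
      </div>
      <p class="quote"><span class="inq">为你，千千万万遍</span></p>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(fragment)
row = root.find(".//tr[@class='item']")

# The same paths as in the spider, written in ElementTree's XPath subset.
title = row.find(".//div[@class='pl2']/a").get("title")
score = row.find(".//span[@class='rating_nums']").text
quote = row.find(".//p[@class='quote']/span").text

print(title, score, quote)
```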

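Run this command in the terminal:

scrapy crawl douban_top250_book -o douban.json

The -o douban.json option writes the results to the file douban.json.

The douban.json file turns out to be empty. The log that Scrapy just printed shows why:

[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://book.douban.com/top250>: HTTP status code is not handled or not allowed

Douban has anti-scraping measures in place, and our spider received a 403. We can work around this by sending headers that mimic a normal browser. With the headers in place the page is parsed, and the ranking can now be read from the onclick attribute of the cover link instead of being counted manually: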

import scrapy
from scrapy import Request  # needed by start_requests()

from doubanSpider.items import DoubanBookItem


class doubanSpider(scrapy.Spider):
    name = "douban_top250_book"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }

    def start_requests(self):
        url = 'https://book.douban.com/top250'
        yield Request(url, headers=self.headers)

    def parse(self, response):
        books = response.xpath('//div[@class="indent"]//tr[@class="item"]')

        for book in books:
            # Create a fresh item for every book so yielded items never share state
            item = DoubanBookItem()
            # The index is embedded in onclick, e.g. moreurl(this,{i:'0'})
            item['ranking'] = book.xpath('./td[1]/a/@onclick').re_first(r"'(\d+)'")
            item['book_name'] = book.xpath('.//div[@class="pl2"]/a/@title').extract_first()
            item['score'] = book.xpath('.//span[@class="rating_nums"]/text()').extract_first()
            item['score_num'] = book.xpath('.//div[@class="star clearfix"]/span[@class="pl"]/text()').re_first(r'(\d+)人评价')
            item['des'] = book.xpath('.//p[@class="quote"]/span/text()').extract_first()
            yield item
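
As an alternative to passing headers on every Request, Scrapy also has a USER_AGENT setting that applies to the whole project. A possible settings.py fragment, reusing the same browser string as above, might look like the following. Whether a User-Agent alone is enough to avoid the 403 depends on the site and may change over time, so treat this as a sketch.

```python
# settings.py (fragment)
# Default User-Agent header for requests made by this project.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
)
```

The per-request headers used in the spider remain useful when only some requests need special headers.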

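Now run the command again:

scrapy crawl douban_top250_book -o douban.json

This time douban.json is generated and contains data, but the Chinese text appears as \uXXXX Unicode escape sequences. Open settings.py and add one setting:

FEED_EXPORT_ENCODING = 'utf-8'

Run it once more and the output is readable.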
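
The escaped output is still valid JSON that decodes to the same text; it is just hard to read. The standard json module shows the same two forms through its ensure_ascii flag. This is an illustration with the json module, not Scrapy's exporter code, and the record is a hand-written sample.

```python
import json

record = {"book_name": "追风筝的人", "score": "8.9"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same data.
print(json.loads(escaped) == json.loads(readable))
```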

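Following links

So far we only get the 25 entries on the first page. What if we want all 250? You may have guessed it already: extract the link behind the "后页" (next page) button and crawl it too.

Open the developer tools again and look at the markup: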

<span class="next">
    <link rel="next" href="https://book.douban.com/top250?start=50">
    <a href="https://book.douban.com/top250?start=50">后页&gt;</a>
</span>

Update the parse() method of the spider:

    # inside class doubanSpider
    def parse(self, response):
        books = response.xpath('//div[@class="indent"]//tr[@class="item"]')

        for book in books:
            # Create a fresh item for every book so yielded items never share state
            item = DoubanBookItem()
            item['ranking'] = book.xpath('./td[1]/a/@onclick').re_first(r"'(\d+)'")
            item['book_name'] = book.xpath('.//div[@class="pl2"]/a/@title').extract_first()
            item['score'] = book.xpath('.//span[@class="rating_nums"]/text()').extract_first()
            item['score_num'] = book.xpath('.//div[@class="star clearfix"]/span[@class="pl"]/text()').re_first(r'(\d+)人评价')
            item['des'] = book.xpath('.//p[@class="quote"]/span/text()').extract_first()
            yield item

        # Follow the "next page" link; the last page has none, which ends the crawl
        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_url:
            yield Request(next_url, headers=self.headers)

Run it again and you will see that all 250 entries have been scraped.
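
Because the next-page link only differs in its start parameter, a different approach would be to generate all page URLs up front instead of following links. The sketch below assumes that the offset grows by 25 per page and that start=0 returns the first page; both are inferred from the 25-entries-per-page listing and the start=50 link shown above, not from any documented API.

```python
BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25   # entries per page, as observed on the first page
TOTAL = 250      # size of the Top 250 list

# One URL per page: start=0, 25, 50, ..., 225
page_urls = [f"{BASE_URL}?start={offset}" for offset in range(0, TOTAL, PAGE_SIZE)]

for url in page_urls:
    print(url)
```

Following the link from the page, as the spider does, is the more robust choice if the site ever changes its page size.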

pipeline

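When we need to do more with the data, for example:

Clean up HTML data
Validate the scraped data (check that an item contains certain fields)
Check for duplicates (and drop them)
Save the scraped results to a database

we need a pipeline.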

First, enable the pipeline in settings.py:

ITEM_PIPELINES = {
   'doubanSpider.pipelines.DoubanspiderPipeline': 300,
}

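The number is the priority. When several pipelines are enabled, it decides the order in which items pass through them: pipelines with lower values run first.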
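
To make the ordering concrete, the sketch below sorts a pipeline setting by its values, which mimics the lower-runs-first rule. Only DoubanspiderPipeline exists in this project; the other two class paths are hypothetical names added for illustration, and the sorting is a demonstration, not Scrapy's internal code.

```python
ITEM_PIPELINES = {
    'doubanSpider.pipelines.DoubanspiderPipeline': 300,
    'doubanSpider.pipelines.DedupPipeline': 100,      # hypothetical
    'doubanSpider.pipelines.DatabasePipeline': 800,   # hypothetical
}

# Items would pass through the pipelines in ascending order of their value.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in order:
    print(ITEM_PIPELINES[path], path)
```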

Now implement process_item() in pipelines.py:

import json

from scrapy.exceptions import DropItem  # raised to discard an item


class DoubanspiderPipeline(object):
    def __init__(self):
        self.file = open('doubanTop250book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Discard books whose score is below 8.5
        if float(item['score']) < 8.5:
            raise DropItem("Score below 8.5 in %s" % item)
        # ensure_ascii=False keeps the Chinese text readable in the file
        line = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

Here we drop every book whose score is below 8.5.

Now run it again:

scrapy crawl douban_top250_book

You should find a new doubanTop250book.json file in which every book has a score of 8.5 or higher.
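
One thing to be aware of: the pipeline writes one JSON object per line, each followed by a comma, so the file as a whole is not a single valid JSON document. A minimal sketch for reading it back line by line is shown below. The sample text stands in for the file contents; its second row is a made-up placeholder, and parse_lines is a hypothetical helper name.

```python
import json

# Stand-in for the contents of doubanTop250book.json.
# The second row is a placeholder, not real scraped data.
sample_output = (
    '{"ranking": "0", "book_name": "追风筝的人", "score": "8.9"},\n'
    '{"ranking": "1", "book_name": "<placeholder title>", "score": "9.1"},\n'
)


def parse_lines(text):
    """Parse text that holds one JSON object per line, each ending in a comma."""
    records = []
    for line in text.splitlines():
        line = line.strip().rstrip(",")  # drop the trailing comma
        if line:
            records.append(json.loads(line))
    return records


books = parse_lines(sample_output)
print(len(books), books[0]["book_name"])
```

With the real file you would pass the result of reading it with encoding='utf-8' to the same function.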

Debugging

How do you set breakpoints and debug a Scrapy spider in PyCharm? Create a new file named begin.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl douban_top250_book".split())

In PyCharm's run configuration, set begin.py as the script to run. You can then debug the spider just like any ordinary project.

Source code: https://github.com/UUNNFLY/doubanSpider

