1. 程式人生 > >python之scrapy(二)選擇器的使用

python之scrapy(二)選擇器的使用


 {
"cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selector的用法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "前面介紹過利用BeautifulSoup和PyQuery以及正則表示式來提取網頁資料,非常方便。而Scrapy也有自己的提取資料的方法,即Selector選擇器。Select是基於lxml來構建的,支援XPath選擇器、CSS選擇器以及正則表示式,功能齊全,解析速度和準確度非常高。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本來可以採用scrapy shell來除錯選擇器的使用方法。也可以直接使用Selector模組直接模擬。官網也提供了相應的方法:http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/selectors.html 。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "關於XPath的語法和運算子可以參考:http://www.runoob.com/xpath/xpath-tutorial.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. 兩種選擇器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由於在response中使用XPath、CSS查詢十分普遍,因此,Scrapy提供了兩個實用的快捷方式: response.xpath() 及 response.css():"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Selector xpath='//title/text()' data='Example website'>]\n",
      "[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]\n"
     ]
    }
   ],
   "source": [
    "from scrapy import Selector\n",
    "html='''\n",
    "<html>\n",
    " <head>\n",
    "  <base href='http://example.com/' />\n",
    "  <title>Example website</title>\n",
    " </head>\n",
    " <body>\n",
    "  <div id='images'>\n",
    "   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
    "   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
    "   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
    "   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
    "   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
    "  </div>\n",
    " </body>\n",
    "</html>\n",
    "'''\n",
    "selector = Selector(text=html)\n",
    "print(selector.xpath('//title/text()'))\n",
    "print(selector.css('title::text'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如你所見, .xpath() 及 .css() 方法返回一個類 SelectorList 的例項, 它是一個新選擇器的列表。這個API可以用來快速的提取巢狀資料。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 提取資料"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "為了提取真實的原文資料,你需要呼叫 .extract()和extract_first() 方法如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Example website']\n",
      "Example website\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//title/text()').extract())\n",
    "print(selector.css('title::text').extract_first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "extract()返回的是一個列表,extract_first()返回單個值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 根據屬性來提取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy import Selector\n",
    "html='''\n",
    "<html>\n",
    " <head>\n",
    "  <base href='http://example.com/' />\n",
    "  <title>Example website</title>\n",
    " </head>\n",
    " <body>\n",
    "  <div id='images'>\n",
    "   <a href='123image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
    "   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
    "   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
    "   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
    "   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
    "  </div>\n",
    " </body>\n",
    "</html>\n",
    "'''\n",
    "selector = Selector(text=html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "現在我們想獲取base標籤上的連結:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://example.com/\n",
      "http://example.com/\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//base/@href').extract_first())\n",
    "print(selector.css('base::attr(\"href\")').extract_first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "獲取圖片連結:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n",
      "['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//div[@id=\"images\"]/a/@href').extract())\n",
    "print(selector.css('#images a::attr(\"href\")').extract())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['123image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n"
     ]
    }
   ],
   "source": [
    "#print(selector.xpath('//a[contains(@href, \"image\")]/@href').extract())\n",
    "print(selector.css('a[href*=image]::attr(\"href\")').extract())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "獲取img標籤內的src內容:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']\n",
      "['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//a[contains(@href, \"image\")]/img/@src').extract())\n",
    "print(selector.css('a[href*=image] img::attr(src)').extract())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.巢狀選擇器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "選擇器方法( .xpath() 或 .css() )返回相同型別的選擇器列表,因此你也可以對這些選擇器呼叫選擇器方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy import Selector\n",
    "html='''\n",
    "<html>\n",
    " <head>\n",
    "  <base href='http://example.com/' />\n",
    "  <title>Example website</title>\n",
    " </head>\n",
    " <body>\n",
    "  <div id='images'>\n",
    "   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
    "   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
    "   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
    "   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
    "   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
    "  </div>\n",
    " </body>\n",
    "</html>\n",
    "'''\n",
    "selector = Selector(text=html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"123image1.html\">Name: My image '>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image2.html\">Name: My image 2 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image3.html\">Name: My image 3 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image4.html\">Name: My image 4 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image5.html\">Name: My image 5 <'>]\n",
      "連結0指向url是:123image1.html  和圖是:image1_thumb.jpg\n",
      "連結1指向url是:image2.html  和圖是:image2_thumb.jpg\n",
      "連結2指向url是:image3.html  和圖是:image3_thumb.jpg\n",
      "連結3指向url是:image4.html  和圖是:image4_thumb.jpg\n",
      "連結4指向url是:image5.html  和圖是:image5_thumb.jpg\n"
     ]
    }
   ],
   "source": [
    "links = selector.xpath('//a[contains(@href, \"image\")]')\n",
    "print(links)\n",
    "for index, link in enumerate(links):\n",
    "    args = (index, link.css('::attr(href)').extract_first(), link.xpath('img/@src').extract_first())\n",
    "    print ('連結%d指向url是:%s  和圖是:%s' % args)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 結合正則表示式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selector 也有一個 .re() 方法,用來通過正則表示式來提取資料。然而,不同於使用 .xpath() 或者 .css() 方法, .re() 方法返回unicode字串的列表。所以你無法構造巢狀式的 .re() 呼叫。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'My image 1 '"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//a[contains(@href, \"image\")]/text()').extract()\n",
    "selector.xpath('//a[contains(@href, \"image\")]/text()').re_first(r'Name:\\s(.*)')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}