
Python: a scrapy linkextractors usage error


1. Environment and versions

python3.7.1+scrapy1.5.1

2. The problem and the failing code

First, the problematic code:

import scrapy
from scrapy.linkextractors import LinkExtractor


class MatExamplesSpider(scrapy.Spider):
    name = 'mat_examples'
    # allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/gallery/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')
        links = le.extract_links(response)
        print(response.url)
        print(type(links))
        print(links)

Running it produces the following traceback:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/eric.luo/Desktop/Python/matplotlib_examples/matplotlib_examples/spiders/mat_examples.py", line 14, in parse
    links = le.extract_links(response)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 128, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/__init__.py", line 109, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 58, in _extract_links
    for el, attr, attr_val in self._iter_links(selector.root):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 46, in _iter_links
    for el in document.iter(etree.Element):
AttributeError: 'str' object has no attribute 'iter'


After the error appeared, reviewing the code turned up nothing, and searching online found no related reports. Switching the rule to restrict_css fetched the data normally, so I went back to XPath; this time, instead of LinkExtractor, I used scrapy's built-in response.xpath() to read the href attribute of the target anchor tags, and that also returned the data correctly:

That is, changing:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')
links = le.extract_links(response)

to:

links = response.xpath('//a[contains(@class, "reference internal")]/@href').extract()

Looking again at the error message: 'str' object has no attribute 'iter'.

A normal links return value should be a list, not a str, so I guessed the rule was written wrongly and was producing some unexpected string instead of a list. Tweaking the restrict_xpaths rule with that in mind, I found that removing the trailing /@href made it return the data I needed. The reason: restrict_xpaths is meant to select the anchor elements themselves, and LinkExtractor then reads the href attribute from each of them. Selecting @href directly hands the extractor attribute strings, and a string has no .iter() method, hence the AttributeError.
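The type mismatch can be illustrated with the standard library alone (scrapy's extractor is lxml-based, but the principle is the same): an element node supports .iter(), while an attribute value is a plain string.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in fragment (invented for illustration).
root = ET.fromstring(
    '<div><a class="reference internal" href="tutorials/index.html">Tutorials</a></div>'
)

# Selecting the element itself: nodes support .iter(), which is what
# the extractor's internal link iteration calls.
a_el = root.find('.//a')
print(hasattr(a_el, 'iter'))        # True

# Selecting the attribute instead yields a plain string, which has no
# .iter() -- the same type mismatch behind the AttributeError above.
href = a_el.get('href')
print(hasattr(href, 'iter'))        # False
```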

That is, changing:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')

to:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]')

Re-running the code fetched the data successfully.


*****A web-scraping beginner here, so please go easy*****

  
