Python: a Scrapy linkextractors usage error
阿新 · Published: 2019-01-24
1. Environment and versions

Python 3.7.1 + Scrapy 1.5.1

2. Problem and error details

First, the problematic code:
import scrapy
from scrapy.linkextractors import LinkExtractor


class MatExamplesSpider(scrapy.Spider):
    name = 'mat_examples'
    # allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/gallery/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')
        links = le.extract_links(response)
        print(response.url)
        print(type(links))
        print(links)
Running the code produces the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/eric.luo/Desktop/Python/matplotlib_examples/matplotlib_examples/spiders/mat_examples.py", line 14, in parse
    links = le.extract_links(response)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 128, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/__init__.py", line 109, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 58, in _extract_links
    for el, attr, attr_val in self._iter_links(selector.root):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 46, in _iter_links
    for el in document.iter(etree.Element):
AttributeError: 'str' object has no attribute 'iter'
After the error appeared, I reviewed the code and found nothing wrong, and searching online turned up nothing relevant either. I then switched the rule to restrict_css and found that data could be fetched normally, so I went back to XPath; but this time, instead of LinkExtractor, I used Scrapy's built-in response.xpath() method to grab the href attribute of the target link tags, and that also returned normal data:
That is, changing:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')
links = le.extract_links(response)

to:

links = response.xpath('//a[contains(@class, "reference internal")]/@href').extract()
Then I looked at the error message again: 'str' object has no attribute 'iter'. But a correctly returned links should be of type list, not str, so I guessed the rule was written incorrectly, making the extracted data some kind of str instead of a list. Adjusting the rule in restrict_xpaths with that in mind, I finally found that removing the trailing /@href gets me the normal data I need:
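The root cause can be reproduced with lxml, which Scrapy's LxmlLinkExtractor uses under the hood: an XPath ending in /@href returns plain strings, while the extractor expects element nodes it can walk with .iter() (exactly the call that fails in the traceback above). A minimal sketch with a made-up snippet:

```python
import lxml.html

doc = lxml.html.fromstring(
    '<html><body><a class="reference internal" href="page.html">Page</a></body></html>'
)

# XPath to the element itself: returns HtmlElement nodes, which have .iter()
elements = doc.xpath('//a[contains(@class, "reference internal")]')
# XPath to the attribute: returns plain strings, which have no .iter()
hrefs = doc.xpath('//a[contains(@class, "reference internal")]/@href')

print(type(elements[0]), hasattr(elements[0], 'iter'))
print(type(hrefs[0]), hasattr(hrefs[0], 'iter'))
```

So restrict_xpaths must select the region (the a element), not an attribute of it.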
That is, changing:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]/@href')

to:

le = LinkExtractor(restrict_xpaths='//a[contains(@class, "reference internal")]')
Re-running the code, the data was fetched successfully; the output was as shown in the screenshot below:
*****I'm a scraping beginner, please be kind*****