
Scrapy fails to download images from URLs of the form https://demo?wx_fmt=jpeg

Original link: https://mmbiz.qlogo.cn/mmbiz/...
I am using Scrapy's ImagesPipeline:

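import re

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    """Scrapy image-processing pipeline."""

    # Request every image referenced in the item's HTML content
    def get_media_requests(self, item, info):
        content = str(item['content'])
        # Capture the full http:// or https:// URL of each src attribute
        match = re.findall(r'src="(https?://.*?)"', content)
        item['img_links'] = match
        for img_link in item['img_links']:
            yield scrapy.Request(img_link)

    # Called once all image requests for the item have finished
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no image")
        item['img_paths'] = image_paths
        return item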

Exception

2017-12-22 10:06:47 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1> referred in <None>
Traceback (most recent call last):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Users\zjx\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://mmbiz.qpic.cn/mmbiz/AWbBdRJFaKQ4vb5qV2Nyc41VAuLmiaqePia7hI0uMlE3KRbZEOsaB4jAPdibnzBAmKp1aCiateeXGXoicsAfMugCVog/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "C:\Users\zjx\Anaconda3\lib\site-packages\scrapy\pipelines\images.py", line 115, in get_images
    orig_image = Image.open(BytesIO(response.body))
  File "C:\Users\zjx\Anaconda3\lib\site-packages\PIL\Image.py", line 2519, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001842C76EFC0>
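
One thing worth noting in the log above: the requested URL contains literal `&amp;` sequences, which suggests the `src` values were taken from raw HTML without decoding entities, so the server may not receive the query parameters that were intended. Whether this is what causes the failure is not confirmed. A minimal sketch of decoding the entities before building requests, using only the standard library (`extract_image_urls` and `SRC_RE` are hypothetical names, not part of Scrapy):

```python
import html
import re

# Matches the http:// or https:// URL inside each src="..." attribute.
SRC_RE = re.compile(r'src="(https?://[^"]+)"')


def extract_image_urls(content):
    """Return image URLs found in raw HTML, with entities such as &amp; decoded."""
    return [html.unescape(url) for url in SRC_RE.findall(content)]
```

In `get_media_requests`, the list returned by such a helper could be assigned to `item['img_links']` in place of the raw `re.findall` result.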

My current analysis is that this link returns the image as base64, which Scrapy cannot recognize.
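
The base64 hypothesis can be checked by looking at the first bytes of the response body. The sketch below is only a diagnostic aid under stated assumptions: `sniff_image_format` and `describe_payload` are hypothetical helpers, and they recognize just a few common formats by their file signatures. Since the logged URL carries `tp=webp`, the server may also be returning WebP, which a Pillow build without WebP support would likewise fail to identify.

```python
import base64
import binascii


def sniff_image_format(body):
    """Guess an image format from the leading bytes, or return None."""
    if body.startswith(b'\xff\xd8\xff'):
        return 'jpeg'
    if body.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    if body[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    if body[:4] == b'RIFF' and body[8:12] == b'WEBP':
        return 'webp'
    return None


def describe_payload(body):
    """Classify a response body as a raw image, a base64-encoded image, or unknown."""
    fmt = sniff_image_format(body)
    if fmt:
        return fmt
    text = body.strip()
    # Drop a data-URI prefix such as "data:image/png;base64," if present.
    if text.startswith(b'data:') and b',' in text:
        text = text.split(b',', 1)[1]
    try:
        decoded = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return 'unknown'
    inner = sniff_image_format(decoded)
    return 'base64:' + inner if inner else 'unknown'
```

Calling `describe_payload(response.body)` from an overridden `get_images` (before handing the body to Pillow) and logging the result would show whether the payload is a raw image, base64 text, or something else such as an HTML error page.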

Answers

故林

Did you ever solve this? I ran into the same problem (this site only lets you send private messages after 24 hours, TAT).

August 3, 2017 07:34