scrapy's allowed_domains setting is not taking effect

This topic was created 3207 days ago; the information mentioned may have changed since then.

As in the title. My spider class inherits from the base Spider:

from scrapy.spiders import Spider

class DSpider(Spider):

I also set:

allowed_domains=["baidu.com"]

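But the spider still recursively crawls links on other sites. I'm sure these are not reached through redirects; it crawls the other sites directly. I searched online, and the explanation I found was that I need to enable OffsiteMiddleware, though I couldn't find exactly which values to set.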

class scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware

After looking at that, I added some configuration more or less at random:

SPIDER_MIDDLEWARES = {
    'DomainSpider.middlewares.MyCustomSpiderMiddleware': 543,
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}

The result was an error:

exceptions.ImportError: No module named middlewares

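It seems that to enable this I would also have to create a middlewares.py and fill it with some unnecessary configuration. That feels like a lot of hassle, and I'm not sure whether I've taken a wrong turn. Any pointers would be much appreciated!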

3 replies • 2017-08-26 10:04:54 +08:00

bytenoob

Aug 25, 2017

Read the docs: setting it to None means it is disabled. The import path of the class also looks wrong.
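
To illustrate the point about None and the import path, here is a rough sketch of what the settings entry might look like, assuming a Scrapy 1.x install. In 1.x the class lives at scrapy.spidermiddlewares.offsite (the scrapy.contrib path is the pre-1.0 location), and it is already enabled by default at order 500, so simply deleting the None line should normally be enough. Later Scrapy releases may organise offsite filtering differently, so check the docs for your version.

```python
# settings.py (sketch, assuming Scrapy 1.x)
# A value of None DISABLES a middleware; an integer sets its order.
# OffsiteMiddleware is part of the built-in defaults at order 500, so listing
# it explicitly is optional and shown here only for illustration.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
}
```

The 'DomainSpider.middlewares.MyCustomSpiderMiddleware' entry is only needed if that module and class actually exist in the project; dropping it should avoid the ImportError.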

akmonde

Aug 25, 2017

@Yc1992
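The SPIDER_MIDDLEWARES = line is written in settings. Since I haven't created a middlewares module, it didn't take effect. I don't know at what point allowed_domains is supposed to kick in; the examples I've seen from other people don't seem to mention adding middlewares either.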
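
On the question of when allowed_domains kicks in: as far as I understand it, the offsite filter looks at each request the spider yields and keeps it only if the host is one of the allowed domains or a subdomain of one. The snippet below is a standalone approximation of that check using only the standard library. It is not Scrapy's actual code, and is_allowed is a made-up helper name.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["baidu.com"]
print(is_allowed("https://www.baidu.com/s?wd=scrapy", allowed))
print(is_allowed("https://example.com/", allowed))
```

Two things worth checking if offsite pages still get crawled: entries in allowed_domains should be bare domains such as "baidu.com" rather than full URLs, and requests created with dont_filter=True are not dropped by the offsite filter.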

akmonde

Aug 26, 2017

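Is there really nobody who can answer this? So awkward... I later thought it might be a version issue, but after upgrading scrapy it still behaves the same.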