Link Extractors

Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) which will eventually be followed.

There is scrapy.linkextractors.LinkExtractor available in Scrapy, but you can create your own custom link extractors to suit your needs by implementing a simple interface.

The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
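
For illustration, here is a minimal, self-contained sketch of that interface. The URL and HTML body below are made up for the example; in a real spider the response comes from the crawling engine:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Build a response by hand so the snippet runs without a crawl
html = b'<html><body><a href="/catalog/item1.html">Item 1</a></body></html>'
response = HtmlResponse(url="http://www.example.com/catalog/", body=html, encoding="utf-8")

link_extractor = LinkExtractor()                 # instantiate once...
links = link_extractor.extract_links(response)   # ...then call per response

for link in links:
    # each result is a scrapy.link.Link with url and text attributes
    print(link.url, link.text)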

Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you don’t subclass from CrawlSpider, as their purpose is very simple: to extract links.
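
A sketch of that pattern in a regular Spider; the spider name, start URL and the idea of reusing a single extractor are illustrative, not a required layout:

import scrapy
from scrapy.linkextractors import LinkExtractor

class FollowLinksSpider(scrapy.Spider):
    name = 'followlinks'
    start_urls = ['http://www.example.com/']  # placeholder start page

    # create the link extractor once and reuse it for every response
    link_extractor = LinkExtractor()

    def parse(self, response):
        # follow every extracted link with the same callback
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)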

Built-in link extractors reference

Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module.

The default link extractor is LinkExtractor, which is the same as LxmlLinkExtractor:

from scrapy.linkextractors import LinkExtractor

There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
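
If you want to confirm this equivalence in your installed Scrapy version, a quick check (purely informational, not needed in normal code):

from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# LinkExtractor is expected to be an alias for LxmlLinkExtractor here,
# so both names should refer to the same class object
print(LinkExtractor is LxmlLinkExtractor)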

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.

Parameters:
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
  • deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
  • restrict_xpaths (str or list) – an XPath (or list of XPaths) defining regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
  • restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
  • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
  • attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
  • canonicalize (boolean) – canonicalize each extracted url (using scrapy.utils.url.canonicalize_url). Defaults to True.
  • unique (boolean) – whether duplicate filtering should be applied to extracted links.
  • process_value (callable) –

    a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

    For example, to extract links from this code:

    <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
    

    You can use the following function in process_value:

    import re

    def process_value(value):
        # return the page URL embedded in the javascript: call, or None
        m = re.search(r"javascript:goToPage\('(.*?)'", value)
        if m:
            return m.group(1)
    
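
Putting several of these options together, a hypothetical extractor configuration might look like the following; the patterns, domain and CSS selector are placeholders chosen for illustration, not values required by Scrapy:

import re

from scrapy.linkextractors import LinkExtractor  # alias of LxmlLinkExtractor

def extract_js_url(value):
    # same idea as the process_value example above: recover the real URL
    # from a javascript: link, or drop the link by returning None
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)

link_extractor = LinkExtractor(
    allow=r'/product/\d+',           # keep only product pages
    deny=(r'/login', r'/logout'),    # skip authentication pages
    allow_domains=['example.com'],
    restrict_css='div#content',      # look for links only inside this region
    process_value=extract_js_url,
)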