Module scrapy_patterns

What is scrapy-patterns?

It's a library for Scrapy to help implementing spiders quicker. How? Many websites are built around patterns. The goal of this library is to provide elements for following those patterns. All you need to do is tell how to extract the necessary information, and the patterns in this library will do the rest (like following links, extracting items, etc…).

Concepts

Spiderlings

Spiderlings are "immature" spiders; they are not really meaningful on their own, only when combined with other spiderlings / spiders. They provide one functionality, like going through a list of pages from a given starting URL. Following is a description of currently existing spiderlings.

Site Pager

The SitePager class can be used for going through a pageable part of a website. It needs 3 user given objects to do its job:

  • NextPageUrlParser: Its responsibility is to checks whether the current page has a link for the next page, and how to extract it.

  • ItemUrlsParser: Should return the URL of items found on the page.

  • ItemParser: Should return a Scrapy Item from the URLs returned by ItemUrlsParser.

So to use SitePager, you implement the above mentioned 3 interfaces, wrap it in SitePageParsers, pass it to SitePager's constructor, and then call (yield) SitePager.start(), which will produce a request with which the scraping will continue. You can find an example for usage in CategoryBasedSpider.

Site Structure Discoverer

SiteStructureDiscoverer can read the hierarchy of a site. For example, a site may have main categories, each of them can have sub-categories and sub-categories could have further sub-categories, and so on. SiteStructureDiscoverer will parse this structure into a SiteStructure which is basically a tree, which then can be processed further. SiteStructureDiscoverer only needs information about how to extract the different levels of categories. This can be done through implementing a CategoryParser, for each level, and then pass a list of them to SiteStructureDiscoverer. The last element in the list should parse the leaf categories, which won't be processed further. This means, that if the site you want to scrape has only main categories, the list should contain one element only, if there are sub-categories, there should be two parsers, etc. Each parser get a response from the level above (the first element will get a starting URL response).
You can find an example for usage in CategoryBasedSpider.

Spiders

Category Based Spider

Combines SiteStructureDiscoverer and SitePager to scrape sites that are based on categories, sub-categories, sub-sub-categories, etc. and where leaf categories point to a pageable part of site from which items can be extracted.
This spider also keeps track of its state which is saved at regular checkpoints. Upon restarting the spider, if progress file exists scraping will continue (from the last saved page). This progress-saving mechanism has limitations compared to Scrapy's pausing, like it won't continue exactly where it left off, but it has the advantage that requests doesn't have to be serializable. Because of the nature of this mechanism, some URLs will be processed twice, resulting in possible duplicate items. You should keep this in mind when processing them. (Duplicate items could anyway occur since an item could belong to multiple categories.)
To use it inherit your spiders from it similarly how you inherit from Scrapy spiders, but also providing a starting URL, and rest of the needed data. You don't need to call CategoryBasedSpider.start_requests() as it will be handled by Scrapy. When the spider starts, it'll check whether a progress file exists, and if yes it will continue based on it. Otherwise it starts site structure discovering.

Expand source code
"""
.. include:: ./documentation.md
"""

Sub-modules

scrapy_patterns.request_factory

Contains the default request factory

scrapy_patterns.site_structure

Contains classes that are used to describe the structure of a site.

scrapy_patterns.spiderlings

Contains spiderlings.

scrapy_patterns.spiders

Contains spiders.