If you ever come across a scenario where you need to download data off the internet, you’ll need a Python web crawler. There are two popular Python libraries for this purpose: Scrapy and BeautifulSoup.
What are web crawlers? What is web scraping? Which Python web crawler should you be using, Scrapy or BeautifulSoup? We’ll be answering all of these questions in this Scrapy vs BeautifulSoup comparison.
Web Scraping and Web Crawlers
Web scraping is the act of extracting or “scraping” data from a web page. The general process is as follows. First, the targeted web page is “fetched” or downloaded. Next, its content is parsed into a suitable format. Finally, we navigate through the parsed data, selecting the parts we want.
The web scraping process is fully automated and carried out by a bot which we call a “Web Crawler”. Web crawlers are typically written in a language like Python, using libraries such as BeautifulSoup or Scrapy.
BeautifulSoup vs Scrapy
BeautifulSoup is really just a content parser. It can’t do much else on its own; it even needs the requests library to retrieve the web page for it to scrape. Scrapy, on the other hand, is an entire framework made up of many components, an all-in-one solution to web scraping. Scrapy can retrieve, parse and extract data from a web page all by itself.
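To make the comparison concrete, here’s a minimal BeautifulSoup sketch; the target URL and selectors are only illustrative, but notice how requests has to fetch the page before BeautifulSoup can parse it.

import requests
from bs4 import BeautifulSoup

# requests downloads the page; BeautifulSoup only parses the HTML it is given
response = requests.get('http://quotes.toscrape.com/')
soup = BeautifulSoup(response.text, 'html.parser')

# Navigate the parsed tree and pull out each quote's text
for quote in soup.find_all('div', class_='quote'):
    print(quote.find('span', class_='text').get_text())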
By this point, you might be asking, why even learn BeautifulSoup? Scrapy is an excellent framework, but its learning curve is much steeper due to the large number of features, a harder setup, and complex navigation. BeautifulSoup is both easier to learn and use. Even someone who knows Scrapy well may use BeautifulSoup for simpler tasks.
The difference between the two is like the difference between a simple pistol and a rifle with advanced gear attached. The pistol, due to its simplicity, is easier and faster to use. The rifle, on the other hand, requires much more skill and training, but is ultimately far deadlier.
Scrapy Features
Some of the tasks below may also be possible with BeautifulSoup through alternate means, such as additional libraries. The point here, however, is that Scrapy has all of these features built into it, fully supported and compatible with its other features.
Improved Scraping
Built upon Twisted, an asynchronous networking framework, Scrapy is also faster and more memory-efficient than most other web scrapers.
Furthermore, it’s much more versatile and flexible. Websites often change their layout and structure over time, and a well-written Scrapy spider is not affected by minor changes like these; it will continue to work normally.
Using other classes and settings, like the Rule class, you can also adjust the behavior of the Scrapy Spider in many different ways.
Parallel Requests
Typically, web crawlers deal with one request at a time. Scrapy has the ability to run requests in parallel, allowing for much faster scraping.
In theory, if it takes a minute to execute 60 requests one at a time, then with 6 “concurrent” requests you could get them done in 10 seconds. In practice this isn’t always the case, due to overhead, latency and the time taken to actually download each page.
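As a rough sketch of what this looks like in practice, concurrency is controlled through a handful of settings; the values below are only illustrative, not recommendations.

# settings.py (or a spider's custom_settings) - illustrative values only
CONCURRENT_REQUESTS = 16            # total requests Scrapy may have in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap for any single domain
CONCURRENT_REQUESTS_PER_IP = 0      # 0 means the per-domain limit is used instead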
Cookies and User Agents
By default, web crawlers identify themselves as bots to the websites they access, through the User-Agent header sent with each request. This can be quite a problem when you’re trying to get around the bot protection on certain websites.
With the use of User Agents, Cookies and Headers in Scrapy, you can fool the website into thinking that an actual human is attempting to access the site.
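Here’s a minimal sketch of a spider sending a browser-like User-Agent, a custom header and its own cookies; the header string and cookie values are made up purely for illustration.

import scrapy

class StealthSpider(scrapy.Spider):
    name = 'stealth'

    # Present ourselves as a regular desktop browser (example User-Agent string)
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    }

    def start_requests(self):
        yield scrapy.Request(
            'http://quotes.toscrape.com/',
            headers={'Accept-Language': 'en-US,en;q=0.9'},
            cookies={'session': 'example-value'},  # hypothetical cookie
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Fetched %s', response.url)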
AutoThrottle
One of the major reasons why websites are able to detect Scrapy spiders (or any spider in general) is how fast the requests are made. Things get even worse when your Scrapy spider ends up slowing down the website because of the large number of requests made in a short period of time.
To prevent this, Scrapy has the AutoThrottle option. Enabling this setting will cause Scrapy to automatically adjust the scraping speed of the spider depending on the traffic load on the target website.
This benefits us because our Spider becomes a lot less noticeable and the chances of getting IP banned decrease significantly. The website benefits too, since the load is spread out more evenly instead of being concentrated in short bursts.
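Turning it on is just a matter of a few settings in settings.py (or a spider’s custom_settings); the values here are purely illustrative.

# A minimal AutoThrottle setup - values for illustration only
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay before Scrapy starts adapting
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the site responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of requests to run in parallel
AUTOTHROTTLE_DEBUG = True               # log every throttling decision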
Rate Limiting
The purpose of Rate or “Request” Limiting is the same as AutoThrottle’s: to increase the delay between requests and keep the spider off the website’s radar. There are all kinds of settings which you can adjust to achieve the desired result.
The difference between this setting and AutoThrottle is that rate limiting uses fixed delays, whereas AutoThrottle automatically adjusts the delay based on several factors.
As a bonus, you can actually use both AutoThrottle and the rate limiting settings together to create a more sophisticated crawler that’s both fast and hard to detect.
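As a sketch, a fixed-delay setup combined with AutoThrottle might look like the following; when both are enabled, DOWNLOAD_DELAY acts as the minimum delay that AutoThrottle will never go below (values are illustrative).

# Fixed rate limiting, optionally combined with AutoThrottle
DOWNLOAD_DELAY = 2.0              # wait roughly 2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the delay (0.5x to 1.5x) so requests look less robotic
AUTOTHROTTLE_ENABLED = True       # AutoThrottle will treat DOWNLOAD_DELAY as its floor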
Proxies and VPNs
If you need to send out a large number of requests to a website, it looks extremely suspicious when they all come from one IP address. If you’re not careful, your IP will get banned pretty quickly.
The solution to this is proxy support. Scrapy can route requests through proxies (or VPN endpoints), and rotating proxies let you make each request appear to arrive from a different location. Using this is the closest you’ll get to completely masking the presence of your Web crawler.
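Scrapy’s built-in proxy middleware honours a proxy key in each request’s meta; rotating proxies are usually layered on top of this through third-party middleware. A minimal sketch, where the proxy address is just a placeholder:

import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxy_example'
    start_urls = ['http://quotes.toscrape.com/']

    def start_requests(self):
        for url in self.start_urls:
            # Route the request through a proxy; replace the address with a real one
            yield scrapy.Request(url,
                                 meta={'proxy': 'http://your-proxy-host:8080'},
                                 callback=self.parse)

    def parse(self, response):
        self.logger.info('Fetched %s through the proxy', response.url)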
XPath and CSS Selectors
XPath and CSS selectors are key to making Scrapy a complete web scraping library. These are two powerful yet easy-to-use query techniques for picking data out of the HTML content of a web page.
XPath in particular is an extremely flexible way of navigating through the HTML structure of a web page. It’s more versatile than CSS selectors, being able to traverse both forward and backward.
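To make this concrete, here’s a small sketch using Scrapy’s standalone Selector on a made-up snippet of quotes.toscrape.com-style markup; note how the XPath ancestor axis steps backward up the tree, something CSS selectors can’t do.

from scrapy.selector import Selector

html = '<div class="quote"><span class="text">"To be."</span><small class="author">Someone</small></div>'
sel = Selector(text=html)

# CSS selector: grab the quote text (forward traversal)
print(sel.css('span.text::text').get())

# XPath: find the author, then walk back up to the enclosing quote <div>
print(sel.xpath('//small[@class="author"]/text()').get())
print(sel.xpath('//small[@class="author"]/ancestor::div[@class="quote"]').get())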
Debugging and Logging
Another one of Scrapy’s handy features is its built-in debugging and logging support. Everything that happens, from the headers used, to the time taken for each page to download, to the website latency, is printed out in the terminal and can also be written to a log file. Any errors or potential issues that occur are displayed as well.
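Logging behaviour is controlled through a couple of settings, and every spider also has its own logger; a small sketch where the file name is arbitrary:

# Illustrative logging configuration in settings.py
LOG_LEVEL = 'DEBUG'           # or 'INFO', 'WARNING', 'ERROR'
LOG_FILE = 'scrapy_run.log'   # write output to a file instead of just the terminal

# Inside any spider callback you can log your own messages too, for example:
#     self.logger.info('Parsed %d quotes from %s', count, response.url)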
Exception Handling
While web scraping on a large scale, you’ll run into all kinds of server errors, missing pages, internet issues, and so on. Scrapy, with its exception handling, allows you to gracefully address each of these issues without the whole crawl breaking down. You can even pause your Scrapy spider and resume it at a later time.
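Here’s a sketch of per-request error handling using an errback, along with the pause/resume feature via a persistent job directory; the spider name and the handled error types are just illustrative examples.

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError

class RobustSpider(scrapy.Spider):
    name = 'robust'
    start_urls = ['http://quotes.toscrape.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').get()}

    def on_error(self, failure):
        # Inspect the failure type and handle it without crashing the whole crawl
        if failure.check(HttpError):
            self.logger.warning('Bad status on %s', failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.warning('DNS lookup failed for %s', failure.request.url)
        elif failure.check(TCPTimedOutError):
            self.logger.warning('Request timed out: %s', failure.request.url)

# Pausing and resuming: run the spider with a persistent job directory, e.g.
#     scrapy crawl robust -s JOBDIR=crawls/robust-1
# Stop it with Ctrl-C, then run the same command again to resume where it left off.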
Scrapy Code
Below is some example Scrapy code that we’ve selected from our various tutorials to demonstrate here. Each example is accompanied by a brief description of its usage.
This first Scrapy code example features a Spider that scans through the entire quotes.toscrape.com site, extracting each and every quote along with the author’s name. We’ve used a Rule with a LinkExtractor to ensure that the Spider scrapes only certain pages (to save time and avoid duplicate quotes) and added some custom settings, such as AutoThrottle.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

    # Follow only pagination links ("page/") and skip tag pages to avoid duplicates
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
    }

    def parse_filter_book(self, response):
        for quote in response.css('div.quote'):
            yield {
                # The author's name sits in a <small class="author"> tag
                'Author': quote.xpath('.//small[@class="author"]/text()').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }
Another important feature that Scrapy has is link following, which can be implemented in different ways. For instance, the example above had link following enabled through its Rule.
In the below example, however, we’re doing it in another way, one that allows us to visit every page on Wikipedia, extracting the page name from every single one of them. In short, it’s a more controlled form of link following.
The below code will not actually scrape the entire site, due to the DEPTH_LIMIT setting. We’ve done this simply to keep the Spider close to its starting topic and to keep the scraping time reasonable.
from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'follower'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    base_url = 'https://en.wikipedia.org'

    custom_settings = {
        'DEPTH_LIMIT': 1   # only follow links one level deep from the start page
    }

    def parse(self, response):
        # Follow every link found inside the article's paragraphs
        for next_page in response.xpath('.//div/p/a'):
            yield response.follow(next_page, self.parse)

        # The <h1> holds the page name, which we yield as an item
        for quote in response.xpath('.//h1/text()'):
            yield {'quote': quote.extract()}
This section doesn’t really contribute much to the Scrapy vs BeautifulSoup debate, but it should give you an idea of what Scrapy code looks like.
Conclusion
If you’re a beginner, I would recommend BeautifulSoup over Scrapy. It’s simply easier in almost every way, from setup to usage. Once you’ve gained some experience, the transition to Scrapy becomes easier, as the two share overlapping concepts.
For simple projects, BeautifulSoup will be more than enough. However, if you’re really serious about making a proper web crawler, then you’ll have to use Scrapy.
Ultimately, you should learn both (while giving preference to Scrapy) and use either one of them depending on the situation.
This marks the end of the Scrapy vs BeautifulSoup article. Any suggestions or contributions are more than welcome. Questions regarding the article content can be asked in the comments section below.