When developing a site, you sometimes need to collect a large amount of data posted on another resource, and to do it quickly. A good solution in this case is a web page scraper: it saves time and is easy to implement.
In our practice, we've repeatedly dealt with such tasks, and now we'd like to share our experience with you. We're going to tell you about every detail of the web scraping process so that you know what difficulties you may encounter and how to solve these problems.
Interested? Then let's move on to the description of our Ruby web scraping project and the pitfalls we've successfully avoided.
What is a Web Scraper?
Web scraping is essentially an automated collection of data from one or more sites in order to fill your own resource with content, conduct a marketing analysis of the information obtained, and the like.
Of course, ideally, an application or site provides a special API for programmatically accessing its data, but if this isn't an option, web scraping is the only way out.
Usage Areas of Data Scraping:
- Creating a list of vendors for commercial use: extracting contact information about manufacturers, suppliers, or sellers.
- Scraping product data from various eCommerce platforms.
- Collecting targeted marketing information.
- Monitoring and comparing prices for goods or services in various stores.
- Obtaining data related to the Human Resources area (vacancies, employees, and others).
- Extracting breaking news from news resources.
- Obtaining content of a certain type: say, pictures and their descriptions (we'll talk about this kind of data scraping below).
How web scraping works
So the key idea of web scraping is clear. And what about its implementation?
Everything happens in 3 main stages:
- The code snippet used to extract the information sends a specific request to the required website(s).
- After receiving a response from the online resource(s), the web page scraper parses the HTML document using a specific data template.
- The extracted data is converted to a specified format set by the developers of the web page scraper.
Of course, this is a very rough description of the process; you'll learn more about web scraping in due time. Just keep reading.
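Before we get to the real code, here's a minimal sketch of these three stages in plain Ruby. The URL and the .product selectors are placeholders, not a real target:

require 'open-uri'
require 'nokogiri'

# Stage 1: send a request to the required website
raw_html = URI.open('https://example.com/catalog').read

# Stage 2: parse the HTML document using a data template (CSS selectors)
items = Nokogiri::HTML(raw_html).css('.product')

# Stage 3: convert the extracted data to a specified format (an array of hashes)
records = items.map { |item| { title: item.css('.product-title').first&.text&.strip } }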
BTW!
Did you know that e-commerce monsters like Walmart and Amazon have special data scraping departments? The goal is to collect price data from all over the Internet and use machine learning algorithms to help customers make the most profitable purchases.
And now it's time to clarify why we needed to start a web scraping project.
Our task and tech stack
We had been working on a Ruby on Rails resource which required fetching a series of publicly available design images from a couple of websites. It was one of the core features, so we decided to implement this functionality as part of the project.
If we had had more time, we could have decoupled this feature into a separate microservice. Alas, we were under a deadline, so our experts had no choice but to build the scraping right into the application.
The technologies we used to build the web scraper, apart from Rails, are Capybara, selenium-webdriver, and the webdrivers gem, plus whenever to schedule scraping runs. On top of that, we relied on open-uri (part of the Ruby standard library), Nokogiri (a Rails dependency), and, of course, ActiveRecord to communicate with our database.
If you're building a lightweight microservice, you can omit Rails entirely and use only the gems listed above.
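A side note on scheduling: the whenever gem lets you describe a cron-based schedule in config/schedule.rb. Here's a minimal sketch; scraper:run is a hypothetical name for a rake task wrapping your scraping code:

# config/schedule.rb (whenever gem DSL)
every 1.day, at: '4:30 am' do
  rake 'scraper:run' # hypothetical task that kicks off the scraper
end

Running whenever --update-crontab then writes the corresponding cron entry.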
Markup analysis
The first thing to do is to find markup features that will tell the script how to locate the elements containing the information you need. Usually, this means inspecting one of your source web pages with the browser's Developer Tools or any other HTML inspector.
Let’s look, for instance, at this sample HTML excerpt:
<ol class="s group is-scrolled">
<li id="screenshot-7805744" data-screenshot-id="7805744" class="group">
<div class="main">
<div class="shot with-actions">
<div class="img">
<a class="link" href="/shots/7805744-Dashboard-Illustration">
<picture>
<source srcset="https://xxx.cdn.com/users/76454/screenshots/7805744/09_2x.png">
<source srcset="https://xxx.cdn.com/users/76454/screenshots/7805744/09_1x.png">
<img alt="Dashboard Illustration" src="https://xxx.cdn.com/users/76454/screenshots/7805744/09_1x.png">
</picture>
</a>
<a class="over" href="/shots/7805744-Dashboard-Illustration">
<strong>Dashboard Illustration</strong>
<span class="timestamp">October 28, 2019</span>
</a>
</div>
<div class="shot-display-options">
<ul class="tools group" style="visibility: visible;">
<li class="fav">
<span class="toggle-fav">325</span>
</li>
<li class="cmnt">
<span original-title="">7</span>
</li>
<li class="views">
<span>22,650</span>
</li>
</ul>
<div class="shot-title-date">
<p class="shot-title">Dashboard Illustration</p>
<p class="timestamp">October 28, 2019</p>
</div>
<div class="extras">
<a href="/zarka">
<span class="rebound-mark is-rebound">
<img width="16" height="16" alt="Rebound" src="https://xxx.cdn.com/assets/c308.png">
</span>
</a>
</div>
<ul class="shot-actions">
<li data-bucket-container="true">
<a class="bucket-shot" title="Save shot" data-href="/signup/new" href="">
Save
</a>
</li>
<li class="like-action-7805744">
<a class="like-shot" title="Like this shot" href="">
Like
</a>
</li>
</ul>
</div>
</div>
</div>
<h2 class="attribution hover-card-parent">
<span class="attribution-user">
<a class="hoverable url" rel="contact" href="/zarka">
<img class="photo" alt="John Zarka" src="https://xxx.cdn.com/users/76454/avatars/872.png">
John Zarka
</a>
<a class="badge-link" href="/pro">
<span class="badge badge-pro">Pro</span>
</a>
</span>
</h2>
</li>
<li id="screenshot-7780088" data-screenshot-id="7780088" class="group">
<div>
<div class="shot with-actions">
<div class="img">
<a class="link" href="/shots/7780088-Colors"><picture>
<source srcset="https://xxx.cdn.com/users/76454/screenshots/7780088/web_2x.png">
<source srcset="https://xxx.cdn.com/users/76454/screenshots/7780088/web_1x.png">
<img alt="Colors" src="https://xxx.cdn.com/users/76454/screenshots/7780088/web_1x.png">
</picture>
</a>
<a class="over" href="/shots/7780088-Colors">
<strong>Colors</strong>
<span class="timestamp">October 23, 2019</span>
</a>
</div>
<div class="shot-display-options">
<ul class="tools group" style="visibility: visible;">
<li class="fav">
<span class="toggle-fav">272</span>
</li>
<li class="cmnt">
<span original-title="">2</span>
</li>
<li class="views">
<span>20,038</span>
</li>
</ul>
<div class="shot-title-date">
<p class="shot-title">Colors</p>
<p class="timestamp">October 23, 2019</p>
</div>
<div class="extras">
<a href="/shots/7780088-Colors/rebounds">
<span class="rebound-mark has-rebounds">0</span>
</a>
<a href="/zarka">
<span class="rebound-mark is-rebound">
<img width="16" height="16" alt="Rebound" src="https://xxx.cdn.com/assets/c308.png">
</span>
</a>
</div>
<ul class="shot-actions">
<li data-bucket-container="true">
<a class="bucket-shot form-btn outlined" title="Save shot" data-signup-trigger="true" data-href="/signup/new" data-context="bucket-shot" href="">
Save
</a>
</li>
<li class="like-action-7780088">
<a class="like-shot" title="Like this shot" href="">
Like
</a>
</li>
</ul>
</div>
</div>
</div>
<h2 class="attribution hover-card-parent">
<span class="attribution-user">
<a class="hoverable url" rel="contact" href="/zarka"><img class="photo" alt="John Zarka" src="https://cdn.com/users/76454/avatars/872.png">
John Zarka
</a>
</span>
</h2>
</li>
</ol>
Of course, no one can guarantee that the page structure will stay stable; even the best web scraper can fail in case of a major redesign. Still, you should rely on classes and ids rather than on a particular sequence of tags. That's also why you should test your selectors in the browser console before using them in your backend code.
In the example above, we may use the following selectors (a quick way to check them is sketched right after this list):
- [id^=screenshot] as an item container selector;
- .shot img to get a picture;
- .shot-title and .timestamp to retrieve the title and date, respectively;
- .attribution-user to obtain author information.
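Besides the browser console, you can sanity-check the selectors against a saved copy of the page with Nokogiri. A minimal sketch, assuming the excerpt above is saved locally as sample.html:

require 'nokogiri'

html = Nokogiri::HTML(File.read('sample.html'))

# Every selector should match at least one node if the markup analysis is right
['[id^=screenshot]', '.shot img', '.shot-title', '.timestamp', '.attribution-user'].each do |selector|
  puts "#{selector}: #{html.css(selector).size} match(es)"
end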
Now you have to determine whether the necessary portion of markup arrives in the initial HTML response or is rendered via JavaScript afterward.
To find out, view the page source, i.e., the raw response. If it contains the data you need, then the Nokogiri gem included in Rails by default will be enough. Otherwise, you'll have to emulate a web browser, which is a little harder and works slower.
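A quick way to check this programmatically is to fetch the raw response and see whether the container selector matches anything; here url stands for the page you're scraping:

require 'open-uri'
require 'nokogiri'

raw = URI.open(url).read # the raw response, no JavaScript executed

# Zero matches means the shots are rendered client-side,
# so you'll need a browser driver (see the Capybara section below)
puts Nokogiri::HTML(raw).css('[id^=screenshot]').size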
Using Nokogiri
Presuming you already have Screenshot and Author models, the approach is as simple as:
require 'open-uri' # makes open() work with URLs (use URI.open on Ruby 3+)

content = open(url).read
html = Nokogiri::HTML(content)

html.css('[id^=screenshot]').each do |shot_container|
  image_url = shot_container.css('.shot img').first['src']
  image_content = image_url.present? ? open(image_url) : nil
  title = shot_container.css('.shot-title').first.text
  date_string = shot_container.css('.timestamp').first.text
  author_container = shot_container.css('.attribution-user').first
  author_image_url = author_container.css('img').first['src']
  author_image_content = author_image_url.present? ? open(author_image_url) : nil

  # first_or_create needs a scope to search in, hence the where(...) call
  author = Author.where(name: author_container.css('[rel=contact]').first.text.strip)
                 .first_or_create(avatar: author_image_content)

  Screenshot.where(title: title, author: author).first_or_create(
    image: image_content,
    date: Date.strptime(date_string, '%B %d, %Y') # e.g. "October 28, 2019"
  )
end
You use selectors just like in jQuery. Once the response is read, you're actually working with your own in-memory copy of the HTML, so you shouldn't worry much about the speed of further operations: it's nothing but text parsing.
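For reference, here is roughly what the models assumed above could look like. These are placeholders: how avatar and image accept file content depends on your attachment setup (ActiveStorage, CarrierWave, and so on):

# app/models/author.rb (hypothetical minimal version)
class Author < ApplicationRecord
  has_many :screenshots
  # `avatar` should accept an IO object; wire it to your uploader of choice
end

# app/models/screenshot.rb (hypothetical minimal version)
class Screenshot < ApplicationRecord
  belongs_to :author
  # assumed columns: title (string), date (date), plus the image attachment
end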
Now let's move on to the case where web scraping borrows its tools from automated acceptance testing.
Using Capybara
In case your desired HTML is rendered by a JavaScript engine instead of arriving in the initial response, you should imitate a real browser by using one of the tools made for automated acceptance testing.
The Capybara gem is really popular in the Rails community, which isn't surprising given its ability to work with a number of browser drivers. You may pick any of them, but in our example, we resort to headless Chrome.
If you're using Rails, you probably already have the capybara, selenium-webdriver, and webdrivers gems in the :test group of your Gemfile. If not, add them; keep in mind that a scraper running outside the test environment needs them outside the :test group:
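# Gemfile
gem 'capybara'
gem 'selenium-webdriver'
gem 'webdrivers'

With the gems in place, you'll be able to configure Capybara to make headless Chrome act as the driver: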
require 'capybara'
require 'selenium/webdriver'
require 'webdrivers/chromedriver'

Capybara.register_driver :headless_chrome do |app|
  # Run Chrome without a visible window; newer selenium-webdriver versions
  # configure this through Selenium::WebDriver::Chrome::Options instead
  capabilities = Selenium::WebDriver::Remote::Capabilities.chrome(
    chromeOptions: { args: %w(headless disable-gpu) }
  )
  Capybara::Selenium::Driver.new(app,
    browser: :chrome,
    desired_capabilities: capabilities)
end

Capybara.default_driver = :headless_chrome
And use it like this:
session = Capybara::Session.new(:headless_chrome)
session.visit(url) # the browser fetches the page and executes its JavaScript

session.all('[id^=screenshot]').each do |shot_container|
  image_url = shot_container.all('.shot img').first['src']
  image_content = image_url.present? ? open(image_url) : nil
  title = shot_container.all('.shot-title').first.text
  date_string = shot_container.all('.timestamp').first.text
  author_container = shot_container.all('.attribution-user').first
  author_image_url = author_container.all('img').first['src']
  author_image_content = author_image_url.present? ? open(author_image_url) : nil

  # Same persistence logic as in the Nokogiri version
  author = Author.where(name: author_container.all('[rel=contact]').first.text.strip)
                 .first_or_create(avatar: author_image_content)

  Screenshot.where(title: title, author: author).first_or_create(
    image: image_content,
    date: Date.strptime(date_string, '%B %d, %Y')
  )
end
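One caveat: on a JavaScript-heavy page, the shots may not have rendered by the time visit returns. Capybara's finders retry for up to Capybara.default_max_wait_time seconds, but all doesn't wait unless you give it a count expectation, so a sketch like this can make the loop above more robust:

Capybara.default_max_wait_time = 10 # seconds; the default is 2

# Asking for a minimum makes all(...) retry until at least one
# shot container has rendered (or the wait time runs out)
session.visit(url)
shot_containers = session.all('[id^=screenshot]', minimum: 1)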
Summary
Now you've got the idea of web scraping, right? We did our best to explain all the details of the process, and we hope our analysis was useful.
However, if you still have questions, ask them without hesitation!