Case Study: How We Built Web Scraper on Ruby on Rails

Helen Vakhnenko

April 02, 2024

When developing a site, one sometimes faces the need to collect a large amount of data posted on another resource, and do it in a short time. A good solution, in this case, is a web page scraper. It saves time and is easy to implement.

In our practice, we've repeatedly dealt with such tasks, and now we'd like to share our experience with you. We're going to tell you about every detail of the web scraping process so that you know what difficulties you may encounter and how to solve these problems.

Interested? Then let's move on to the description of our Ruby web scraping project and the pitfalls we've successfully avoided.

What is a Web Scraper?

Web scraping is essentially an automated collection of data from one or more sites in order to fill your own resource with content, conduct a marketing analysis of the information obtained, and the like.

Of course, ideally, an application or site provides a special API for programmatically accessing its data, but if this isn't an option, web scraping is the only way out.

Usage Areas of Data Scraping:

Creating a list of vendors for commercial use. It's about extracting contact information about manufacturers, suppliers or sellers.
Scraping product data from various eCommerce platforms.
Collection of targeted marketing information.
Monitoring and comparison of prices for goods or services in various stores.
Obtaining data related to the Human Resources area (vacancies, employees, and others).
Extracting breaking news from news resources.
Obtaining content of a certain type: say, pictures and their descriptions (we'll talk about this kind of data scraping format below).

How web scraping works

So the key idea of web scraping is clear. And what about its implementation?

Everything happens in 3 main stages:

The code snippet used to extract the information sends a specific request to the required website(s).
After receiving a response from the online resource(s), the web page scraper parses an HTML document using a specific data template.
The extracted data is converted to a specified format set by the developers of the web page scraper.

Of course, this is a very rough description of the process. The sections below fill in the details.
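As a rough illustration of the three stages, here is a minimal Ruby sketch that uses only the standard library. The method names, the TEMPLATE constant, and the patterns in it are assumptions made for this example. In particular, the regular expressions merely stand in for a real HTML parser such as Nokogiri, which is what you would use in practice.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Stage 1: send a request and return the raw response body.
def fetch_page(url)
  Net::HTTP.get(URI(url))
end

# Stage 2: extract fields from the markup using a "data template".
# Each entry maps a field name to a pattern with one capture group.
# Regular expressions are only a stand-in for a real HTML parser here.
TEMPLATE = {
  title: %r{<p class="shot-title">(.*?)</p>},
  date: %r{<p class="timestamp">(.*?)</p>}
}.freeze

def parse_page(html, template = TEMPLATE)
  columns = template.transform_values { |pattern| html.scan(pattern).flatten }
  # transpose raises IndexError if the fields were found a different number of times.
  columns.values.transpose.map { |row| columns.keys.zip(row).to_h }
end

# Stage 3: convert the extracted records to the target format (JSON here).
def to_json_lines(records)
  records.map { |record| JSON.generate(record) }
end

# Usage (the URL is a placeholder):
#   puts to_json_lines(parse_page(fetch_page('https://example.com/shots')))
```

Keeping the stages in separate methods means the parsing and conversion steps can be exercised on a saved HTML file without touching the network.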

BTW!

Did you know that e-commerce monsters like Walmart and Amazon have special data scraping departments? The goal is to collect price data from all over the Internet and use machine learning algorithms to help customers make the most profitable purchases.

And now it's time to clarify why we needed to start a web scraping project.

Our task and tech stack

We had been working on a special Ruby on Rails resource which required fetching a series of publicly available design images from a couple of websites. It was one of the core features, so we decided to implement this functionality as part of the project.

If we had had more time, we could have split the feature out into a separate microservice. Alas, we were under a deadline, so we had no choice but to build the scraping straight into the main application.

The technologies we used to build the web scraper, apart from Rails, are Capybara, selenium-webdriver, the webdrivers gem, and the whenever gem, which we needed to schedule scraping. We also relied on open-uri from the Ruby standard library, on Nokogiri, which is installed with Rails as a dependency, and, of course, on ActiveRecord to communicate with our database.

In case you’re trying to develop a lightweight microservice, you can omit Rails entirely and use only the gems listed above.

Markup analysis

The first thing to do for data scraping is to identify the markup features that tell the script how to find the elements containing the information you need. Usually this means inspecting one of your source web pages with Developer Tools or any other HTML inspector.

Let’s look, for instance, at this sample HTML excerpt:

<ol class="s group is-scrolled">
  <li id="screenshot-7805744" data-screenshot-id="7805744" class="group">
      <div class="main">
          <div class="shot with-actions">
              <div class="img">
                  <a class="link" href="/shots/7805744-Dashboard-Illustration">

                       <picture>
                          <source srcset="https://xxx.cdn.com/users/76454/screenshots/7805744/09_2x.png">
                          <source srcset="https://xxx.cdn.com/users/76454/screenshots/7805744/09_1x.png">
                          <img alt="Dashboard Illustration" src="https://xxx.cdn.com/users/76454/screenshots/7805744/09_1x.png">
                      </picture>
                  </a>
                  <a class="over" href="/shots/7805744-Dashboard-Illustration">
                      <strong>Dashboard Illustration</strong>
                      <span class="timestamp">October 28, 2019</span>
                  </a>
              </div>

              <div class="shot-display-options">
                  <ul class="tools group" style="visibility: visible;">
                      <li class="fav">
                          <span class="toggle-fav">325</span>
                      </li>
                      <li class="cmnt">
                          <span original-title="">7</span>
                      </li>
                      <li class="views">
                          <span>22,650</span>
                      </li>
                  </ul>

                  <div class="shot-title-date">
                      <p class="shot-title">Dashboard Illustration</p>
                      <p class="timestamp">October 28, 2019</p>
                  </div>

                  <div class="extras">
                      <a href="/zarka">
                          <span class="rebound-mark is-rebound">
                              <img width="16" height="16" alt="Rebound" src="https://xxx.cdn.com/assets/c308.png">
                          </span>
                      </a>
                  </div>

                  <ul class="shot-actions">
                      <li data-bucket-container="true">
                          <a class="bucket-shot" title="Save shot" data-href="/signup/new" href="">
                              Save
                          </a>
                      </li>

                      <li class="like-action-7805744">
                          <a class="like-shot" title="Like this shot" href="">
                              Like
                          </a>
                      </li>
                  </ul>
              </div>
          </div>
      </div>

      <h2 class="attribution hover-card-parent">
          <span class="attribution-user">
              <a class="hoverable url" rel="contact" href="/zarka">
                  <img class="photo" alt="John Zarka" src="https://xxx.cdn.com/users/76454/avatars/872.png">
                  John Zarka
              </a>
              <a class="badge-link" href="/pro">
                <span class="badge badge-pro">Pro</span>
              </a>
          </span>
      </h2>
  </li>
  <li id="screenshot-7780088" data-screenshot-id="7780088" class="group">
      <div>
          <div class="shot with-actions">
              <div class="img">
                  <a class="link" href="/shots/7780088-Colors"><picture>
                      <source srcset="https://xxx.cdn.com/users/76454/screenshots/7780088/web_2x.png">
                      <source srcset="https://xxx.cdn.com/users/76454/screenshots/7780088/web_1x.png">
                      <img alt="Colors" src="https://xxx.cdn.com/users/76454/screenshots/7780088/web_1x.png">
                  </picture>
                  </a>
                  <a class="over" href="/shots/7780088-Colors">
                      <strong>Colors</strong>
                      <span class="timestamp">October 23, 2019</span>
                  </a>
              </div>

              <div class="shot-display-options">
                  <ul class="tools group" style="visibility: visible;">
                      <li class="fav">
                          <span class="toggle-fav">272</span>
                      </li>
                      <li class="cmnt">
                          <span original-title="">2</span>
                      </li>
                      <li class="views">
                          <span>20,038</span>
                      </li>
                  </ul>

                  <div class="shot-title-date">
                      <p class="shot-title">Colors</p>
                      <p class="timestamp">October 23, 2019</p>
                  </div>

                  <div class="extras">
                      <a href="/shots/7780088-Colors/rebounds">
                          <span class="rebound-mark has-rebounds">0</span>
                      </a>
                      <a href="/zarka">
                          <span class="rebound-mark is-rebound">
                              <img width="16" height="16" alt="Rebound" src="https://xxx.cdn.com/assets/c308.png">
                          </span>
                      </a>
                  </div>

                  <ul class="shot-actions">
                      <li data-bucket-container="true">
                          <a class="bucket-shot form-btn outlined" title="Save shot" data-signup-trigger="true" data-href="/signup/new" data-context="bucket-shot" href="">
                              Save
                          </a>
                      </li>

                      <li class="like-action-7780088">
                          <a class="like-shot" title="Like this shot" href="">
                              Like
                          </a>
                      </li>
                  </ul>
              </div>
          </div>
      </div>

      <h2 class="attribution hover-card-parent">
          <span class="attribution-user">
              <a class="hoverable url" rel="contact" href="/zarka"><img class="photo" alt="John Zarka" src="https://cdn.com/users/76454/avatars/872.png">
                  John Zarka
              </a>
          </span>
      </h2>
  </li>
</ol>

As you can see, there is a list of shots with additional information. Let's say it's necessary to fetch images, titles, and creation dates, as well as the authors' names and avatars. We have to write selectors (using either CSS or XPath syntax) to get to the required content, keeping in mind that the markup might change a bit and that this shouldn't break our logic.

Of course, no one can guarantee that the page structure will stay stable, and even the best web scraper can fail after a major redesign. Still, you should rely on classes and ids rather than on a particular sequence of tags, and test your selectors in the browser console before using them in your backend code.

In the example above, we may use…

[id^=screenshot] as an item container selector;
.shot img to get a picture;
.shot-title and .timestamp to retrieve the title and date, respectively;
.attribution-user to obtain author information.

Now you have to determine whether the markup you need is served in the HTML response or rendered by JavaScript afterward.

To find out, view the page source as a raw response. If it contains the data you need, the Nokogiri gem, which is installed with Rails as a dependency, is enough. Otherwise, you'll have to emulate a web browser, which is a little harder and runs slower.
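The same check can be made from Ruby instead of the browser: download the raw response, which never executes JavaScript, and look for the markers you expect. This is only a sketch. The served_in_html? and suggest_tool helpers, the MARKERS list, and the example.com URL are assumptions for illustration, with the markers taken from the sample markup above.

```ruby
require 'net/http'
require 'uri'

# Markers that should appear in the raw response if the server renders the shots.
MARKERS = ['id="screenshot-', 'class="shot-title"'].freeze

# True when every marker is present in the raw (pre-JavaScript) HTML.
def served_in_html?(raw_html, markers = MARKERS)
  markers.all? { |marker| raw_html.include?(marker) }
end

# Downloads the page without running any JavaScript and suggests a tool.
# Note: Net::HTTP.get does not follow redirects.
def suggest_tool(url)
  raw_html = Net::HTTP.get(URI(url))
  served_in_html?(raw_html) ? :nokogiri : :headless_browser
end

# Usage (placeholder URL):
#   suggest_tool('https://example.com/shots')
```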

Using Nokogiri

Presuming you already have Screenshot and Author models, the approach is as simple as:

require 'open-uri'
require 'nokogiri'

# Since Ruby 3.0, Kernel#open no longer fetches URLs; call URI.open explicitly.
content = URI.open(url).read
html = Nokogiri::HTML(content)

html.css('[id^=screenshot]').each do |shot_container|
  image_url = shot_container.css('.shot img').first['src']
  image_content = image_url.present? ? URI.open(image_url) : nil
  title = shot_container.css('.shot-title').first.text
  date_string = shot_container.css('.timestamp').first.text

  author_container = shot_container.css('.attribution-user').first
  author_name = author_container.css('[rel=contact]').first.text.strip
  author_image_url = author_container.css('img').first['src']
  author_image_content = author_image_url.present? ? URI.open(author_image_url) : nil

  # Scope the lookup before first_or_create; called on the bare model it would
  # return the first row in the table. Matching on name (and on title below)
  # is an assumption about what identifies a record.
  author = Author.where(name: author_name).first_or_create(avatar: author_image_content)

  Screenshot.where(title: title, author: author).first_or_create(
    image: image_content,
    date: Date.strptime(date_string, '%B %d, %Y')
  )
end

Of course, the above code lacks error handling, and the database queries aren't optimized (among other things, the query for the author is performed at each iteration, even if it's the same author every time), but you get the general idea.
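One way to avoid repeating the author query is to memoize records for the duration of a scraping run. RecordCache below is a hypothetical helper introduced for illustration, and the commented usage assumes that an author is identified by name.

```ruby
# Remembers the result of a lookup per key so that each key is resolved once.
class RecordCache
  def initialize(&finder)
    # The Hash default block runs only on a cache miss and stores the result.
    @records = Hash.new { |cache, key| cache[key] = finder.call(key) }
  end

  def fetch(key)
    @records[key]
  end
end

# Usage inside the scraping loop (assumes an ActiveRecord Author model):
#   authors = RecordCache.new { |name| Author.find_or_create_by(name: name) }
#   author = authors.fetch(author_name)
```

The cache lives only as long as the object, so creating a new one per run keeps stale records from leaking between runs.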

You use selectors just like in jQuery. Once the page has been downloaded, you work with your own copy of the HTML response, so you needn't worry much about the speed of further operations: it's nothing but text parsing.

Now let's move on to pages that can't be scraped this way and see how tools built for automated acceptance testing help.

Using Capybara

In case your desired HTML is rendered by a JavaScript engine instead of being served in the initial response, you should imitate browser requests by using some of the tools made for automated acceptance testing.

The Capybara gem is really popular in the Rails community, which isn't surprising given that it can work with a number of browser drivers. You may pick any of them, but in our example, we use headless Chrome.

If you're using Rails, you probably already have the capybara, selenium-webdriver, and webdrivers gems in the :test group of your Gemfile. If not, add these gems. Then you'll be able to configure Capybara to use headless Chrome as a driver:

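require 'capybara'
require 'selenium/webdriver'
# selenium-webdriver 4.11 and later manage the driver binary themselves;
# on those versions you should be able to drop this require and the gem.
require 'webdrivers/chromedriver'

Capybara.register_driver :headless_chrome do |app|
  # Selenium 4 takes browser flags through an Options object; the older
  # chromeOptions / desired_capabilities form was deprecated.
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  options.add_argument('--disable-gpu')

  Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
end

Capybara.default_driver = :headless_chrome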

And use it like this:

require 'open-uri'

session = Capybara::Session.new(:headless_chrome)
session.visit(url)

session.all('[id^=screenshot]').each do |shot_container|
  image_url = shot_container.all('.shot img').first['src']
  # Since Ruby 3.0, Kernel#open no longer fetches URLs; call URI.open explicitly.
  image_content = image_url.present? ? URI.open(image_url) : nil
  title = shot_container.all('.shot-title').first.text
  date_string = shot_container.all('.timestamp').first.text

  author_container = shot_container.all('.attribution-user').first
  author_name = author_container.all('[rel=contact]').first.text.strip
  author_image_url = author_container.all('img').first['src']
  author_image_content = author_image_url.present? ? URI.open(author_image_url) : nil

  # Scope the lookup before first_or_create; on the bare model it returns the
  # first row in the table. Matching on name and title is an assumption.
  author = Author.where(name: author_name).first_or_create(avatar: author_image_content)

  Screenshot.where(title: title, author: author).first_or_create(
    image: image_content,
    date: Date.strptime(date_string, '%B %d, %Y')
  )
end

The code is pretty much the same as with Nokogiri, but with slightly different syntax. Once again, you should always be ready for missing page elements: raise an exception, log the case quietly, or handle it in some other way. There is a lot of room for perfecting your future background tasks, but that goes beyond the scope of our article.
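As a starting point for handling absent elements, the sketch below wraps each extraction in a small helper that raises for required fields and logs optional ones. The extract_field helper and the MissingElementError class are hypothetical names introduced for this example; only Logger from the standard library and Nokogiri's at_css (in the commented usage) are real APIs.

```ruby
require 'logger'

# Raised when a field the scraper cannot work without is absent from the page.
class MissingElementError < StandardError; end

# Yields to the extraction block and normalizes "nothing found" results.
# Required fields raise; optional ones are logged and returned as nil.
def extract_field(name, logger:, required: false)
  value = yield
  value = value.strip if value.respond_to?(:strip)
  return value unless value.nil? || value == ''

  raise MissingElementError, "required field '#{name}' is missing" if required

  logger.warn("optional field '#{name}' is missing")
  nil
end

# Usage with Nokogiri (at_css returns nil when nothing matches):
#   logger = Logger.new($stdout)
#   title = extract_field(:title, logger: logger, required: true) do
#     shot_container.at_css('.shot-title')&.text
#   end
```

Rescuing MissingElementError around the body of the loop lets one broken item be skipped without aborting the whole run.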