Introducing web scraping

Simply put, web scraping is one of the tools developers use to gather and analyze information from the Internet.

Some websites and platforms offer application programming interfaces (APIs) which we can use to access information in a structured way, but others might not. While APIs are certainly becoming the standard way of interacting with today’s popular platforms, we don’t always have this luxury when interacting with most of the websites on the internet.

Rather than reading data from standard API responses, we’ll need to find the data ourselves by reading the website’s pages and feeds.

Some use cases of web scraping

The World Wide Web was born in 1989 and web scraping and crawling entered the conversation not long after in 1993.

Before scraping, search engines were compiled lists of links collected by the website administrator, and arranged into a long list of links somewhere on their website. The first web scraper and crawler, the World Wide Web Wanderer, were created to follow all these indexes and links to try and determine how big the internet was.

It wasn’t long after this that developers started using crawlers and scrapers to create crawler-based search engines that didn’t require human assistance. These crawlers would simply follow links that would come across each page and save information about the page. Since the web is a collaborative effort, the crawler could easily and infinitely follow embedded links on websites to other platforms, and the process would continue forever.

Nowadays, web scraping has its place in nearly every industry. In newsrooms, web scrapers are used to pull in information and trends from thousands of different internet platforms in real time.

Spending a little too much on Amazon this month? Websites exist that will let you know, and, in most cases, will do so by using web scraping to access that specific information on your behalf.

Machine learning and artificial intelligence companies are scraping billions of social media posts to better learn how we communicate online.

So how does it work?

The process a developer builds for web scraping looks a lot like the process a user takes with a browser:

A URL is given to the program.
The program downloads the response from the URL.
The program processes the downloaded file depending on data required.
The program starts over at with a new URL

The nitty gritty comes in steps 3 and, in which data is processed and the program determines how to continue (or if it should at all). For Google’s crawlers, step 3 likely includes collecting all URL links on the page so that the web scraper has a list of places to begin checking next. This is recursiveby design and allows Google to efficiently follow paths and discover new content.

There are many heavily used, well built libraries for reading and working with the downloaded HTML response. In the Ruby ecosystem Nokogiri is the standard for parsing HTML. For Python, BeautifulSoup has been the standard for 15 years. These libraries provide simple ways for us to interact with the HTML from our own programs.

These code libraries will accept the page source as text, and a parser for handling the content of the text. They’ll return helper functions and attributes which we can use to navigate through our HTML structure in predictable ways and find the values we’re looking to extract.

Scraping projects involve a good amount of time spent analyzing a web site’s HTML for classes or identifiers, which we can use to find information on the page. Using the HTML below we can begin to imagine a strategy to extract product information from the table below using the HTML elements with the classes products and product.

<table class="products">
  <tr class="product">...</tr>
  <tr class="product">...</tr>
</table>

In the wild, HTML isn’t always as pretty and predictable. Part of the web scraping process is learning about your data and where it lives on the pages as you go along. Some websites go to great lengths to prevent web scraping, some aren’t built with scraping in mind, and others just have complicated user interfaces which our crawlers will need to navigate through.

Robots.txt

While not an enforced standard, it’s been common since the early days of web scraping to check for the existence and contents of a robots.txt file on each site before scraping its content. This file can be used to define inclusion and exclusion rules that web scrapers and crawlers should follow while crawling the site. You can check out Facebook’s robots.txt file for a robust example: this file is always located at /robots.txt so that scrapers and crawlers can always look for it in the same spot. Additionally, GitHub’s robots.txt, and Twitter’s are good examples.

An example robots.txt file prohibits web scraping and crawling would look like the below:
User-agent: *
Disallow: /

The User-agent: * section is for all web scrapers and crawlers. In Facebook’s, we see that they set User-agent to be more explicit and have sections for Googlebot, Applebot, and others.

The Disallow: / line informs web scrapers and crawlers who observe the robots.txt file that they aren’t permitted to visit any pages on this site. Conversely, if this line read Allow: /, web scrapers and crawlers would be allowed to visit any page on the website.

The robots.txt file can also be a good place to learn information about the website’s architecture and structure. Reading where our scraping tools are allowed to go – and not allowed to go – can help inform us on sections of the website we perhaps didn’t know existed, or may not have thought to look at.

If you’re running a website or platform it’s important to know that this file isn’t always respected by every web crawler and scraper. Larger properties like Google, Facebook, and Twitter respect these guidelines with their crawlers and information scrapers, but since robots.txt is considered a best practice rather than an enforceable standard, you may see different results from different parties. It’s also important not to disclose private information which you wouldn’t want to become public knowledge, like an admin panel on /admin or something like that.

This post is originally published on Kite.com: https://kite.com/blog/python/what-is-web-scraping/