When I was finishing my Ph.D., I started writing a blog post about the basics of web bots and bot detection. It has since grown relatively long, but I haven't found the time to finish it yet. That's why I've decided to split it into multiple blog posts.
The first blog post of this series starts with a presentation of the different categories of web bots. In the next blog posts, we’ll discuss bot detection techniques and their resilience against the different bot categories.
What is a web bot?
A web bot is a program that automates tasks on the web. The main reason to use bots instead of performing these tasks manually is to achieve greater throughput while decreasing costs. Web bots are used for applications ranging from automated website testing to less ethical tasks, such as ad fraud or credential stuffing. Another common use case for web bots is crawling, the task of automatically gathering website content.
In this series of blog posts, we focus on crawlers, i.e. bots used to automatically gather content. Nevertheless, most of the concepts presented carry over to bots used for other purposes.
Disclaimer: Respect the robots.txt policy of the websites you want to crawl. Moreover, even if a website allows crawling, use rate-limiting to reduce the impact of your crawler on the site.
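As a small illustration (not part of the original post), the snippet below uses Python's built-in urllib.robotparser to check whether a given URL may be fetched; the "my-crawler" user agent string is a placeholder you'd replace with your own bot's name.

import time
import urllib.robotparser

# Parse the site's robots.txt policy before crawling.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://arh.antoinevastel.com/robots.txt")
robots.read()

# "my-crawler" is a placeholder user agent for this sketch.
url = "https://arh.antoinevastel.com/bots/areyouheadless"
if robots.can_fetch("my-crawler", url):
    print("Allowed to crawl:", url)
    time.sleep(1)  # naive rate-limiting: pause between requests
else:
    print("Disallowed by robots.txt:", url)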
Categories of bots
We can sort bots into three main categories depending on their technological stack:
- Simple HTTP request libraries;
- Real browsers instrumented with automation frameworks;
- Headless browsers.
Each category has its pros and cons depending on the use case and the kind of website crawled (single-page app, static content).
Simple HTTP request libraries (bots 1)
The snippet below shows an example of a bot that uses Python's urllib.request module together with BeautifulSoup to get the title of one of the pages of my website.
import urllib.request
from bs4 import BeautifulSoup

url = "https://arh.antoinevastel.com/bots/areyouheadless"
# Fetch the raw HTML over HTTP; no JavaScript is executed.
page = urllib.request.urlopen(url)
# Parse the HTML and extract the first <h1> tag.
soup = BeautifulSoup(page, features="html5lib")
h1 = soup.find("h1")
print(h1) # <h1>Are you chrome headless?</h1>
Real browsers (Chrome, Firefox) instrumented with Selenium/Puppeteer (bots 2)
Selenium and Puppeteer are browser automation frameworks. They make it possible to automate tasks in a browser from code, such as loading a page, moving the mouse, or typing text. While Puppeteer is mostly used to automate Chrome and Headless Chrome from NodeJS, Selenium can drive a wide range of browsers, such as Firefox, Chrome, and Safari, from different programming languages (NodeJS, Python, Java).
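As a sketch of what this looks like in practice, here is the same title-extraction task driven by a real Firefox instance through Selenium in Python, assuming Firefox and geckodriver are installed and on your PATH.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real, visible Firefox instance controlled by Selenium.
driver = webdriver.Firefox()
try:
    driver.get("https://arh.antoinevastel.com/bots/areyouheadless")
    # Unlike an HTTP library, the browser executes JavaScript before we query the DOM.
    h1 = driver.find_element(By.TAG_NAME, "h1")
    print(h1.text)
finally:
    driver.quit()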
Headless browsers (bots 3)
In 2017, Google released Headless Chrome, the headless version of Google Chrome, which led PhantomJS to stop being maintained. Headless Chrome is actively maintained and can be instrumented using the low-level Chrome DevTools Protocol (CDP). However, since writing all tasks directly with CDP can be cumbersome, several instrumentation frameworks, such as Puppeteer, have been built on top of it. Since Headless Chrome supports almost all of the features of a normal Chrome browser, the pages rendered for such bots tend to be quite similar to what a human would see. Thus, it has become the go-to solution for a wide range of applications ranging from automated testing to credential stuffing and crawling. In terms of CPU and RAM overhead, Headless Chrome sits between bots of category 1, based on HTTP request libraries, and bots of category 2, based on instrumented full browsers.
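As a minimal sketch, assuming Chrome and a matching chromedriver are installed, the same task can be run against Headless Chrome through Selenium by passing the --headless flag when launching the browser.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without a visible window; --headless is the standard Chrome flag.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://arh.antoinevastel.com/bots/areyouheadless")
    h1 = driver.find_element(By.TAG_NAME, "h1")
    print(h1.text)
finally:
    driver.quit()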
In the next blog post (coming soon, hopefully), we'll present how the different categories of bots presented in this blog post can be detected.