What are web crawlers, spiders, robots?

Web crawlers or web spiders are automated computer programs that act like browsers and automatically fetch web pages to analyze them. They are frequently used by search engines to collect and index content, so it’s useful to understand how they work and how to control them.

Most crawlers load only HTML pages, although some also collect images (like Google Image Search or DuckDuckGo Image Search) or specific pages of interest. They put all the content into searchable archives that can later be retrieved by searching for various keywords.

What are bad robots?

In recent years, about 40% of all internet traffic has come from robots, and more than half of that is generated by bad robots with malicious intentions. They often crawl the web and collect information without permission to achieve one of the following:

  • collect pricing information, catalogs, breaking news
  • scan for security vulnerabilities
  • detect typical installations of remote administration tools (e.g. Web MySQL Administrator)
  • find unprotected pages with sensitive information, data theft

They often scrape websites and automatically publish the content elsewhere to gain a competitive advantage, without giving credit to the original source. Search engines generally penalize and devalue duplicate content, so whoever gets into the index first can harm the rest of the pages containing the same content.

With the recent shortage of computer hardware, scalpers have built their whole business model on crawling pricing sites for competitive advantage. Price scalping is the illegal practice of buying up the entire stock of an item (in this case: computer video cards) and reselling it at a much higher price.

Resource costs

Just like spam, robots crawling web pages increase the infrastructure costs of the legitimate operators who provide web services. Most legitimate robots try to be considerate and not hammer servers with requests, but it’s easy to see how they can consume limited resources, even making pages unavailable for periods of time due to excessive usage of

  • bandwidth
  • CPU costs on the server
  • other server resources

Bad robots often completely ignore the unwritten (and written) rules of crawling and load web pages as quickly as possible.

The legality of crawling

Disclaimer: I’m not a lawyer, this is my personal opinion published for informational purposes only and not for the purpose of providing legal advice.

The crawling of publicly available information is generally considered legal – especially for personal purposes, falling under fair use.

On the other hand, hammering a server with requests to quickly fetch webpages can cause service outages and all sorts of (financial) damage that _may_ be grounds for legal action. The purpose of crawling also matters: there is precedent of website owners suing a crawler operator for using their data for competitive advantage.

With all this in mind, it’s a grey area: depending on the terms of service of the original website and the nature of the content (especially copyrighted works or password-protected data), crawling may violate various laws (the DMCA, the Computer Fraud and Abuse Act, trespassing laws, etc.). It’s safer to ask for permission first and always honor the website’s terms and its robots.txt exclusions.

The role of TOR in hiding bad robots

TOR is a free service that provides anonymity by hiding its users’ original IP addresses and showing the address of a TOR exit node instead. It also hosts dark web services, but the relevant part here is that requests routed through TOR can’t be traced back to the real client IP address. You can find more information on TOR here: https://www.torproject.org.

Malicious robot operators exploit this by routing their crawling attempts through the TOR network. While there are legitimate uses of TOR (for example accessing web services where they are banned, using the internet in countries with oppressive regimes, or simply increasing privacy), some website operators block TOR altogether to stop bad robots and hacking attempts in their early stages.

The TOR network publishes a frequently updated list of its outgoing IP addresses and by banning those, it’s possible to block all the requests coming from TOR. This is especially useful for non-public parts of websites, for example an administrative area. The exit list is provided here: https://check.torproject.org/torbulkexitlist.

It contains about 1200 addresses; with a simple shell script or even a quick search & replace, it’s easy to convert it into a .htaccess snippet that blocks all incoming requests from these IP addresses.

wget -qO- 'https://check.torproject.org/torbulkexitlist' | sed 's/^/deny from /'
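
A minimal sketch of what the resulting .htaccess section might look like, assuming the old-style access directives (provided by mod_access_compat on Apache 2.4); the addresses shown are placeholders from the documentation range, to be replaced by the generated list:

order allow,deny
allow from all
deny from 192.0.2.1
deny from 192.0.2.2
# ... one "deny from" line per address in the exit list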

You can read more about controlling access with Apache here: Protect files against unwanted access with Apache.

Controlling robots with robots.txt

Robots.txt is the standardized way to control and block web crawler behavior. While it’s safe to assume that legitimate crawling services honor robots.txt settings, bad robots often ignore it completely.

A typical robots.txt file looks like this:

User-agent: *
Allow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

User-agent: Bingbot
User-agent: Googlebot
Disallow: /private/

Every time a crawler opens your website, it first downloads this file to check for exclusion instructions. The “User-agent” line tells which crawlers the following lines apply to, and the Allow and Disallow statements list folders or URLs that are either allowed or disallowed to be crawled.

It’s important to know that these rules are purely advisory and it’s at the discretion of the robot operator to follow them.

The user agent is an HTTP header sent with every web request. Robot operators set it to something descriptive that can be used to identify them. For example, wget (used in the example above) sets the user agent to “Wget/version”, but it’s a matter of simply adding the --user-agent="somethingelse" parameter to change it.
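
As a quick illustration of both behaviors (example.com is a placeholder URL):

# by default, wget identifies itself with a "Wget/<version>" user agent
wget -qO- https://example.com/
# the --user-agent option replaces it with an arbitrary string
wget -qO- --user-agent="somethingelse" https://example.com/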

It is very important to have a proper robots.txt because, for example, completely blocking all robots (or just Googlebot) from crawling your website can have serious SEO consequences, while doing nothing to stop bad robots.
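
For reference, this is the rule that blocks every compliant crawler from the entire site; it is usually a mistake rather than a protection:

User-agent: *
Disallow: /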

Google provides a robots.txt tester to verify settings here: https://support.google.com/webmasters/answer/6062598?hl=en.

Block robots by checking their user agent

Web servers provide an easy way to block any request that matches a specific user-agent string. Bad robots often try to hide their activity by impersonating legitimate user agents, but the rest can be blocked by simply looking for their user-agent string and rejecting it at the webserver level.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^bad\ robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^badrobot2 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} www\.badbot\.com [NC]
RewriteRule ^.* - [F,L]

This code block checks the user agent against each of the “RewriteCond” directives and blocks every request that matches any of them. The user-agent patterns are regular expressions, so there are a few rules to follow here:

  • the character ^ at the beginning means “start of string”, so “^hello” means “starts with hello”
  • spaces should be escaped by adding a backslash (\) before them
  • dots should be escaped with a backslash; without a backslash, a dot means “any character”

The [NC] flag makes all matches case insensitive, so lower/upper case doesn’t matter.
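
If mod_rewrite is not available, a similar block can be sketched with mod_setenvif instead; the snippet below is only an assumed equivalent of the example rules above (same placeholder user-agent strings) and uses Apache 2.4 authorization syntax:

SetEnvIfNoCase User-Agent "^bad robot" bad_bot
SetEnvIfNoCase User-Agent "^badrobot2" bad_bot
SetEnvIfNoCase User-Agent "www\.badbot\.com" bad_bot
# allow everyone except requests tagged as bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>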

It’s possible to roughly test these changes by temporarily setting the blocked user agent to “mozilla” in .htaccess (then undoing it once testing is done): that should block most browsers, including yours, so if you get a “403 Forbidden” error, the blocking is working. The same test is possible by running wget with one of the blocked user agents and verifying the error code.

wget --user-agent="badrobots" https://test-your-url/whatever.html
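
A curl equivalent that only prints the HTTP status code (assuming “badrobots” matches one of the patterns you actually block, and using the same placeholder URL):

# prints 403 when the block works, 200 (or another success code) otherwise
curl -s -o /dev/null -w "%{http_code}\n" --user-agent "badrobots" https://test-your-url/whatever.html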

As a rule of thumb, always make sure that you don’t ban legitimate user agents; test with multiple browsers after setting this up to verify availability.
