Detecting and optimising your site for bot traffic

Bots come in all shapes and sizes. Many are legitimate, such as search engine bots (or spiders), which crawl your website so its pages can be included in search engines like Google. There are plenty of illegitimate bots too – one common issue is referral spam, which advertises sites by making them show up in your web analytics, consuming your resources and usage limits in the process.

Detecting bots

Legitimate bots

Legitimate bots will send a User-Agent which indicates that they’re a bot. Most of these include the word bot, such as Googlebot, but there’s a long list of rules and exceptions. One frustrating example is Cubot, which is often detected as a bot but isn’t one at all – it’s a mobile phone manufacturer.

For those reasons, it’s best to use a library to detect bots. Most major programming languages have one available – for Node.js there’s a library named [isbot](https://github.com/omrilotan/isbot), and for PHP there’s one named [device-detector](https://github.com/matomo-org/device-detector).

Usage of isbot is simple:

```js
const isbot = require('isbot')

isbot('Googlebot/2.1 (+http://www.google.com/bot.html)') // true
isbot('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36') // false
```

Illegitimate bots

Referral spam bots, and other illegitimate bots, will try to disguise themselves as normal visitors to your website. For that reason, they won’t set a recognisable User-Agent, and it’s impossible to reliably detect all of them.

Luckily, there’s an open-source list of known referral spammers. We can compare the Referer header against this list and decide how to handle the request:

```js
const fs = require('fs')

// referral-spammers.txt downloaded from https://github.com/matomo-org/referrer-spam-list/blob/master/spammers.txt
const spammerList = fs.readFileSync('./referral-spammers.txt')
  .toString()
  .split('\n') // each spammer is on a new line
  .filter(Boolean) // filter out empty lines

const isSpammer = referer => {
  // Boolean() ensures we always return true/false, not a falsy referer value
  return Boolean(referer) && spammerList.some(spammer => referer.includes(spammer))
}

isSpammer('https://casino-top3.ru') // true
isSpammer('https://ipdata.co') // false
```
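Note that substring matching can misfire – a perfectly legitimate URL might contain a spammer’s domain somewhere in its query string or path. If you want stricter matching, one option (a hypothetical sketch, assuming the list contains bare hostnames like `casino-top3.ru`) is to parse the Referer with the WHATWG URL API and compare hostnames instead:

```js
// Stricter variant: compare the Referer's hostname against the list rather
// than doing a substring search. Hypothetical sketch; `spammerList` is
// inlined here, but you'd load it from the file as above.
const spammerList = ['casino-top3.ru', 'free-seo-traffic.example']

const isSpammerStrict = referer => {
  if (!referer) return false
  let hostname
  try {
    hostname = new URL(referer).hostname
  } catch {
    return false // not a valid URL - treat it as not a known spammer
  }
  // Match the exact host, or any subdomain of a listed spammer
  return spammerList.some(
    spammer => hostname === spammer || hostname.endsWith(`.${spammer}`)
  )
}

isSpammerStrict('https://casino-top3.ru/promo') // true
isSpammerStrict('https://www.casino-top3.ru') // true - subdomain matches
isSpammerStrict('https://ipdata.co/?q=casino-top3.ru') // false - only the hostname is checked
```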

Bots which aren’t referral spammers are much more difficult to detect. It’s best to rely on threat data to ensure you’re able to block requests from IPs which have been flagged as malicious. ipdata’s threat API can help do exactly this. Additionally, ipdata’s ASN API can detect IP addresses which are associated with hosting providers, such as AWS. Hosting IPs are often used for bots and hacking attempts, but they could also be legitimate proxies. Here’s a small script, which we’ll use later, to detect hosting providers and threats.

```js
const axios = require('axios')

// Get an ipdata API Key from here: https://ipdata.co/sign-up.html
const IPDATA_API_KEY = 'test'

const getIpData = async ip => {
  const response = await axios.get(
    `https://api.ipdata.co/${ip}?api-key=${IPDATA_API_KEY}`
  )
  return response.data
}

// Let's test it out! (await is only valid inside an async function)
const main = async () => {
  const microsoftIpData = await getIpData('13.107.6.152')
  microsoftIpData.threat.is_threat // false - this IP is not associated with threats
  microsoftIpData.asn.type // "hosting" - this IP is a Microsoft hosting IP
}

main()
```
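Calling the API on every request adds latency and consumes your quota, since the same visitor IP will often appear many times. A simple improvement – a hypothetical sketch, not part of ipdata’s client – is to memoise lookups per IP. The fetcher is injected as a parameter, so it’s easy to swap in the `getIpData` function above (or a stub for testing):

```js
// Hypothetical per-IP cache in front of an IP-data fetcher, so repeat
// visitors only cost one API call. `fetcher` is any function mapping an
// IP to a promise of ipdata-shaped data (e.g. getIpData above).
const makeCachedLookup = fetcher => {
  const cache = new Map()
  return ip => {
    if (!cache.has(ip)) {
      // Cache the promise itself, so concurrent requests share one call
      cache.set(ip, fetcher(ip))
    }
    return cache.get(ip)
  }
}

// Demonstration with a stub fetcher, showing each IP is fetched only once
let calls = 0
const lookup = makeCachedLookup(async ip => {
  calls++
  return { ip, threat: { is_threat: false }, asn: { type: 'hosting' } }
})
```

In production you’d likely also add an expiry, since threat data changes over time.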

Responding to bot traffic

Now that we’ve detected the majority of our bot traffic, we need to decide what to do with it!

Excluding bots from analytics

Most people want to exclude all bot traffic from their web analytics tools. Many tools have built-in methods for excluding bots, but if you wish to control this yourself, you can conditionally exclude the tracking code. This helps prevent bots from consuming your usage limits, and ensures consistency across different tools.

In Node.js with Express and EJS, we can do this with just a few lines of code.

**index.js**

```js
const isbot = require('isbot')
const isSpammer = require('./is-spammer') // isSpammer function defined above
const getIpData = require('./get-ip-data') // getIpData function defined above
const express = require('express')
const app = express()

app.set('view engine', 'ejs')

app.get('/', async (req, res) => {
  const ipdata = await getIpData(req.socket.remoteAddress)
  if (ipdata.threat.is_threat) {
    // Block threats altogether!
    res.status(403).send('You are not allowed to access this site')
    return
  }
  const includeAnalytics =
    !isbot(req.get('user-agent')) &&
    !isSpammer(req.get('referer')) &&
    ipdata.asn.type !== 'hosting'

  res.render('./index.ejs', { includeAnalytics })
})

app.listen(8000)
```

**views/index.ejs**

```html
<html>
  <head>
    <% if (includeAnalytics) { %>
      <script>
        // Google Analytics here
        // Other analytics scripts too..
      </script>
    <% } %>
  </head>
  <body>
    Hello!
  </body>
</html>
```

Now, we can test our implementation using curl.

Browser with no referrer

```
curl -H "user-agent: A-Browser" http://localhost:8000
<html>
  <head>
    <script>
      // Google Analytics here
      // Other analytics scripts too..
    </script>
  </head>
  <body>
    Hello!
  </body>
</html>
```

Bot user-agent

```
curl -H "user-agent: bot" http://localhost:8000
<html>
  <head>
  </head>
  <body>
    Hello!
  </body>
</html>
```

Referral spammer

```
curl -H "user-agent: A-Browser" -H "referer: https://casino-top3.ru" http://localhost:8000
<html>
  <head>
  </head>
  <body>
    Hello!
  </body>
</html>
```

As you can see, a request from a normal browser is served the analytics scripts, but bots and referral spammers are excluded.

This can easily be adapted to block bots or referral spammers from accessing your website entirely – simply treat them like threats (as in the `is_threat` check in index.js) – but this should be unnecessary for most sites.
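If you do want to block them, the checks could be pulled into an Express middleware. This is a hypothetical sketch – `makeBotBlocker` is not a real package, and the bot and spammer checks are injected as plain predicate functions (in practice you’d pass in `isbot` and the `isSpammer` function from earlier):

```js
// Hypothetical middleware factory: responds with 403 to any request that
// looks like a bot or a referral spammer, using injected predicates.
const makeBotBlocker = ({ isBot, isSpammer }) => (req, res, next) => {
  if (isBot(req.get('user-agent')) || isSpammer(req.get('referer'))) {
    res.status(403).send('You are not allowed to access this site')
    return
  }
  next() // a normal visitor - continue to the route handler
}

// Wired into the app from earlier, it would look roughly like:
// app.use(makeBotBlocker({ isBot: isbot, isSpammer }))
```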

Serving a Search Engine Optimised version of your site

Serving a different version of your website to crawlers can significantly improve SEO. Modern websites, especially single-page applications, can be hard for crawlers to read and index, so we can serve them a simplified version. Take care that the content you serve crawlers is equivalent to what users see – serving substantially different content (known as cloaking) can be penalised by search engines.

The approach is very similar to excluding bots from analytics.

**index.js**

```js
const isbot = require('isbot')
const express = require('express')
const app = express()

app.set('view engine', 'ejs')

app.get('/', (req, res) => {
  if (isbot(req.get('user-agent'))) {
    res.render('./seo.ejs')
    return
  }
  res.render('./index.ejs')
})

app.listen(8000)
```

**views/seo.ejs**

```html
<html>
  <body>
    Hello search engine crawlers + bots!
  </body>
</html>
```

**views/index.ejs**

```html
<html>
  <body>
    Hello normal visitors
  </body>
</html>
```

Again, we can test our implementation with curl.

Browser

```
curl -H "user-agent: A-Browser" http://localhost:8000
<html>
  <body>
    Hello normal visitors
  </body>
</html>
```

Bot

```
curl -H "user-agent: bot" http://localhost:8000
<html>
  <body>
    Hello search engine crawlers + bots!
  </body>
</html>
```

To ensure high performance, which can also improve your search rankings, you may want to use an edge computing solution, such as Lambda@Edge or Cloudflare Workers. For detailed guides on edge computing, check out our posts: Using Cloudflare Workers to redirect users based on country and Redirect users based on country with Lambda@Edge.
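To give a flavour of what that looks like, here’s a hypothetical sketch of the routing decision an edge function would make. Everything here is an assumption for illustration: the origin hostnames are placeholders, and the regex is a simplistic stand-in for isbot (which you’d bundle into a Worker with a tool like wrangler):

```js
// Simplistic stand-in for isbot - don't rely on this pattern in production
const BOT_UA_PATTERN = /bot|crawl|spider|slurp/i

// Decide which origin should serve the request, based on the User-Agent.
// Both hostnames are placeholders, not real services.
const pickOrigin = userAgent =>
  BOT_UA_PATTERN.test(userAgent || '')
    ? 'https://seo.example.com' // pre-rendered, crawler-friendly site
    : 'https://app.example.com' // the full single-page application

// Inside a Cloudflare Worker (module syntax), this would be used roughly as:
// export default {
//   async fetch(request) {
//     const origin = pickOrigin(request.headers.get('user-agent'))
//     return fetch(new URL(new URL(request.url).pathname, origin))
//   }
// }
```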

Conclusions

Detecting and adapting your website for bot traffic can be incredibly valuable. Simply excluding bots from your analytics can reduce your usage costs and keep your data clean – bot traffic can make your conversion rates appear to drop, distracting you from focusing on real customers. Serving a site dedicated to SEO won’t often be required, as search crawlers are able to index most websites, but for highly complex single-page applications it can have a huge impact on your search rankings.

Let us know your bot-detection use-cases in the comments below!