Bot Detection¶

Implementations used for bot detection.

botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: Config) → IPv4Network | IPv6Network[source]¶: Returns the (client) network that real_ip is part of.
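As a rough illustration of what "client network" means here, the sketch below masks an address with a prefix length using the standard ipaddress module. The helper name network_of and the prefix lengths (/32 for IPv4, /48 for IPv6) are assumptions made for this example; get_network presumably reads its own settings from the cfg argument.

```python
from ipaddress import ip_address, ip_network


def network_of(real_ip, ipv4_prefix: int = 32, ipv6_prefix: int = 48):
    """Return the network that contains real_ip (hypothetical helper).

    The default prefix lengths are assumptions for this sketch, not values
    taken from the botdetection configuration.
    """
    prefix = ipv4_prefix if real_ip.version == 4 else ipv6_prefix
    # strict=False lets ip_network() zero the host bits instead of raising.
    return ip_network(f"{real_ip}/{prefix}", strict=False)


if __name__ == "__main__":
    print(network_of(ip_address("192.0.2.7")))
    print(network_of(ip_address("2001:db8:1:2::1")))
```

Grouping IPv6 clients by a shorter prefix is a common choice because a single client often controls a whole IPv6 block.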

botdetection.get_real_ip(request: Request) → str[source]¶

Returns the real IP of the request. Since not all proxies set all the HTTP headers, and incoming headers can be faked, it may happen that the IP cannot be determined correctly.

This function tries to get the remote IP in the order listed below. Additionally, some tests are done, and any inconsistencies or errors that are detected are logged.

The remote IP of the request is taken from (first match):

botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) → Response | None[source]¶: Returns an HTTP 429 response object and writes an ERROR message to the ‘botdetection’ logger. This function is used in part by the filter methods to return the default Too Many Requests response.

IP lists ¶

Method `ip_lists`¶

The ip_lists method implements IP block- and pass-lists.

Config¶

[botdetection.ip_lists]

pass_ip = [
   '140.238.172.132', # IPv4 of check.searx.space
   '192.168.0.0/16',  # IPv4 private network
   'fe80::/10'        # IPv6 link-local
]
block_ip = [
   '93.184.216.34', # IPv4 of example.org
   '257.1.1.1',     # invalid IP --> will be ignored, logged in ERROR class
]

Implementations¶

botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) → Tuple[bool, str][source]¶: Checks whether real_ip lies in a subnet given by one of the members of the botdetection.ip_lists.block_ip list.

botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) → Tuple[bool, str][source]¶: Checks whether real_ip lies in a subnet given by one of the members of the botdetection.ip_lists.pass_ip list.
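The membership test behind both lists can be sketched with the standard ipaddress module. This is a minimal illustration, not the module's code: the helper name ip_in_list and the wording of the returned messages are assumptions. It mirrors the behaviour described in the config comments, where an invalid entry such as '257.1.1.1' is logged as an error and ignored.

```python
import logging
from ipaddress import ip_address, ip_network

logger = logging.getLogger("botdetection")


def ip_in_list(real_ip, entries, list_name):
    """Return (matched, message) for real_ip against a list of IPs/networks.

    Hypothetical helper: single addresses are treated as /32 or /128 networks.
    """
    for entry in entries:
        try:
            net = ip_network(entry, strict=False)
        except ValueError:
            # Invalid entries are reported and skipped, not fatal.
            logger.error("invalid IP %s in %s", entry, list_name)
            continue
        if real_ip.version == net.version and real_ip in net:
            return True, f"IP matches {net.compressed} in {list_name}."
        # No match for this entry: keep looking.
    return False, f"IP is not a member of an item in the {list_name} list."


if __name__ == "__main__":
    pass_list = ["140.238.172.132", "192.168.0.0/16", "fe80::/10"]
    print(ip_in_list(ip_address("192.168.1.5"), pass_list, "pass_ip"))
```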

Rate limit ¶

Method `ip_limit`¶

The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a Redis DB and needs an HTTP X-Forwarded-For header. To protect privacy, only the hash value of an IP is stored in the Redis DB, and for no longer than 10 minutes.
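The counting idea can be sketched without Redis. The example below is an in-memory stand-in under stated assumptions: the class name SlidingWindow, the use of SHA-256, and the explicit now argument are choices made for this illustration, and the real method stores its counters in the Redis DB. The two constants are the burst values documented for botdetection.ip_limit.

```python
import hashlib
from collections import defaultdict, deque

BURST_WINDOW = 20  # seconds
BURST_MAX = 15     # requests allowed per window


class SlidingWindow:
    """In-memory stand-in for a DB-backed sliding window (illustration only)."""

    def __init__(self, window: int):
        self.window = window
        self._hits = defaultdict(deque)  # hashed IP -> request timestamps

    @staticmethod
    def _key(ip: str) -> str:
        # Only a hash of the IP is kept, never the address itself.
        return hashlib.sha256(ip.encode("utf-8")).hexdigest()

    def incr(self, ip: str, now: float) -> int:
        """Record one request and return the count inside the window."""
        hits = self._hits[self._key(ip)]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        return len(hits)


def is_bot_request(counter: SlidingWindow, ip: str, now: float,
                   limit: int = BURST_MAX) -> bool:
    """True if this request exceeds the limit for the current window."""
    return counter.incr(ip, now) > limit


if __name__ == "__main__":
    burst = SlidingWindow(BURST_WINDOW)
    for second in range(17):
        print(second, is_bot_request(burst, "192.0.2.7", float(second)))
```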

The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method add the following configuration:

[botdetection.ip_limit]
link_token = true

If the link_token method is activated and a request is suspicious, the rate limits for that request are reduced:

BURST_MAX -> BURST_MAX_SUSPICIOUS
LONG_MAX -> LONG_MAX_SUSPICIOUS

To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.
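The interplay of the reduced limits and the suspicious-IP counter can be sketched as follows. This is a simplified illustration: the class name SuspicionTracker is hypothetical, the counter lives in a plain dict, and the expiry after SUSPICIOUS_IP_WINDOW seconds is left out. The constants are the values documented for botdetection.ip_limit.

```python
BURST_MAX = 15
BURST_MAX_SUSPICIOUS = 2
LONG_MAX = 150
LONG_MAX_SUSPICIOUS = 10
SUSPICIOUS_IP_MAX = 3


class SuspicionTracker:
    """Tracks how often each client was rated suspicious (illustration only)."""

    def __init__(self):
        self._count = {}  # client -> number of suspicious requests

    def limits_for(self, client: str, suspicious: bool):
        """Return (burst_max, long_max), or None if the client is blocked."""
        if not suspicious:
            # A non-suspicious request drops the window for this client.
            self._count.pop(client, None)
            return BURST_MAX, LONG_MAX
        self._count[client] = self._count.get(client, 0) + 1
        if self._count[client] > SUSPICIOUS_IP_MAX:
            return None
        return BURST_MAX_SUSPICIOUS, LONG_MAX_SUSPICIOUS


if __name__ == "__main__":
    tracker = SuspicionTracker()
    for flag in (True, True, True, True, False, True):
        print(flag, tracker.limits_for("client-a", flag))
```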

Config¶

[botdetection.ip_limit]

# To get unlimited access in a local network, by default link-local addresses
# (networks) are not monitored by the ip_limit
filter_link_local = false

# activate link_token method in the ip_limit method
link_token = false

Implementations¶

botdetection.ip_limit.API_MAX = 4¶: Maximum requests from one IP in the API_WINDOW

botdetection.ip_limit.API_WINDOW = 3600¶: Time (sec) before sliding window for API requests (format != html) expires.

botdetection.ip_limit.BURST_MAX = 15¶: Maximum requests from one IP in the BURST_WINDOW

botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2¶: Maximum of suspicious requests from one IP in the BURST_WINDOW

botdetection.ip_limit.BURST_WINDOW = 20¶: Time (sec) before sliding window for burst requests expires.

botdetection.ip_limit.LONG_MAX = 150¶: Maximum requests from one IP in the LONG_WINDOW

botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10¶: Maximum suspicious requests from one IP in the LONG_WINDOW

botdetection.ip_limit.LONG_WINDOW = 600¶: Time (sec) before the longer sliding window expires.

botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3¶: Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.

botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000¶: Time (sec) before sliding window for one suspicious IP expires.

Method `link_token`¶

The link_token method evaluates a request as suspicious if the URL /client<token>.css is not requested by the client. Because the URL contains a random component (the token), a bot cannot send a ping by requesting a static URL.

Note

This method requires a Redis DB and needs an HTTP X-Forwarded-For header.

To use this method, a Flask URL route needs to be added:

@app.route('/client<token>.css', methods=['GET', 'POST'])
def client_token(token=None):
    link_token.ping(request, token)
    return Response('', mimetype='text/css')

In the Flask HTML template, a stylesheet link is also needed (the value of link_token comes from get_token):

<link rel="stylesheet"
      href="{{ url_for('client_token', token=link_token) }}"
      type="text/css" />

Config¶

[botdetection.link_token]
# Lifetime (sec) of limiter's CSS token.
TOKEN_LIVE_TIME = 600

# Lifetime (sec) of the ping-key from a client (request)
PING_LIVE_TIME = 3600

# Prefix of all ping-keys generated by link_token.get_ping_key
PING_KEY = 'botdetection.link_token.PING_KEY'

# Key for which the current token is stored in the DB
TOKEN_KEY = 'botdetection.link_token.TOKEN_KEY'

Implementations¶

botdetection.link_token.get_ping_key(network: IPv4Network | IPv6Network, request: Request) → str[source]¶: Generates a hashed key that corresponds (more or less) to a web browser session in a network.
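One way to build such a key is to hash the network together with request attributes that stay stable during a browser session. The sketch below is an assumption-laden illustration: this section does not state which attributes are hashed, so the choice of the Accept-Language and User-Agent headers, the SHA-256 hash, and the helper name ping_key_sketch are all hypothetical.

```python
import hashlib
from ipaddress import ip_network

PING_KEY = "botdetection.link_token.PING_KEY"  # documented key prefix


def ping_key_sketch(network, headers: dict) -> str:
    """Build a prefixed, hashed key from a network and selected headers.

    Which headers are used is an assumption made for this example.
    """
    material = "|".join((
        network.compressed,
        headers.get("Accept-Language", ""),
        headers.get("User-Agent", ""),
    ))
    return PING_KEY + hashlib.sha256(material.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    net = ip_network("192.0.2.0/24")
    print(ping_key_sketch(net, {"User-Agent": "Mozilla/5.0", "Accept-Language": "en"}))
```

Two browsers in the same network with different headers get different keys, which is why the key only fits a session "more or less".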

botdetection.link_token.get_token() → str[source]¶

Returns the current token. If there is no currently active token, a new token is generated randomly and stored in the Redis DB.

Config:

TOKEN_LIVE_TIME
TOKEN_KEY

botdetection.link_token.is_suspicious(network: IPv4Network | IPv6Network, request: Request, renew: bool = False)[source]¶: Checks whether a valid ping exists for this (client) network; if not, the request is rated as suspicious. If a valid ping exists and the argument renew is True, the expiry time of this ping is reset to PING_LIVE_TIME.

botdetection.link_token.ping(request: Request, token: str)[source]¶: This function is called by a request to the URL /client<token>.css. If token is valid, a PING_KEY for the client is stored in the DB. The expiry time of this ping-key is PING_LIVE_TIME.

botdetection.link_token.PING_KEY = 'botdetection.link_token.PING_KEY'¶: Prefix of all ping-keys generated by get_ping_key

botdetection.link_token.TOKEN_KEY = 'botdetection.link_token.TOKEN_KEY'¶: Key for which the current token is stored in the DB
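How get_token, ping and is_suspicious fit together can be sketched with an in-memory store. This is a simplified model, not the module's implementation: the class name LinkTokenSketch, the token format, and the explicit now arguments are assumptions for this example, and the real functions keep their state in the Redis DB and derive the ping key from the request.

```python
import secrets
import string

TOKEN_LIVE_TIME = 600   # seconds, from the documented config
PING_LIVE_TIME = 3600   # seconds, from the documented config
_ALPHABET = string.ascii_letters + string.digits


class LinkTokenSketch:
    """In-memory model of the token/ping life cycle (illustration only)."""

    def __init__(self):
        self._token = None
        self._token_expires = 0.0
        self._pings = {}  # ping key -> expiry timestamp

    def get_token(self, now: float) -> str:
        """Return the active token, creating a new random one if expired."""
        if self._token is None or now >= self._token_expires:
            self._token = "".join(secrets.choice(_ALPHABET) for _ in range(16))
            self._token_expires = now + TOKEN_LIVE_TIME
        return self._token

    def ping(self, ping_key: str, token: str, now: float) -> None:
        """Store a ping for the client if the requested token is valid."""
        if token == self.get_token(now):
            self._pings[ping_key] = now + PING_LIVE_TIME

    def is_suspicious(self, ping_key: str, now: float, renew: bool = False) -> bool:
        """True if no valid ping exists for this key."""
        expires = self._pings.get(ping_key)
        if expires is None or now >= expires:
            self._pings.pop(ping_key, None)
            return True
        if renew:
            self._pings[ping_key] = now + PING_LIVE_TIME
        return False


if __name__ == "__main__":
    lt = LinkTokenSketch()
    token = lt.get_token(now=0.0)
    print(lt.is_suspicious("client-key", now=1.0))  # no ping yet
    lt.ping("client-key", token, now=1.0)
    print(lt.is_suspicious("client-key", now=2.0))  # ping stored
```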

Probe HTTP headers ¶

Method `http_accept`¶

The http_accept method evaluates a request as the request of a bot if the Accept header does not contain text/html.

Method `http_accept_encoding`¶

The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header contains neither gzip nor deflate (that is, if both values are missing).

Method `http_accept_language`¶

The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.

Method `http_connection`¶

The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.

Method `http_user_agent`¶

The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.

botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'¶: Regular expression that matches the User-Agent of known bots.
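The header probes of this section can be sketched as simple predicates over a plain dict of headers. This is an illustration under stated assumptions: the helper names are hypothetical, header lookup here is case-sensitive, BOT_UA is a shortened stand-in for the USER_AGENT constant, and whether the real method uses a prefix match or a search is not stated in this section.

```python
import re

# Shortened, illustrative pattern; the module's USER_AGENT constant is longer.
BOT_UA = re.compile(r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|python-requests)")


def _values(header: str) -> list:
    """Split a comma-separated header and strip parameters such as ';q=0.9'."""
    return [v.split(";")[0].strip().lower() for v in header.split(",") if v.strip()]


def failed_probes(headers: dict) -> list:
    """Return the names of the header probes that rate the request as a bot."""
    failed = []
    if "text/html" not in _values(headers.get("Accept", "")):
        failed.append("http_accept")
    encodings = _values(headers.get("Accept-Encoding", ""))
    if "gzip" not in encodings and "deflate" not in encodings:
        failed.append("http_accept_encoding")
    if not headers.get("Accept-Language", "").strip():
        failed.append("http_accept_language")
    if headers.get("Connection", "").strip().lower() == "close":
        failed.append("http_connection")
    user_agent = headers.get("User-Agent", "")
    if not user_agent or BOT_UA.search(user_agent):
        failed.append("http_user_agent")
    return failed


if __name__ == "__main__":
    print(failed_probes({"Accept": "*/*", "User-Agent": "curl/8.0.1"}))
```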

Config ¶

Configuration class Config with deep-update, schema validation and deprecated names.

The Config class implements a configuration that is based on structured dictionaries. The configuration schema is defined in a dictionary structure and the configuration data is given in a dictionary structure.

exception botdetection.config.SchemaIssue(level: Literal['warn', 'invalid'], msg: str)[source]¶: Exception to store and/or raise a message from a schema issue.

class botdetection.config.Config(cfg_schema: Dict, deprecated: Dict[str, str])[source]¶

Base class used for configuration

default(name: str)[source]¶: Returns default value of field name in self.cfg_schema.

get(name: str, default: Any = <UNSET>, replace: bool = True) → Any[source]¶

Returns the value to which name points in the configuration.

If there is no such name in the config and the default is UNSET, a KeyError is raised.

path(name: str, default: Any = <UNSET>)[source]¶: Get a pathlib.Path object from a config string.

pyobj(name, default: Any = <UNSET>)[source]¶: Get the Python object referred to by the fully qualified name (FQN) in the config string.

set(name: str, val)[source]¶

Set the value to which name points in the configuration.

If there is no such name in the config, a KeyError is raised.

update(upd_cfg: dict)[source]¶: Update this configuration by upd_cfg.

validate(cfg: dict)[source]¶: Validates the dictionary cfg against Config.SCHEMA. Validation is done by validate.
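The two ideas the class description relies on, deep-update of nested dictionaries and lookup by dotted name, can be sketched independently of the Config class. The functions deep_update and get_by_name and the UNSET sentinel below are hypothetical stand-ins written for this example; they are not the class's methods, and schema validation and deprecated names are left out.

```python
import copy

UNSET = object()  # sentinel meaning "no default given"


def deep_update(base: dict, upd: dict) -> None:
    """Merge upd into base in place, descending into nested dictionaries."""
    for key, val in upd.items():
        if isinstance(val, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], val)
        else:
            base[key] = copy.deepcopy(val)


def get_by_name(cfg: dict, name: str, default=UNSET):
    """Return the value a dotted name points to, or default if given."""
    node = cfg
    for part in name.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is UNSET:
                raise KeyError(name)
            return default
        node = node[part]
    return node


if __name__ == "__main__":
    cfg = {"botdetection": {"ip_limit": {"filter_link_local": False, "link_token": False}}}
    deep_update(cfg, {"botdetection": {"ip_limit": {"link_token": True}}})
    print(get_by_name(cfg, "botdetection.ip_limit.link_token"))
```

A plain dict.update would replace the whole ip_limit table; the recursive merge keeps filter_link_local while changing only link_token.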
