Bot Detection

Implementations used for bot detection.

botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: Config) IPv4Network | IPv6Network[source]

Returns the (client) network that real_ip is part of.
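As a rough illustration of what "the network of real_ip" means, the sketch below masks an address with a prefix length using the standard ipaddress module. The helper name client_network and the prefix lengths 32 and 48 are assumptions made for this example; the real function takes its settings from cfg.

```python
from ipaddress import ip_address, ip_network

# Hypothetical prefix lengths; the real function reads its settings from cfg.
IPV4_PREFIX = 32
IPV6_PREFIX = 48


def client_network(real_ip):
    """Return the network that contains real_ip for the assumed prefix."""
    prefix = IPV4_PREFIX if real_ip.version == 4 else IPV6_PREFIX
    # strict=False masks the host bits instead of raising ValueError
    return ip_network(f"{real_ip}/{prefix}", strict=False)


print(client_network(ip_address("192.0.2.17")))
print(client_network(ip_address("2001:db8:1:2::1")))
```

With a wider IPv6 prefix, all addresses a provider hands out to one customer fall into the same network, which is why limits are tracked per network rather than per address.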

botdetection.get_real_ip(request: Request) str[source]

Returns real IP of the request. Since not all proxies set all the HTTP headers and incoming headers can be faked it may happen that the IP cannot be determined correctly.

This function tries to get the remote IP in the order listed below. Additionally, some tests are done, and if inconsistencies or errors are detected, they are logged.

The remote IP of the request is taken from (first match):

botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) Response | None[source]

Returns an HTTP 429 response object and writes an ERROR message to the ‘botdetection’ logger. This function is used in part by the filter methods to return the default Too Many Requests response.

IP lists

Method ip_lists

The ip_lists method implements IP block- and pass-lists.

Config

[botdetection.ip_lists]

pass_ip = [
   '140.238.172.132', # IPv4 of check.searx.space
   '192.168.0.0/16',  # IPv4 private network
   'fe80::/10'        # IPv6 linklocal
]
block_ip = [
   '93.184.216.34', # IPv4 of example.org
   '257.1.1.1',     # invalid IP --> will be ignored, logged in ERROR class
]

Implementations

botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) Tuple[bool, str][source]

Checks whether real_ip belongs to one of the members (addresses or networks) of the botdetection.ip_lists.block_ip list.

botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) Tuple[bool, str][source]

Checks whether real_ip belongs to one of the members (addresses or networks) of the botdetection.ip_lists.pass_ip list.
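The list checks can be sketched with the standard ipaddress module. This is a minimal illustration, not the module's implementation: the helper name ip_in_list and the message texts are assumptions, while the list values are taken from the config example above. Invalid entries such as '257.1.1.1' are logged as errors and skipped.

```python
import logging
from ipaddress import ip_address, ip_network

logger = logging.getLogger("botdetection")


def ip_in_list(real_ip, entries):
    """Return (matched, message) for the first entry that contains real_ip."""
    for entry in entries:
        try:
            net = ip_network(entry, strict=False)
        except ValueError:
            # e.g. '257.1.1.1' is not a valid address: log it and skip it
            logger.error("invalid IP %s in list", entry)
            continue
        if real_ip.version == net.version and real_ip in net:
            return True, f"IP matches {net.compressed}"
    return False, "IP is not a member of the list"


block_ip = ["93.184.216.34", "257.1.1.1"]
pass_ip = ["140.238.172.132", "192.168.0.0/16", "fe80::/10"]

print(ip_in_list(ip_address("192.168.1.20"), pass_ip))
print(ip_in_list(ip_address("8.8.8.8"), block_ip))
```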

Rate limit

Method ip_limit

The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a redis DB and needs an HTTP X-Forwarded-For header. To protect privacy, only the hash value of an IP is stored in the redis DB, and for a maximum of 10 minutes.

The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method add the following configuration:

[botdetection.ip_limit]
link_token = true

If the link_token method is activated and a request is suspicious, the request rates are reduced:

  • BURST_MAX is reduced to BURST_MAX_SUSPICIOUS

  • LONG_MAX is reduced to LONG_MAX_SUSPICIOUS

To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.
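A minimal sketch of the two rules described above, using the constant values documented in the Implementations section. The helper names limits and check_suspicious_ip are hypothetical, the dictionary stands in for the redis DB, and the expiry after SUSPICIOUS_IP_WINDOW is left out to keep the example short.

```python
BURST_MAX, BURST_MAX_SUSPICIOUS = 15, 2
LONG_MAX, LONG_MAX_SUSPICIOUS = 150, 10
SUSPICIOUS_IP_MAX = 3

# In-memory stand-in for the redis DB; expiry is omitted in this sketch.
_suspicious_hits = {}


def limits(link_token_enabled, suspicious):
    """Return (burst_max, long_max) that apply to one request."""
    if link_token_enabled and suspicious:
        return BURST_MAX_SUSPICIOUS, LONG_MAX_SUSPICIOUS
    return BURST_MAX, LONG_MAX


def check_suspicious_ip(key, suspicious):
    """Return True if the client behind key should be blocked."""
    if not suspicious:
        # a non-suspicious request drops the window for this client
        _suspicious_hits.pop(key, None)
        return False
    _suspicious_hits[key] = _suspicious_hits.get(key, 0) + 1
    return _suspicious_hits[key] > SUSPICIOUS_IP_MAX
```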

Config

[botdetection.ip_limit]

# To get unlimited access in a local network, by default link-local addresses
# (networks) are not monitored by the ip_limit
filter_link_local = false

# activate link_token method in the ip_limit method
link_token = false

Implementations

botdetection.ip_limit.API_MAX = 4

Maximum requests from one IP in the API_WINDOW

botdetection.ip_limit.API_WINDOW = 3600

Time (sec) before sliding window for API requests (format != html) expires.

botdetection.ip_limit.BURST_MAX = 15

Maximum requests from one IP in the BURST_WINDOW

botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2

Maximum of suspicious requests from one IP in the BURST_WINDOW

botdetection.ip_limit.BURST_WINDOW = 20

Time (sec) before sliding window for burst requests expires.

botdetection.ip_limit.LONG_MAX = 150

Maximum requests from one IP in the LONG_WINDOW

botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10

Maximum suspicious requests from one IP in the LONG_WINDOW

botdetection.ip_limit.LONG_WINDOW = 600

Time (sec) before the longer sliding window expires.

botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3

Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.

botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000

Time (sec) before sliding window for one suspicious IP expires.

Generates a hashed key that corresponds (more or less) to a web browser session in a network.

Returns current token. If there is no currently active token a new token is generated randomly and stored in the redis DB.

Config:

  • TOKEN_LIVE_TIME

  • TOKEN_KEY

Checks whether a valid ping exists for this (client) network. If not, the request is rated as suspicious. If a valid ping exists and the argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.

This function is called by a request to the URL /client<token>.css. If the token is valid, a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.

Prefix of all ping-keys generated by get_ping_key

Key under which the current token is stored in the DB
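The token and ping flow described in this section can be sketched as below. Everything here is an illustration under assumptions: the function signatures are not shown in this section, so all names except get_ping_key, the numeric lifetimes, the key strings, and the inputs to the hash are hypothetical, and a dictionary with expiry timestamps stands in for the redis DB.

```python
import hashlib
import secrets
import time

TOKEN_LIVE_TIME = 600        # assumed value (seconds)
PING_LIVE_TIME = 3600        # assumed value (seconds)
TOKEN_KEY = "example.token"  # placeholder key name
PING_KEY = "example.ping["   # placeholder key prefix

_db = {}  # key -> (value, expires_at); stand-in for the redis DB


def _get(key, now):
    """Return the stored value, or None if it is missing or expired."""
    value, expires = _db.get(key, (None, 0.0))
    return value if expires > now else None


def get_token(now=None):
    """Return the active token, creating a random one if none is active."""
    now = time.time() if now is None else now
    token = _get(TOKEN_KEY, now)
    if token is None:
        token = secrets.token_urlsafe(16)
        _db[TOKEN_KEY] = (token, now + TOKEN_LIVE_TIME)
    return token


def get_ping_key(network, user_agent):
    """Hash network and User-Agent into a key approximating a session."""
    digest = hashlib.sha256(f"{network}|{user_agent}".encode()).hexdigest()
    return PING_KEY + digest


def ping(network, user_agent, token, now=None):
    """Handle /client<token>.css: store a ping-key if the token is valid."""
    now = time.time() if now is None else now
    active = _get(TOKEN_KEY, now)
    if active is not None and token == active:
        _db[get_ping_key(network, user_agent)] = (1, now + PING_LIVE_TIME)


def is_suspicious(network, user_agent, renew=False, now=None):
    """A request without a valid ping for its network is suspicious."""
    now = time.time() if now is None else now
    key = get_ping_key(network, user_agent)
    if _get(key, now) is None:
        return True
    if renew:
        _db[key] = (1, now + PING_LIVE_TIME)
    return False
```

A browser that renders the result page loads the stylesheet and thereby leaves a ping; a client that only fetches the HTML never does, which is what makes its later requests suspicious.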

Probe HTTP headers

Method http_accept

The http_accept method evaluates a request as the request of a bot if the Accept header ...

  • does not contain text/html

Method http_accept_encoding

The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header ...

  • contains neither gzip nor deflate (that is, both values are missing)

Method http_accept_language

The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.

Method http_connection

The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.
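The four header probes described so far can be sketched as one function. This is a simplified illustration: the name looks_like_bot and the returned messages are assumptions, headers is taken to be a plain dict with exactly these key spellings, and Accept-Encoding values carrying parameters (for example gzip;q=1.0) are not handled.

```python
def looks_like_bot(headers):
    """Return the reason a request is rated as a bot request, or None."""
    if "text/html" not in headers.get("Accept", ""):
        return "Accept does not contain text/html"

    encodings = [e.strip() for e in headers.get("Accept-Encoding", "").split(",")]
    if "gzip" not in encodings and "deflate" not in encodings:
        return "Accept-Encoding contains neither gzip nor deflate"

    if not headers.get("Accept-Language", "").strip():
        return "Accept-Language is unset"

    if headers.get("Connection", "").strip().lower() == "close":
        return "Connection is set to close"

    return None


browser = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}
print(looks_like_bot(browser))
print(looks_like_bot({**browser, "Connection": "close"}))
```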

Method http_user_agent

The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.

botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'

Regular expression that matches the User-Agent of known bots
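A short sketch of how such a pattern can be applied. The pattern below is a shortened excerpt of USER_AGENT, the helper name is_bot_user_agent is hypothetical, and the use of re.match (which anchors the pattern at the start of the header value) is an assumption of this example.

```python
import re

# Shortened excerpt of the USER_AGENT pattern, for illustration only.
USER_AGENT = (
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|python-requests"
    r"|Go-http-client|Googlebot|.*PetalBot.*)"
)
_ua_re = re.compile(USER_AGENT)


def is_bot_user_agent(user_agent):
    """True if the header is unset (or empty) or matches the pattern."""
    if not user_agent:
        return True
    # re.match only matches at the beginning of the string
    return _ua_re.match(user_agent) is not None


print(is_bot_user_agent("curl/8.4.0"))
print(is_bot_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

With an anchored match, only alternatives that start with .* (such as the PetalBot entry) can match in the middle of a User-Agent string.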

Config

Configuration class Config with deep-update, schema validation and deprecated names.

The Config class implements a configuration that is based on structured dictionaries. The configuration schema is defined in a dictionary structure and the configuration data is given in a dictionary structure.

exception botdetection.config.SchemaIssue(level: Literal['warn', 'invalid'], msg: str)[source]

Exception to store and/or raise a message from a schema issue.

class botdetection.config.Config(cfg_schema: Dict, deprecated: Dict[str, str])[source]

Base class used for configuration

default(name: str)[source]

Returns default value of field name in self.cfg_schema.

get(name: str, default: ~typing.Any = <UNSET>, replace: bool = True) Any[source]

Returns the value to which name points in the configuration.

If there is no such name in the config and the default is UNSET, a KeyError is raised.
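The lookup behaviour can be sketched with a sentinel object. This is an illustration on a plain nested dict, not the Config class itself: the function name cfg_get is hypothetical, and the dotted-name syntax is assumed from names such as botdetection.ip_limit used in this document.

```python
UNSET = object()  # sentinel meaning "no default was given"


def cfg_get(cfg, name, default=UNSET):
    """Look up a dotted name such as 'botdetection.ip_limit.link_token'."""
    node = cfg
    for part in name.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is UNSET:
                raise KeyError(name)
            return default
        node = node[part]
    return node


cfg = {"botdetection": {"ip_limit": {"link_token": False}}}
print(cfg_get(cfg, "botdetection.ip_limit.link_token"))
print(cfg_get(cfg, "botdetection.missing", default=None))
```

A dedicated sentinel is used instead of None so that None remains a legitimate default value.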

path(name: str, default: ~typing.Any = <UNSET>)[source]

Get a pathlib.Path object from a config string.

pyobj(name, default: ~typing.Any = <UNSET>)[source]

Get the Python object referred to by the fully qualified name (FQN) in the config string.
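Resolving a fully qualified name typically means importing the module part and fetching the attribute. The sketch below shows that idea with importlib; the helper name resolve_fqn is hypothetical and error handling (for example a name without a dot) is omitted.

```python
import importlib


def resolve_fqn(fqn):
    """Import 'package.module.attr' and return the object named attr."""
    module_name, _, attr = fqn.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


ip_address = resolve_fqn("ipaddress.ip_address")
print(ip_address("192.0.2.1"))
```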

set(name: str, val)[source]

Set the value to which name points in the configuration.

If there is no such name in the config, a KeyError is raised.

update(upd_cfg: dict)[source]

Update this configuration by upd_cfg.
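A deep update merges nested dictionaries key by key instead of replacing whole sub-dictionaries. The following sketch shows the general technique on plain dicts; the function name deep_update is an assumption, and any schema validation that the real method may perform is left out.

```python
def deep_update(base, upd):
    """Recursively merge upd into base (in place) and return base."""
    for key, val in upd.items():
        if isinstance(val, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], val)
        else:
            base[key] = val
    return base


cfg = {"botdetection": {"ip_limit": {"filter_link_local": False, "link_token": False}}}
deep_update(cfg, {"botdetection": {"ip_limit": {"link_token": True}}})
print(cfg)
```

Only link_token changes here; filter_link_local keeps its previous value because the update does not mention it.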

validate(cfg: dict)[source]

Validates the dictionary cfg against Config.SCHEMA. Validation is done by validate.