How CAPTCHA challenges are retrieved and processed, and some of their pros and cons.
With an ever-increasing number of crawlers and bots flooding the web, it is no longer a question of whether a CAPTCHA, or some other bot filter, is required. From “Sign In” forms to “Contact Us” forms, bots often take advantage of unsecured or unprotected (i.e. “open”) forms and systems to spam services or take them down (a denial-of-service attack). Other bot filters include IP (VPN) checks, geo-location checks, and more. These more “invasive” techniques tend to be used for more sophisticated applications, such as protecting order forms, limiting requests, and filtering content.
Having said that, for everything else, a CAPTCHA will usually suffice. A CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart, is essentially a simple Turing test: a test that only humans should be able to complete. Most “bot filters” operate by testing every user, though the advent of “invisible” bot detection services, such as Google’s reCAPTCHA v3, allows for an improved user experience (UX), with only requests that look suspicious needing further verification, such as a familiar “Are You a Robot?” test.
CAPTCHAs first need to be installed by the website owner or the company operating the website. By choosing a provider for user challenges, site owners can avoid having to create and manage dynamically generated CAPTCHA challenges on their own. A simple block diagram of how challenges are retrieved and processed is below:
Once the challenge is shown on the page (challenges can vary from picking a set of images or entering hard-to-read text to an auditory challenge for visually impaired users), the user completes it and submits the form. The request is then sent to your form/input processing logic, which should verify the challenge response with the provider before handling the submission.
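The server-side check described above might look something like the following. This is a minimal sketch rather than any provider's official client: it assumes Google's reCAPTCHA `siteverify` endpoint and its `secret`/`response` form fields, while the function names `post_form` and `is_human`, the injectable `post` parameter, and the `0.5` score threshold are hypothetical choices made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Verification endpoint for Google's reCAPTCHA; other providers use their own URL.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def post_form(url, fields):
    """POST form-encoded fields and return the decoded JSON response."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_human(token, secret, min_score=0.5, post=post_form):
    """Return True if the provider accepts the token submitted with the form.

    min_score only matters for score-based responses (e.g. reCAPTCHA v3);
    0.5 is an assumed threshold, not a recommendation.
    """
    if not token:
        return False  # the form arrived without a challenge response
    result = post(VERIFY_URL, {"secret": secret, "response": token})
    if not result.get("success"):
        return False
    # Responses without a score (checkbox-style challenges) are treated as passing.
    return result.get("score", 1.0) >= min_score
```

The form handler would call `is_human(token, secret)` with the token posted by the page and the site's secret key, and only process the submission when it returns `True`. Passing the HTTP call in as `post` keeps the logic testable without touching the network.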
With the challenge successfully completed, the form/input can be processed (with an adequately low risk that the submission came from an automated bot).
VPN IP validation usually involves a simple check against a database of known IP ranges and is often used for content blocking (Netflix, for example, has its own database to check whether or not an IP address is from a known VPN or datacentre as opposed to a residential/consumer network).
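A range lookup of this kind can be sketched with Python's standard `ipaddress` module. The ranges below are reserved documentation blocks standing in for a real VPN/datacentre database, and `KNOWN_RANGES` and `is_known_vpn_or_datacentre` are hypothetical names used only for this example.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks); a real deployment
# would load these from a maintained VPN/datacentre database.
KNOWN_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_known_vpn_or_datacentre(ip, ranges=KNOWN_RANGES):
    """Return True if the address falls inside any listed range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ranges)
```

A linear scan is fine for a handful of ranges; a large database would more likely be indexed (for example, sorted and searched by network address) so that lookups stay fast.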
Geo-location checks are used to verify that a user isn’t hundreds, if not thousands, of kilometres away from where their IP address places them (which would indicate that they are not really where they say they are). Such a mismatch can suggest that a user is either fraudulent or is filling in random information in an attempt to spam or abuse a service.
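As a rough illustration of the distance comparison, the sketch below uses the haversine formula to measure how far a claimed location is from the location derived from the IP address. It assumes both locations are already available as latitude/longitude pairs (the IP lookup itself is out of scope here); `distance_km`, `location_mismatch`, and the 500 km threshold are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against floating-point values marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def location_mismatch(claimed, ip_located, max_km=500.0):
    """Flag a request whose claimed (lat, lon) is far from its IP's (lat, lon)."""
    return distance_km(*claimed, *ip_located) > max_km
```

For example, a billing address in London paired with an IP address located in Sydney would be flagged, whereas London paired with Paris would fall under the assumed 500 km threshold.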
Behaviour-based checks are quite invasive to the end-user, but allow a website owner to see whether any particular request is acting strangely. That is, by monitoring movement around a page, the speed of a request, and other metrics, a score can be generated to determine whether a request 1) requires further validation, or 2) is blocked completely.
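The scoring idea above can be sketched as a few weighted signals feeding a three-way decision. Everything here is illustrative: the `RequestSignals` fields, the point weights, and the thresholds are assumptions chosen for the example, not values used by any real service.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    pointer_events: int        # pointer/touch movements seen before submit
    seconds_to_submit: float   # time from page load to submission
    requests_last_minute: int  # recent requests from the same client


def risk_score(signals):
    """Add up risk points (0-100); the weights are arbitrary for this sketch."""
    score = 0
    if signals.pointer_events == 0:
        score += 40  # no movement around the page at all
    if signals.seconds_to_submit < 2.0:
        score += 30  # form submitted faster than a person could type
    if signals.requests_last_minute > 30:
        score += 30  # unusually high request rate
    return score


def decide(signals, challenge_at=30, block_at=70):
    """Map the score to 'allow', 'challenge' (further validation) or 'block'."""
    score = risk_score(signals)
    if score >= block_at:
        return "block"
    if score >= challenge_at:
        return "challenge"
    return "allow"
```

A request with no pointer movement but otherwise normal timing would be sent for further validation, while one that also submits instantly and at a high rate would be blocked outright.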
In an ever-growing Internet, bot filtering is an inevitability that most webmasters have to contend with. At best, such techniques make the user experience (UX) worse; at worst, legitimate users are barred from accessing a particular service. CAPTCHAs also raise accessibility concerns, though Google’s reCAPTCHA and hCaptcha have attempted to accommodate users with impairments through audio-based CAPTCHAs (hCaptcha offers a service where disabled users can opt out of a limited number of CAPTCHAs per day).
All in all, while CAPTCHAs continue to grow in complexity and popularity, the mantra that “it only takes one to ruin it for all” (the “one” being automated bots and spammers) holds ever true. CAPTCHA-filling services (which bypass the Turing test by outsourcing the completion of bot checks) and more advanced bots are a growing threat to existing filtering techniques. Thus, for applications that require the extra assurance, CAPTCHAs can be combined with a (non-exhaustive) selection of VPN/datacentre IP checks, behaviour-based filtering, and fraud detection tools/databases.
A crawler is an automated application used to scrape (i.e. take) content from other sources.
A bot is an automated tool that can perform a variety of pre-determined actions, such as scraping or sending form submissions.
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It is a security test used to deter bots.
Turing tests are used to distinguish whether a particular entity is a computer or a human. More specifically, they are a measure of a bot/computer's intelligence relative to a human.