What is URL encoding? How does it work - and what is it used for?
URL encoding is an encoding format used in URLs. The standard allows the use of arbitrary data inside a Uniform Resource Identifier (a URI; typically a URL) while using only a narrow set of US-ASCII characters. The encoding exists because URLs and HTTP request parameters often contain characters (or other data) that cannot be represented with that limited set of US-ASCII characters (e.g. control characters or non-ASCII characters).
In general, a URI can contain characters that are either reserved or unreserved. Unreserved characters are characters that have no special meaning; they can be displayed as-is and require no special handling. These include uppercase and lowercase letters (A-Z, a-z), decimal digits (0-9), hyphen (-), period (.), underscore (_), and tilde (~).
Reserved characters, on the other hand, are characters that may delimit the URI into sub-components: characters such as / # & and others. The following is the list of all reserved characters: ! # $ & ' ( ) * + , / : ; = ? @ [ ]
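As a rough illustration of the two character classes, the sketch below sorts a single character into "unreserved", "reserved", or "other". The helper name classifyChar is hypothetical and exists only for this example; the character sets are copied from the lists above.

```javascript
// Character sets copied from the lists in the text above.
const UNRESERVED_CHARS = /^[A-Za-z0-9\-._~]$/;
const RESERVED_CHARS = "!#$&'()*+,/:;=?@[]";

// classifyChar is a hypothetical helper, not part of any standard API.
function classifyChar(ch) {
  if (typeof ch !== "string" || ch.length !== 1) {
    throw new TypeError("expected a single-character string");
  }
  if (UNRESERVED_CHARS.test(ch)) return "unreserved";
  if (RESERVED_CHARS.includes(ch)) return "reserved";
  // Everything else (space, %, control and non-ASCII characters, ...)
  return "other";
}

for (const ch of ["a", "~", "#", "&", " ", "ž"]) {
  console.log(JSON.stringify(ch), classifyChar(ch));
}
```

Characters that land in "other" are the ones that always need percent-encoding; reserved characters need it only when they are meant as data.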
We cannot use reserved characters as-is when they are meant as data, because this would create ambiguous URIs. For instance, consider the URL http://example.com/foo#bar. Does this URL point to an anchor #bar inside the resource /foo, or does it point to a resource /foo#bar, that is, a resource whose name contains the character #? Without URL encoding it would be impossible to tell.
We resolve such ambiguities by encoding reserved characters differently when they are used as data; when they are used as delimiters, we leave them as-is.
To encode reserved characters, we use the percent-encoding scheme. In percent-encoding, each byte is encoded as a character triplet that consists of the percent character % followed by the two hexadecimal digits that represent the byte's numeric value. For instance, %23 is the percent-encoding for the binary octet 00100011, which in US-ASCII corresponds to the character #. Strictly speaking, the percent character (%) isn't reserved, but it serves as the indicator for percent-encoded bytes and therefore requires special handling. Simply put: a literal % must also be percent-encoded (as %25).
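The triplet rule can be sketched in a few lines. The function name percentEncodeByte is an assumption made for this example; it simply formats one byte value as % followed by two uppercase hexadecimal digits.

```javascript
// percentEncodeByte is a hypothetical helper used only for illustration.
function percentEncodeByte(byte) {
  if (!Number.isInteger(byte) || byte < 0 || byte > 255) {
    throw new RangeError("expected an integer between 0 and 255");
  }
  // Two hex digits, zero-padded, so that 0x0A becomes "%0A" and not "%A".
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

console.log(percentEncodeByte(0b00100011));        // %23  (the character #)
console.log(percentEncodeByte("%".charCodeAt(0))); // %25  (the percent sign itself)
```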
So with percent-encoding, we know that the URL http://example.com/foo#bar points to an anchor bar inside resource /foo, while http://example.com/foo%23bar points to resource /foo#bar, where the character # is encoded as %23.
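One way to observe this distinction is to parse both URLs with the standard URL class, which is available in browsers and in Node.js. This is only a small sketch; the variable names are arbitrary.

```javascript
// A literal # acts as a delimiter: it splits the path from the fragment.
const withAnchor = new URL("http://example.com/foo#bar");
console.log(withAnchor.pathname); // /foo
console.log(withAnchor.hash);     // #bar

// An encoded %23 is data: it stays inside the path and there is no fragment.
const withEncodedHash = new URL("http://example.com/foo%23bar");
console.log(withEncodedHash.pathname);                     // /foo%23bar
console.log(decodeURIComponent(withEncodedHash.pathname)); // /foo#bar
console.log(withEncodedHash.hash === "");                  // true
```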
Percent-encoding is also used to represent other characters: characters that are neither reserved nor unreserved. As an example, imagine a GET request containing a non-ASCII string parameter, such as the search query zajec in jež, which is Slovenian for "a rabbit and a hedgehog".
In such cases, we have to first encode non-ASCII characters as UTF-8 and then encode each byte of the new string with percent-encoding. So if we send a GET request to the DuckDuckGo search engine containing the search query zajec in jež, we generate the following URL: https://duckduckgo.com/?q=zajec%20in%20je%C5%BE
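The two steps (UTF-8 first, then percent-encoding byte by byte) can be sketched as follows. The function percentEncode is a hypothetical helper written for this example; it leaves only unreserved characters untouched and encodes every other byte. The built-in encodeURIComponent is stricter about what it accepts and leaves a few more characters unescaped, but for this query both produce the same result.

```javascript
// Bytes that correspond to unreserved characters are copied through unchanged.
const UNRESERVED_BYTE = /^[A-Za-z0-9\-._~]$/;

// percentEncode is a hypothetical helper, shown only to make the steps explicit.
function percentEncode(text) {
  // Step 1: turn the string into its UTF-8 bytes.
  const bytes = new TextEncoder().encode(text);
  // Step 2: percent-encode every byte that is not an unreserved character.
  let out = "";
  for (const byte of bytes) {
    const ch = String.fromCharCode(byte);
    out += UNRESERVED_BYTE.test(ch)
      ? ch
      : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

const query = "zajec in jež";
console.log(percentEncode(query));      // zajec%20in%20je%C5%BE
console.log(encodeURIComponent(query)); // zajec%20in%20je%C5%BE
```

The letter ž becomes two triplets because its UTF-8 encoding is two bytes long (C5 BE).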
The space character
You may have seen cases where the space character was encoded as the character +; however, percent-encoding suggests it should be encoded as %20 (in US-ASCII, the space character is 20 hexadecimal or 32 decimal). So what is going on?
Such encodings are typically created by HTML forms. When a user submits an HTML form, the data is URL-encoded using an early version of the URI percent-encoding rules that contained a number of modifications, such as replacing spaces with +.
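The difference between the two conventions can be seen by serializing the same value twice: once with the standard URLSearchParams class, which produces form-style output, and once with encodeURIComponent. The helper formDecode is a hypothetical name for this sketch; it shows that a form-style decoder must turn + into a space before percent-decoding.

```javascript
const value = "zajec in jež";

// Form-style serialization: spaces become +.
const formEncoded = new URLSearchParams({ q: value }).toString();
console.log(formEncoded); // q=zajec+in+je%C5%BE

// Plain percent-encoding: spaces become %20.
console.log(encodeURIComponent(value)); // zajec%20in%20je%C5%BE

// formDecode is a hypothetical helper: + means space only in form-encoded data.
const formDecode = (s) => decodeURIComponent(s.replace(/\+/g, " "));
console.log(formDecode("search+term"));         // search term
console.log(decodeURIComponent("search+term")); // search+term
```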
Note, however, that using + instead of %20 is valid only when encoding application/x-www-form-urlencoded content, such as the query part of a URL. To make this clearer, consider the following cases.
http://www.example.com/search+script.php?search+query=search+term
In this URL, the resource being requested is search+script.php (the plus character (+) is part of the filename), while the parameter name is search query and its value is search term. In the name of the query parameter and in its value, the + sign is converted to a space, while in the name of the resource, search+script.php, the + sign remains.
http://www.example.com/search+script.php?search%20query=search%20term
This case is identical to the example above. The difference (using %20 instead of the + sign in the parameter name and value) is only superficial. Both URLs point to the same resource, search+script.php, and they contain the same parameters.
http://www.example.com/search%20script.php?search%20query=search%20term
This example, however, is different. Here the resource name contains an actual space character, so the name of the requested resource is search script.php; the request parameter names and values remain the same as above. Consequently, this URL is different from those above.
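The three cases above can be checked with the standard URL class. This is a sketch only: the describe helper is a hypothetical name, and the path is decoded with decodeURIComponent (which never treats + as a space), while the query is read through searchParams (which applies the form-style rules).

```javascript
const urls = [
  "http://www.example.com/search+script.php?search+query=search+term",
  "http://www.example.com/search+script.php?search%20query=search%20term",
  "http://www.example.com/search%20script.php?search%20query=search%20term",
];

// describe is a hypothetical helper that returns the decoded resource and parameters.
function describe(text) {
  const url = new URL(text);
  return {
    resource: decodeURIComponent(url.pathname),
    params: [...url.searchParams.entries()],
  };
}

for (const text of urls) {
  console.log(JSON.stringify(describe(text)));
}
```

The first two URLs yield the resource /search+script.php, the third yields /search script.php, and all three carry the single parameter search query with the value search term.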
The HTML application below performs URL encoding and decoding on arbitrary strings. Feel free to test it out.
Input <br>
<input type="text" name="input" id="input"><br><br>
Output <br>
<input type="text" name="encoded" id="encoded">
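<script>
  // References to the two text fields, assigned once the DOM is ready.
  let input = null;
  let encoded = null;

  document.addEventListener("DOMContentLoaded", () => {
    input = document.querySelector("#input");
    encoded = document.querySelector("#encoded");
    // The "input" event also fires on paste and cut, which keyup can miss.
    input.addEventListener("input", encode);
    encoded.addEventListener("input", decode);
  });

  // Percent-encode the plain text and show it in the output field.
  function encode() {
    encoded.value = encodeURIComponent(input.value);
  }

  // Decode the output field back into plain text.
  function decode() {
    try {
      input.value = decodeURIComponent(encoded.value);
    } catch (error) {
      // decodeURIComponent throws a URIError on malformed sequences such as "%E0".
      input.value = "Invalid URI string";
    }
  }
</script>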