urlutils - Structured URL
urlutils is a module dedicated to one of software’s most
versatile, well-aged, and beloved data structures: the URL, also known
as the Uniform Resource Locator.
Among other things, this module is a full reimplementation of URLs,
without any reliance on the standard library's URL modules. The centerpiece
and top-level interface of urlutils is the URL type. Also featured is a
convenience function. Some low-level functions and constants are also provided.
New in version 17.2.
The URL type
The URL is one of the most ubiquitous data structures in the virtual and physical landscape. From blogs to billboards, URLs are so common that it’s easy to overlook their complexity and power.
There are 8 parts of a URL, each with its own semantics and special characters:
- scheme
- username
- password
- host
- port
- path
- query_params
- fragment
Each is exposed as an attribute on the URL object. RFC 3986 offers this brief structural summary of the main URL components:
foo://user:pass@example.com:8042/over/there?name=ferret#nose
\_/   \_______/ \_________/ \__/\_________/ \_________/ \__/
 |        |          |        |      |           |        |
scheme userinfo     host     port   path        query  fragment
And here’s how that example can be manipulated with the URL type:
>>> url = URL('foo://example.com:8042/over/there?name=ferret#nose')
>>> print(url.host)
example.com
>>> print(url.get_authority())
example.com:8042
>>> print(url.qp['name'])  # qp is a synonym for query_params
ferret
URL’s approach to encoding is that inputs are decoded as much as possible, and data remains in this decoded state until re-encoded using the to_text() method. In this way, it’s similar to Python’s current approach of encouraging immediate decoding of bytes to text.
Note that URL instances are mutable objects. If an immutable representation of the URL is desired, the string from to_text() may be used. For an immutable, but almost-as-featureful, URL object, check out the hyperlink package.
The scheme is an ASCII string, normally lowercase, which specifies the semantics for the rest of the URL, as well as network protocol in many cases. For example, “http” in “http://hatnote.com”.
The username is a string used by some schemes for authentication. For example, “public” in “ftp://public@example.com”.
The password is a string also used for authentication. Although passwords in URLs are technically deprecated by RFC 3986 Section 7.5, they’re still used in cases when the URL is private or the password is public. For example, “password” in “db://private:password@example.com”.
The host is a string used to resolve the network location of the resource, either empty, a domain, or IP address (v4 or v6). “example.com”, “127.0.0.1”, and “::1” are all good examples of host strings.
The port is an integer. Many schemes have default ports, such as 80 for HTTP and 22 for SSH, and Section 3.2.3 of RFC 3986 states that when a URL’s port is the same as its scheme’s default port, the port should not be emitted:
>>> URL(u'https://github.com:443/mahmoud/boltons').to_text()
u'https://github.com/mahmoud/boltons'
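The default-port rule can be sketched in a few lines of plain Python. This is an illustration only, not the module's implementation: `DEFAULT_PORTS` and `render_netloc` are hypothetical names, and the table is deliberately tiny.

```python
# Hypothetical, abbreviated port table for illustration only.
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ssh': 22}


def render_netloc(scheme, host, port=None):
    """Return host[:port], leaving the port out when it is the scheme default."""
    if port is None or DEFAULT_PORTS.get(scheme.lower()) == port:
        return host
    return '%s:%d' % (host, port)


print(render_netloc('https', 'github.com', 443))   # default port is dropped
print(render_netloc('https', 'github.com', 8443))  # non-default port is kept
```

An unregistered scheme has no entry in the table, so any explicit port is kept for it.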
The path is the string starting with the first leading slash after the authority part of the URL, ending with the first question mark. Often percent-quoted for network use. “/a/b/c” is the path of “http://example.com/a/b/c?d=e”.
path_parts is the tuple form of path, split on slashes. Empty slash segments are preserved, including that of the leading slash:
>>> url = URL(u'http://example.com/a/b/c')
>>> url.path_parts
(u'', u'a', u'b', u'c')
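To see why empty segments are worth preserving, here is a minimal sketch using only string methods. `split_path` and `join_path` are hypothetical helpers, not part of urlutils; the point is that keeping empty segments makes the split reversible.

```python
def split_path(path):
    """Split a path on slashes, keeping empty segments."""
    return tuple(path.split('/'))


def join_path(parts):
    """Inverse of split_path: rejoin segments with slashes."""
    return '/'.join(parts)


parts = split_path('/a//b/')
print(parts)             # the leading, doubled, and trailing slashes all survive
print(join_path(parts))  # round-trips to the original text
```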
query_params (also available as qp) holds the keys and values of the URL’s query string:
>>> url = URL('http://boltons.readthedocs.io/en/latest/?utm_source=docs&sphinx=ok')
>>> url.qp.keys()
[u'utm_source', u'sphinx']
Also percent-encoded for network use cases.
The fragment is the string following the first ‘#’ after the query_params until the end of the URL. It has no inherent internal structure, and is percent-quoted.
from_parts(scheme=None, host=None, path_parts=(), query_params=(), fragment=u'', port=None, username=None, password=None)
Build a new URL from parts. Note that the respective arguments are not in the order they would appear in a URL:
- scheme (str) – The scheme of a URL, e.g., ‘http’
- host (str) – The host string, e.g., ‘hatnote.com’
- path_parts (tuple) – The individual text segments of the path, e.g., (‘post’, ‘123’)
- query_params (dict) – An OMD, dict, or list of (key, value) pairs representing the keys and values of the URL’s query parameters.
- fragment (str) – The fragment of the URL, e.g., ‘anchor1’
- port (int) – The integer port of the URL. Automatic defaults are available for registered schemes.
- username (str) – The username for the userinfo part of the URL.
- password (str) – The password for the userinfo part of the URL.
Note that this method does relatively little validation. URL.to_text() should be used to check if any errors are produced while composing the final textual URL.
Render a string representing the current state of the URL object.
>>> url = URL('http://listen.hatnote.com')
>>> url.fragment = 'en'
>>> print(url.to_text())
http://listen.hatnote.com#en
By setting the full_quote flag, the URL can either be fully quoted or minimally quoted. The most common characteristic of an encoded URL is the presence of percent-encoded text (e.g., %60). Unquoted URLs are more readable and suitable for display, whereas fully-quoted URLs are more conservative and generally necessary for sending over the network.
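The difference between a readable form and a network-safe form can be illustrated with the standard library's `urllib.parse.quote`. This is a stand-in used for illustration; it is not the quoting code that urlutils itself uses, and the sample path is made up.

```python
from urllib.parse import quote, unquote

# A readable path, suitable for display
display_path = '/wiki/Zürich guide'

# Percent-encode everything except the path separator for network use
network_path = quote(display_path, safe='/')
print(network_path)

# Decoding recovers the readable form
print(unquote(network_path) == display_path)
```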
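Return the default port for the currently-set scheme. Returns None if the scheme is unrecognized. If port matches this value, no port is emitted in the output of to_text().
Applies the same ‘+’ heuristic used to decide whether a scheme uses ://, checking the part of the scheme after the last + character.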
Whether or not a URL uses :// to separate the scheme from the rest of the URL depends on the scheme’s own standard definition. There is no way to infer this behavior from other parts of the URL. A scheme either supports network locations or it does not.
The URL type’s approach to this is to check for explicitly registered schemes, with common schemes like HTTP preregistered. This is the same approach taken by the standard library’s URL parsing.
URL adds two additional heuristics if the scheme as a whole is not registered. First, it attempts to check the subpart of the scheme after the last + character. This adds intuitive behavior for schemes like git+ssh. Second, if a URL with an unrecognized scheme is loaded, it will maintain the separator it sees.
>>> print(URL('fakescheme://test.com').to_text())
fakescheme://test.com
>>> print(URL('mockscheme:hello:world').to_text())
mockscheme:hello:world
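The lookup order described above (exact registration, then the part after the last +, then the separator seen at parse time) might be sketched as follows. The registry contents and the function signature are assumptions made for illustration; they are not the module's actual data or API.

```python
# Hypothetical registry for illustration: True means the scheme uses '://'.
NETLOC_SCHEMES = {'http': True, 'https': True, 'ssh': True, 'mailto': False}


def uses_netloc(scheme, seen_separator=None):
    """Guess whether a scheme separates itself from the rest with '://'.

    seen_separator records whether '://' was present when the URL was
    parsed; it is only consulted for schemes that are not registered.
    """
    scheme = scheme.lower()
    if scheme in NETLOC_SCHEMES:
        return NETLOC_SCHEMES[scheme]
    # Fall back to the subpart after the last '+', e.g. 'ssh' in 'git+ssh'
    subscheme = scheme.rpartition('+')[2]
    if subscheme in NETLOC_SCHEMES:
        return NETLOC_SCHEMES[subscheme]
    return seen_separator


print(uses_netloc('git+ssh'))
print(uses_netloc('fakescheme', seen_separator=True))
print(uses_netloc('mockscheme', seen_separator=False))
```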
Used by URL schemes that have a network location, this method combines username, password, host, and port into one string, the authority, that is used for connecting to a network-accessible resource.
Used internally by to_text() and can be useful for labeling connections.
>>> url = URL('ftp://user@ftp.debian.org:2121/debian/README')
>>> print(url.get_authority())
ftp.debian.org:2121
>>> print(url.get_authority(with_userinfo=True))
user@ftp.debian.org:2121
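Composing an authority string from its parts can be sketched like this. `build_authority` is a hypothetical helper written for illustration, not the method's actual implementation; in particular, it does no quoting of the userinfo.

```python
def build_authority(host, port=None, username=None, password=None,
                    with_userinfo=False):
    """Combine userinfo, host, and port into a single authority string."""
    parts = []
    if with_userinfo and username:
        userinfo = username
        if password:
            userinfo += ':' + password
        parts.append(userinfo + '@')
    # IPv6 literals are bracketed so their colons are not read as a port
    parts.append('[%s]' % host if ':' in host else host)
    if port is not None:
        parts.append(':%d' % port)
    return ''.join(parts)


print(build_authority('ftp.debian.org', 2121, username='user'))
print(build_authority('ftp.debian.org', 2121, username='user',
                      with_userinfo=True))
```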
Resolve any “.” and “..” references in the path, as well as normalize scheme and host casing. To turn off case normalization, pass
More information can be found in Section 6.2.2 of RFC 3986.
Factory method that returns a _new_ URL based on a given destination, dest. Useful for navigating those relative links with ease.
The newly created URL is normalized before being returned.
>>> url = URL('http://boltons.readthedocs.io')
>>> url.navigate('en/latest/')
URL(u'http://boltons.readthedocs.io/en/latest/')
Parameters: dest (str) – A string or URL object representing the destination
More information can be found in Section 5 of RFC 3986.
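For a feel of the reference resolution described in Section 5 of RFC 3986, the standard library's `urljoin` performs the same kind of resolution on plain strings. It is shown here only as an illustration of the concept, not as how navigation is implemented in this module; `base` and the result names are made up for the example.

```python
from urllib.parse import urljoin

base = 'http://boltons.readthedocs.io/en/latest/'

# A relative reference replaces the last path segment of the base
sibling = urljoin(base, 'urlutils.html')
# '..' climbs one directory before descending again
parent = urljoin(base, '../stable/')
# A leading slash replaces the whole path but keeps scheme and host
rooted = urljoin(base, '/about')

print(sibling)
print(parent)
print(rooted)
```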
A slew of functions used internally by URL.
Used to parse the text for a single URL into a dictionary, used internally by the URL type.
Note that “URL” has a very narrow, standards-based definition. While parse_url() raises URLParseError under a very limited number of conditions, such as non-integer port, a surprising number of strings are technically valid URLs. For instance, the text "url" is a valid URL, because it is a relative path.
In short, do not expect this function to validate form inputs or other more colloquial usages of URLs.
>>> res = parse_url('http://127.0.0.1:3000/?a=1')
>>> sorted(res.keys())  # res is a basic dictionary
['_netloc_sep', 'authority', 'family', 'fragment', 'host', 'password', 'path', 'port', 'query', 'scheme', 'username']
Low-level function used to parse the host portion of a URL.
Returns a tuple of (family, host) where family is a socket module constant or None, and host is a string.
>>> import socket
>>> parse_host('googlewebsite.com') == (None, 'googlewebsite.com')
True
>>> parse_host('[::1]') == (socket.AF_INET6, '::1')
True
>>> parse_host('192.168.1.1') == (socket.AF_INET, '192.168.1.1')
True
Odd doctest formatting above due to py3’s switch from int to enums for socket constants.
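The (family, host) classification can be approximated with the standard library's `ipaddress` module. This is a sketch under the assumption that bracketed hosts are IPv6 literals; `classify_host` is a hypothetical name, and the real function may differ in edge cases.

```python
import ipaddress
import socket


def classify_host(host):
    """Return (family, host), with family None for anything but an IP literal."""
    # Strip the brackets that surround IPv6 literals inside URLs
    bracketed = host.startswith('[') and host.endswith(']')
    text = host[1:-1] if bracketed else host
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        return None, host
    family = socket.AF_INET6 if address.version == 6 else socket.AF_INET
    return family, text


print(classify_host('192.168.1.1') == (socket.AF_INET, '192.168.1.1'))
print(classify_host('[::1]') == (socket.AF_INET6, '::1'))
```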
parse_qsl(qs, keep_blank_values=True, encoding='utf8')
Converts a query string into a list of (key, value) pairs.
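The standard library has a function of the same name, which is a convenient way to see what a list of (key, value) pairs looks like and what keep_blank_values changes. Note that the standard library version defaults to keep_blank_values=False, unlike the signature above; the sample query string is made up for the example.

```python
from urllib.parse import parse_qsl

query = 'a=1&a=2&b=&c&q=caf%C3%A9'

# Keep parameters whose value is empty or missing
pairs_with_blanks = parse_qsl(query, keep_blank_values=True)
# Standard library default: blank values are dropped
pairs_without_blanks = parse_qsl(query)

print(pairs_with_blanks)
print(pairs_without_blanks)
```

Repeated keys appear once per occurrence and order is preserved, which is why a list of pairs is used rather than a plain dict.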
Normalize the URL path by resolving segments of ‘.’ and ‘..’, resulting in a dot-free path. See RFC 3986 section 5.2.4, Remove Dot Segments.
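Dot-segment removal can be sketched on a list of path segments. This is a simplified illustration working on segments rather than on text as the RFC's algorithm does; `resolve_dot_segments` is a hypothetical name and not the module's implementation.

```python
def resolve_dot_segments(path):
    """Return path with '.' and '..' segments resolved."""
    parts = path.split('/')
    resolved = []
    for part in parts:
        if part == '.':
            continue
        if part == '..':
            # Never pop the leading empty segment that stands for the root slash
            if resolved and resolved != ['']:
                resolved.pop()
            continue
        resolved.append(part)
    # A trailing '.' or '..' refers to a directory, so keep the final slash
    if parts[-1] in ('.', '..'):
        resolved.append('')
    return '/'.join(resolved)


print(resolve_dot_segments('/a/b/c/./../../g'))
print(resolve_dot_segments('/a/b/..'))
```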
A subclass of OrderedMultiDict specialized for representing query string values. Everything is fully unquoted on load and all parsed keys and values are strings by default.
As the name suggests, multiple values are supported and insertion order is preserved.
>>> qp = QueryParamDict.from_text(u'key=val1&key=val2&utm_source=rtd')
>>> qp.getlist('key')
[u'val1', u'val2']
>>> qp['key']
u'val2'
>>> qp.add('key', 'val3')
>>> qp.to_text()
'key=val1&key=val2&utm_source=rtd&key=val3'
See OrderedMultiDict for more API features.
URLs have many parts, and almost as many individual “quoting” (encoding) strategies.
Quote special characters in either the username or password section of the URL. Note that userinfo in URLs is considered deprecated in many circles (especially browsers), and support for percent-encoded userinfo can be spotty.
Percent-encode a single segment of a URL path.
Percent-encode a single query string key or value.
Quote the fragment part of the URL. Fragments don’t have subdelimiters, so the whole URL fragment can be passed.
There is, however, only one unquoting strategy:
unquote(string, encoding='utf-8', errors='replace')
Percent-decode a string, by replacing %xx escapes with their single-character equivalent. The optional encoding and errors parameters specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method. By default, percent-encoded sequences are decoded with UTF-8, and invalid sequences are replaced by a placeholder character.
>>> unquote(u'abc%20def')
u'abc def'
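The effect of the encoding and errors parameters is easiest to see with a byte that is not valid UTF-8. The sketch below uses the standard library's `urllib.parse.unquote`, which has the same signature as shown above; that the two behave identically on every input is an assumption.

```python
from urllib.parse import unquote

# %E9 is 'é' in Latin-1 but is not a valid UTF-8 sequence on its own
as_utf8 = unquote('caf%E9')
as_latin1 = unquote('caf%E9', encoding='latin-1')

print(repr(as_utf8))    # the invalid byte becomes the U+FFFD placeholder
print(repr(as_latin1))  # decoded as intended
```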
Keys are lowercase strings, values are integers or None, with None indicating that the scheme does not have a default port (or may not support ports at all):
>>> boltons.urlutils.SCHEME_PORT_MAP['http']
80
>>> print(boltons.urlutils.SCHEME_PORT_MAP['file'])
None
Also available in JSON.