duck.utils.urlcrack
Features:
Parse and manipulate URLs effortlessly.
Supports URLs with or without schemes.
Easily update host, port, query, and other components.
Note
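This module handles scheme-less URLs more reliably than urllib and similar packages. For example, urllib.parse.urlsplit leaves the host of a URL such as digreatbrian.tech/some/path inside the path component, because no scheme (e.g., https) precedes it.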
Example Usage:
from duck.utils.urlcrack import URL
url_obj = URL('digreatbrian.tech/some/path?query=something#resource')
# Manipulate the URL object
url_obj.host = "new_site.com"
url_obj.port = 1234 # Set port to None to remove it
print(url_obj.to_str())
# Output: new_site.com:1234/some/path?query=something#resource
URLCrack - A lightweight module providing a robust URL class for parsing and manipulating URLs without relying on the urllib module.
This module handles URLs gracefully, even those without a scheme, addressing limitations found in urllib.parse and similar libraries.
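The limitation mentioned above can be reproduced with the standard library alone. The snippet below is a minimal sketch that only uses urllib.parse.urlsplit on the scheme-less URL from the usage example; it does not call this module, so it illustrates the problem rather than the module's own output.

```python
from urllib.parse import urlsplit

# The same scheme-less URL used in the usage example above.
parts = urlsplit("digreatbrian.tech/some/path?query=something#resource")

# Without a scheme or a leading "//", urlsplit cannot tell that the first
# segment is a host, so the host ends up inside the path.
print(repr(parts.netloc))  # ''
print(parts.path)          # digreatbrian.tech/some/path
print(parts.query)         # query=something
print(parts.fragment)      # resource

# Prefixing "//" marks the authority explicitly and yields the expected split.
fixed = urlsplit("//digreatbrian.tech/some/path?query=something#resource")
print(fixed.netloc)        # digreatbrian.tech
print(fixed.path)          # /some/path
```

Callers who stay with urllib therefore have to detect and patch scheme-less input themselves, which is the step this module is described as handling.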
Module Contents
Classes
URL | Lightweight URL class for manipulating and parsing URLs.
Functions
Joins paths so that every component is included in the final path, unlike os.path.join, which discards earlier components when a later one is absolute.
Data
API
- exception duck.utils.urlcrack.InvalidPortError
Bases: Exception
Raised when the port of the URL is invalid.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- exception duck.utils.urlcrack.InvalidURLAuthorityError
Bases: Exception
Raised when the authority (netloc) of the URL is invalid.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- exception duck.utils.urlcrack.InvalidURLError
Bases: Exception
Raised when the URL is invalid or improperly formatted.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- exception duck.utils.urlcrack.InvalidURLPathError
Bases: Exception
Raised when the URL path is invalid or does not meet expected criteria.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- class duck.utils.urlcrack.URL(url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None)
Lightweight URL class for manipulating and parsing URLs.
Unlike urllib.parse and similar libraries, this class also handles URLs that have no scheme.
Initialization
- __repr__()
Returns a string representation of the URL.
- Returns:
String representation of the URL.
- Return type:
str
- __slots__
None
- build_url_string(scheme: Optional[str] = None, netloc: Optional[str] = None, path: Optional[str] = None, query: Optional[str] = None, fragment: Optional[str] = None) -> str
Converts the current URL object to string.
- property host: Optional[str]
Returns the host (excluding port) from the URL object.
- innerjoin(head_url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None) -> duck.utils.urlcrack.URL
Join the current URL with the provided head_url, and update the current URL object in place.
- Parameters:
head_url – The relative or absolute URL segment to join with the current URL.
normalize_url – Whether to normalize the url.
normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.
Behavior:
Performs a URL join operation similar to urllib.parse.urljoin.
The resulting URL replaces the current URL in this object.
Useful for modifying the current object without creating a new instance.
- Returns:
The current URL object with the updated value.
- Return type:
self
- property is_absolute: bool
Returns whether this URL is absolute.
- join(head_url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None) -> duck.utils.urlcrack.URL
Join the current URL with the provided head_url, and return a new URL object.
- Parameters:
head_url – The relative or absolute URL segment to join with the current URL.
normalize_url – Whether to normalize the url.
normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.
Behavior:
Performs a URL join operation similar to urllib.parse.urljoin.
Unlike innerjoin(), this does not modify the current object.
Returns a new instance with the resulting joined URL.
- Returns:
A new URL object with the combined URL.
- Return type:
URL
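The difference between join() and innerjoin() is one of object identity: the former returns a new instance, the latter mutates and returns the same one. The sketch below illustrates only that contract. TinyURL is a hypothetical stand-in, not duck.utils.urlcrack.URL, and it delegates to urllib.parse.urljoin because the behavior is documented as similar; the real class may produce different strings.

```python
from urllib.parse import urljoin


class TinyURL:
    """Hypothetical stand-in used only to show the join/innerjoin contract."""

    def __init__(self, url: str) -> None:
        self.url = url

    def join(self, head_url: str) -> "TinyURL":
        # Leave this object untouched and hand back a fresh instance.
        return TinyURL(urljoin(self.url, head_url))

    def innerjoin(self, head_url: str) -> "TinyURL":
        # Overwrite this object's URL and return the same object.
        self.url = urljoin(self.url, head_url)
        return self


base = TinyURL("https://example.com/docs/intro")

joined = base.join("setup")
print(joined.url)            # https://example.com/docs/setup
print(base.url)              # https://example.com/docs/intro (unchanged)

same = base.innerjoin("setup")
print(same is base)          # True
print(base.url)              # https://example.com/docs/setup
```

Returning self from the in-place variant allows chained calls, at the cost of making it easy to mutate a URL that other code still holds a reference to.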
- classmethod normalize_url(url: str, ignore_chars: Optional[List[str]] = None)
Normalizes a URL by removing consecutive slashes, adding a leading slash, removing trailing slashes, removing disallowed characters (e.g. “<” and string quotes), replacing backslashes, and lowercasing the scheme.
- classmethod normalize_url_path(url_path: str, ignore_chars: Optional[List[str]] = None)
Normalizes the URL path.
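The path-level steps listed for normalize_url can be sketched in a few lines. This is an illustration under stated assumptions, not the module's implementation: the function name normalize_path_sketch and the UNSAFE_CHARS set are hypothetical, since the documentation does not list which characters count as unsafe.

```python
import re
from typing import List, Optional

# Assumed set of "unsafe" characters; the real list is not documented here.
UNSAFE_CHARS = frozenset('<>"\'`{}|^; ')


def normalize_path_sketch(url_path: str, ignore_chars: Optional[List[str]] = None) -> str:
    keep = set(ignore_chars or [])
    # 1. Replace backslashes with forward slashes.
    path = url_path.replace("\\", "/")
    # 2. Drop unsafe characters unless the caller asked to keep them.
    path = "".join(ch for ch in path if ch not in UNSAFE_CHARS or ch in keep)
    # 3. Collapse runs of consecutive slashes.
    path = re.sub(r"/{2,}", "/", path)
    # 4. Ensure one leading slash and no trailing slashes.
    return "/" + path.strip("/")


print(normalize_path_sketch(r"\some\\path//<x>/"))        # /some/path/x
print(normalize_path_sketch("a<b", ignore_chars=["<"]))   # /a<b
```

Stripping characters before collapsing slashes matters: removing a character that sat between two slashes can create a new run of slashes, which the later step then cleans up.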
- parse(url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None)
Parse URL from a string.
- Parameters:
normalize_url – Whether to normalize the URL, e.g. https://// \google.com>}////path?q`=some_query``; => https://google.com/path?q=some_query
normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.
Expected input:
scheme://some-site.com/path/...
scheme://some-site/...
some-site.com/...
/some-path/...
- property port: Optional[int]
Returns the port from the URL object.
- split_host_and_port(authority: str, convert_port_to_int: bool = True) -> Tuple[str, Union[str, int]]
Returns the host and port from authority (netloc).
- Parameters:
authority – The authority or netloc (usually in form ‘some-host:port’)
convert_port_to_int – Whether to automatically convert port to integer (only if port found). Defaults to True.
- Returns:
Tuple containing host and port.
- Return type:
Tuple
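Splitting an authority of the form ‘some-host:port’ can be sketched as follows. The function name split_host_and_port_sketch is hypothetical, and two choices are assumptions rather than documented behavior: returning None when no port is present, and raising ValueError for a non-numeric port (the module defines InvalidPortError, which may be what the real method raises). IPv6 literals and user info are deliberately not handled.

```python
from typing import Optional, Tuple, Union


def split_host_and_port_sketch(
    authority: str, convert_port_to_int: bool = True
) -> Tuple[str, Optional[Union[str, int]]]:
    # Split on the last colon so the host keeps everything before it.
    host, sep, port = authority.rpartition(":")
    if not sep:
        # No colon at all: the whole authority is the host (assumed: port is None).
        return authority, None
    if not port.isdigit():
        # Assumed error type; the real method may raise InvalidPortError instead.
        raise ValueError(f"invalid port: {port!r}")
    return host, int(port) if convert_port_to_int else port


print(split_host_and_port_sketch("some-host:8080"))         # ('some-host', 8080)
print(split_host_and_port_sketch("some-host:8080", False))  # ('some-host', '8080')
print(split_host_and_port_sketch("some-host"))              # ('some-host', None)
```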
- split_path_components(url_path: str) -> Tuple[str, str, str]
Returns the path components from a url path.
- Returns:
A tuple containing the path, query and fragment.
- Return type:
Tuple
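A path-with-query-and-fragment string can be separated with two partition calls. The sketch below is an assumption about how such a split may work, not the method's source; the name split_path_components_sketch is hypothetical, and missing components are returned as empty strings.

```python
from typing import Tuple


def split_path_components_sketch(url_path: str) -> Tuple[str, str, str]:
    # The fragment starts at the first "#", so peel it off first;
    # any "?" after that "#" belongs to the fragment, not the query.
    before_fragment, _, fragment = url_path.partition("#")
    # What remains splits into path and query at the first "?".
    path, _, query = before_fragment.partition("?")
    return path, query, fragment


print(split_path_components_sketch("/some/path?query=something#resource"))
# ('/some/path', 'query=something', 'resource')
print(split_path_components_sketch("/some/path"))
# ('/some/path', '', '')
```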
- split_scheme_and_authority(url: str) -> Tuple[str, str, str]
Returns the scheme, authority (netloc) and leftover (which might be the path most of the time) from a valid URL.
- Returns:
A tuple containing scheme, netloc and leftover (mostly the path).
- Return type:
Tuple
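The three-way split into scheme, authority and leftover can be sketched so that it also accepts the scheme-less input forms listed under parse(). This is a hedged illustration: the name split_scheme_and_authority_sketch is hypothetical, empty strings stand in for missing parts by assumption, and the scheme pattern follows the general URI syntax rather than anything stated in this documentation.

```python
import re
from typing import Tuple

# A scheme is a letter followed by letters, digits, "+", "." or "-".
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def split_scheme_and_authority_sketch(url: str) -> Tuple[str, str, str]:
    scheme, sep, rest = url.partition("://")
    if not sep or not _SCHEME_RE.fullmatch(scheme):
        # No usable scheme prefix (or "://" only appears later, e.g. in a query).
        scheme, rest = "", url
    # The authority ends at the first "/", "?" or "#".
    match = re.search(r"[/?#]", rest)
    if match is None:
        return scheme, rest, ""
    return scheme, rest[: match.start()], rest[match.start():]


print(split_scheme_and_authority_sketch("scheme://some-site.com/path/x"))
# ('scheme', 'some-site.com', '/path/x')
print(split_scheme_and_authority_sketch("some-site.com/x"))
# ('', 'some-site.com', '/x')
print(split_scheme_and_authority_sketch("/some-path/"))
# ('', '', '/some-path/')
```

Validating the scheme before trusting the "://" split keeps a URL embedded in a query string, such as site.com/?next=http://x, from being mistaken for the scheme separator.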
- classmethod urljoin(base_url: str, head_url: str, replace_authority: bool = False, full_path_replacement: bool = True, normalize_urls: bool = True, normalization_ignore_chars: Optional[List[str]] = None) -> str
Joins two URLs and returns the result.
Notes
If both URLs have schemes, the new URL will contain the base URL's scheme.
- Parameters:
base_url – The base URL
head_url – The URL or URL path to concatenate to the base URL.
replace_authority – Whether to replace the URL authority (netloc). If the head URL has a netloc, it becomes the final netloc; the head URL's scheme, if present, also replaces the final scheme. Defaults to False.
full_path_replacement – Whether to replace the query and fragment even if they are empty in the head URL. Defaults to True.
normalize_urls – Whether to normalize the URLs.
normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.
- property user_info: Optional[str]
Returns the user info like username@passwd in URL.
- duck.utils.urlcrack.__author__
‘Brian Musakwa’
- duck.utils.urlcrack.__email__
‘digreatbrian@gmail.com’