duck.utils.urlcrack

Features:

  • Parse and manipulate URLs effortlessly.

  • Supports URLs with or without schemes.

  • Easily update host, port, query, and other components.

Note

This module is more reliable than urllib and similar packages, which often struggle to handle URLs that lack a scheme (e.g., https://).
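The limitation mentioned in this note can be reproduced with the standard library alone. The following is a minimal sketch that feeds the scheme-less URL used on this page to urllib.parse.urlsplit; it only shows urllib's behavior and does not use urlcrack itself.

```python
from urllib.parse import urlsplit

# A URL without a scheme, as used in the usage example on this page.
raw = "digreatbrian.tech/some/path?query=something#resource"

parts = urlsplit(raw)

# urllib only recognizes an authority that follows "//", so the host
# ends up as the first segment of the path and netloc stays empty.
print(repr(parts.netloc))  # ''
print(parts.path)          # digreatbrian.tech/some/path
print(parts.hostname)      # None
```

The query and fragment are still split off correctly; only the host is lost, which is the case the URL class is meant to cover.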

Example Usage:

from urlcrack import URL

url_obj = URL('digreatbrian.tech/some/path?query=something#resource')

# Manipulate the URL object
url_obj.host = "new_site.com"
url_obj.port = 1234  # Set port to None to remove it

print(url_obj.to_str())
# Output: new_site.com:1234/some/path?query=something#resource

Author:

Brian Musakwa digreatbrian@gmail.com

URLCrack - A lightweight module providing a robust URL class for parsing and manipulating URLs without relying on the urllib module.

This module handles URLs gracefully, even those without a scheme, addressing limitations found in urllib.parse and similar libraries.

Module Contents

Classes

URL

Lightweight URL class for manipulating and parsing URLs.

Functions

joinpaths

Returns the joined paths, ensuring every path is included in the final result, unlike os.path.join (which discards earlier components when a later one is absolute).

Data

__author__

__email__

API

exception duck.utils.urlcrack.InvalidPortError[source]

Bases: Exception

Raised when the port of the URL is invalid.

Initialization

Initialize self. See help(type(self)) for accurate signature.

exception duck.utils.urlcrack.InvalidURLAuthorityError[source]

Bases: Exception

Raised when the authority (netloc) of the URL is invalid.

Initialization

Initialize self. See help(type(self)) for accurate signature.

exception duck.utils.urlcrack.InvalidURLError[source]

Bases: Exception

Raised when the URL is invalid or improperly formatted.

Initialization

Initialize self. See help(type(self)) for accurate signature.

exception duck.utils.urlcrack.InvalidURLPathError[source]

Bases: Exception

Raised when the URL path is invalid or does not meet expected criteria.

Initialization

Initialize self. See help(type(self)) for accurate signature.

class duck.utils.urlcrack.URL(url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None)[source]

Lightweight URL class for manipulating and parsing URLs.

Unlike urllib.parse and similar libraries, this class also works on URLs without a scheme.

Initialization

__repr__()[source]

Returns a string representation of the URL.

Returns:

String representation of the URL.

Return type:

str

__slots__

None

build_url_string(scheme: Optional[str] = None, netloc: Optional[str] = None, path: Optional[str] = None, query: Optional[str] = None, fragment: Optional[str] = None) str[source]

Converts the current URL object to string.

property host: Optional[str]

Returns the host (excluding port) from the URL object.

innerjoin(head_url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None) duck.utils.urlcrack.URL[source]

Join the current URL with the provided head_url, and update the current URL object in-place.

Parameters:
  • head_url – The relative or absolute URL segment to join with the current URL.

  • normalize_url – Whether to normalize the url.

  • normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.

Behavior:

  • Performs a URL join operation similar to urllib.parse.urljoin.

  • The resulting URL replaces the current URL in this object.

  • Useful for modifying the current object without creating a new instance.

Returns:

The current URL object with the updated value.

Return type:

self

property is_absolute: bool

Returns whether this URL is an absolute URL.

join(head_url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None) duck.utils.urlcrack.URL[source]

Join the current URL with the provided head_url, and return a new URL object.

Parameters:
  • head_url – The relative or absolute URL segment to join with the current URL.

  • normalize_url – Whether to normalize the url.

  • normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.

Behavior:

  • Performs a URL join operation similar to urllib.parse.urljoin.

  • Unlike innerjoin(), this does not modify the current object.

  • Returns a new instance with the resulting joined URL.

Returns:

A new URL object with the combined URL.

Return type:

URL
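Both join() and innerjoin() are described as similar to urllib.parse.urljoin, so the standard-library function gives a rough idea of what a join does. The sketch below uses urllib only, with example.com as a placeholder host; URL.join() and URL.innerjoin() are only documented as similar, so their results may differ in edge cases.

```python
from urllib.parse import urljoin

base = "https://example.com/docs/guide/intro"

# A relative segment replaces the last path segment of the base.
relative = urljoin(base, "setup")

# A segment starting with "/" replaces the whole base path.
rooted = urljoin(base, "/about")

# An absolute URL replaces the base entirely.
absolute = urljoin(base, "https://other.example/x")

print(relative)  # https://example.com/docs/guide/setup
print(rooted)    # https://example.com/about
print(absolute)  # https://other.example/x
```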

classmethod normalize_url(url: str, ignore_chars: Optional[List[str]] = None)[source]

Normalizes a URL by removing consecutive slashes, adding a leading slash, removing trailing slashes, removing disallowed characters (e.g. “<” and quote characters), replacing backslashes, and lowercasing the scheme.

classmethod normalize_url_path(url_path: str, ignore_chars: Optional[List[str]] = None)[source]

Normalizes the URL path.
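The path-related normalization steps listed above can be sketched as a small standalone function. This is not the library's implementation: the name normalize_path_sketch and the UNSAFE_CHARS set are assumptions made for illustration, since the actual list of stripped characters is not documented here.

```python
import re

# Hypothetical set of characters to strip; the library's real list may differ.
UNSAFE_CHARS = set('<>"\'`{}| ')


def normalize_path_sketch(path, ignore_chars=None):
    """Illustrative path normalization, not urlcrack's own code."""
    keep = set(ignore_chars or [])
    # Replace backslashes with forward slashes.
    path = path.replace("\\", "/")
    # Drop unsafe characters unless the caller asked to keep them.
    path = "".join(ch for ch in path if ch not in UNSAFE_CHARS or ch in keep)
    # Collapse runs of slashes into a single slash.
    path = re.sub(r"/{2,}", "/", path)
    # Ensure exactly one leading slash and no trailing slash.
    return "/" + path.strip("/")


print(normalize_path_sketch("some\\path//to///page/"))  # /some/path/to/page
print(normalize_path_sketch("/a<b>/c"))                 # /ab/c
print(normalize_path_sketch("/a<b>/c", ["<"]))          # /a<b/c
```

Unsafe characters are removed before slashes are collapsed, so a segment that consists only of stripped characters does not leave a double slash behind.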

parse(url: str, normalize_url: bool = True, normalization_ignore_chars: Optional[List[str]] = None)[source]

Parse URL from a string.

Parameters:
  • normalize_url – Whether to normalize the URL e.g: https://// \google.com>}////path?q`=some_query``; => https://google.com/path?q=some_query

  • normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.

Expected input:

scheme://some-site.com/path/...
scheme://some-site/...
some-site.com/...
/some-path/...

property port: Optional[int]

Returns the port from the URL object.

split_host_and_port(authority: str, convert_port_to_int: bool = True) Tuple[str, Union[str, int]][source]

Returns the host and port from authority (netloc).

Parameters:
  • authority – The authority or netloc (usually in form ‘some-host:port’)

  • convert_port_to_int – Whether to automatically convert port to integer (only if port found). Defaults to True.

Returns:

Tuple containing host and port.

Return type:

Tuple
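The split described above can be approximated in a few lines. The sketch below is a hypothetical standalone function, not the method itself: returning None when no port is present and raising ValueError (rather than InvalidPortError) are assumptions, and bracketed IPv6 literals are not handled.

```python
from typing import Optional, Tuple, Union


def split_host_port_sketch(
    authority: str, convert_port_to_int: bool = True
) -> Tuple[str, Optional[Union[str, int]]]:
    """Illustrative host/port split, not urlcrack's own code."""
    # Split on the last colon so the port is the final component.
    host, sep, port = authority.rpartition(":")
    if not sep:
        # No colon: the whole authority is the host (assumed: no port).
        return authority, None
    if not port.isdigit():
        # Assumption: the real method reports this with InvalidPortError.
        raise ValueError(f"invalid port: {port!r}")
    return host, int(port) if convert_port_to_int else port


print(split_host_port_sketch("some-host:8080"))         # ('some-host', 8080)
print(split_host_port_sketch("some-host:8080", False))  # ('some-host', '8080')
print(split_host_port_sketch("some-host"))              # ('some-host', None)
```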

split_path_components(url_path: str) Tuple[str, str, str][source]

Returns the path components from a url path.

Returns:

A tuple containing the path, query, and fragment.

Return type:

Tuple
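As a rough illustration of the three components returned here, the sketch below splits a URL path with str.partition. The function name split_path_sketch is hypothetical, and the real method may treat edge cases differently.

```python
from typing import Tuple


def split_path_sketch(url_path: str) -> Tuple[str, str, str]:
    """Illustrative path/query/fragment split, not urlcrack's own code."""
    # The fragment starts at the first "#"; everything after it belongs to it.
    before_fragment, _, fragment = url_path.partition("#")
    # The query starts at the first "?" in what remains.
    path, _, query = before_fragment.partition("?")
    return path, query, fragment


print(split_path_sketch("/some/path?query=something#resource"))
# ('/some/path', 'query=something', 'resource')
```

Splitting off the fragment first keeps a "?" that appears inside the fragment from being mistaken for the start of the query.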

split_scheme_and_authority(url: str) Tuple[str, str, str][source]

Returns the scheme, authority (netloc), and leftover (usually the path) from a valid URL.

Returns:

A tuple containing scheme, netloc and leftover (mostly the path).

Return type:

Tuple
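To make the three return values concrete, here is a minimal sketch that handles the input forms listed under parse(). It is a hypothetical standalone function written for illustration, with the assumption that a leading slash means a bare path with no authority; the real method may apply different rules.

```python
import re
from typing import Tuple


def split_scheme_authority_sketch(url: str) -> Tuple[str, str, str]:
    """Illustrative scheme/authority split, not urlcrack's own code."""
    scheme, rest = "", url
    # A scheme is a letter followed by letters, digits, "+", "." or "-".
    match = re.match(r"([A-Za-z][A-Za-z0-9+.\-]*)://", url)
    if match:
        scheme, rest = match.group(1), url[match.end():]
    if rest.startswith("/"):
        # Assumption: a leading slash means a bare path without authority.
        return scheme, "", rest
    # The authority runs up to the first "/", "?" or "#".
    for index, char in enumerate(rest):
        if char in "/?#":
            return scheme, rest[:index], rest[index:]
    return scheme, rest, ""


print(split_scheme_authority_sketch("https://some-site.com/path/x"))
# ('https', 'some-site.com', '/path/x')
print(split_scheme_authority_sketch("some-site.com/a"))
# ('', 'some-site.com', '/a')
print(split_scheme_authority_sketch("/some-path/a"))
# ('', '', '/some-path/a')
```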

to_str() str[source]

Returns the URL object as a string.

classmethod urljoin(base_url: str, head_url: str, replace_authority: bool = False, full_path_replacement: bool = True, normalize_urls: bool = True, normalization_ignore_chars: Optional[List[str]] = None) str[source]

Joins two URLs and returns the result.

Note

If both URLs have schemes, the new URL will contain the base URL's scheme.

Parameters:
  • base_url – The base URL

  • head_url – The URL or URL path to concatenate to the base URL

  • replace_authority – Whether to replace the URL authority (netloc). If the head URL has a netloc, it becomes the final netloc; the head URL's scheme, if present, also replaces the final scheme. Defaults to False.

  • full_path_replacement – Whether to replace the query and fragment even if they are empty in the head URL. Defaults to True.

  • normalize_urls – Whether to normalize the URLs.

  • normalization_ignore_chars – List of characters to ignore when normalizing the url path. By default, all unsafe characters are stripped.

property user_info: Optional[str]

Returns the user info component of the URL (for example, the username:password part that precedes “@” in the authority).

duck.utils.urlcrack.__author__

‘Brian Musakwa’

duck.utils.urlcrack.__email__

‘digreatbrian@gmail.com’

duck.utils.urlcrack.joinpaths(path1: str, path2: str, *more)[source]

Returns the joined paths, ensuring every path is included in the final result, unlike os.path.join (which discards earlier components when a later one is absolute).
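The difference from os.path.join can be seen with the standard library. The sketch below uses posixpath.join (the POSIX flavor of os.path.join) to show the discarding behavior, followed by a hypothetical joinpaths_sketch that keeps every component. It is not the library's implementation, and details such as how trailing slashes are treated are assumptions.

```python
import posixpath


def joinpaths_sketch(path1, path2, *more):
    """Illustrative join that keeps every component, not urlcrack's own code."""
    parts = [path1, path2, *more]
    # Preserve a leading slash from the first component only.
    leading = "/" if path1.startswith("/") else ""
    # Strip slashes at the joints so no later component looks absolute.
    segments = [part.strip("/") for part in parts]
    return leading + "/".join(segment for segment in segments if segment)


# The standard join restarts at the absolute component and drops "/static".
print(posixpath.join("/static", "/css/site.css"))   # /css/site.css

# The sketch keeps both components.
print(joinpaths_sketch("/static", "/css/site.css"))  # /static/css/site.css
print(joinpaths_sketch("a", "b/", "/c"))             # a/b/c
```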