pyutils.collectionz package

This subpackage contains some homegrown collections that try to emulate collections included in the Python standard library (see: https://docs.python.org/3/library/collections.html#module-collections). It ends with a ‘z’ so as not to collide with the standard library package.

Submodules

pyutils.collectionz.bidict module

The pyutils.collectionz.bidict.BiDict class is a subclass of dict that implements a bidirectional dictionary. That is, it maps each key to a value in constant time and each value back to the one or more keys it is associated with in constant time. It does this by simply storing the data twice.

Sample usage:

# Initialize with a normal dict...
third_party_wierdos = BiDict({
    'prometheus-fastapi-instrumentator': 'prometheus_fastapi_instrumentator',
    'scikit-learn': 'sklearn',
    'antlr4-python3-runtime' : 'antlr4',
    'python-dateutil': 'dateutil',
    'speechrecognition': 'speech_recognition',
    'beautifulsoup4': 'bs4',
    'python-dateutil': 'dateutil',
    'homeassistant-api': 'homeassistant_api',
})

# Use in one direction:
x = third_party_wierdos['scikit-learn']

# Use in opposite direction:
y = third_party_wierdos.inverse['python_dateutil']

# Note: type(y) is List since one value may map back to multiple keys.
class pyutils.collectionz.bidict.BiDict(*args, **kwargs)[source]

Bases: dict

A class that stores both a Mapping between keys and values and also the inverse mapping between values and their keys to allow for efficient lookups in either direction. Because it is possible to have several keys with the same value, using the inverse map returns a sequence of keys.

>>> d = BiDict()
>>> d['a'] = 1
>>> d['b'] = 2
>>> d['c'] = 2
>>> d['a']
1
>>> d.inverse[1]
['a']
>>> d.inverse[2]
['b', 'c']
>>> len(d)
3
>>> del d['c']
>>> len(d)
2
>>> d.inverse[2]
['b']

pyutils.collectionz.bst module

A binary search tree implementation.

class pyutils.collectionz.bst.BinarySearchTree[source]

Bases: object

depth() int[source]
Returns:

The max height (depth) of the tree in plies (edge distance from root) in \(O(log_2 n)\) time.

Return type:

int

>>> t = BinarySearchTree()
>>> t.depth()
0
>>> t.insert(50)
>>> t.depth()
1
>>> t.insert(65)
>>> t.depth()
2
>>> t.insert(33)
>>> t.depth()
2
>>> t.insert(2)
>>> t.insert(1)
>>> t.depth()
4
get_next_node(node: Node) Node | None[source]
Parameters:

node (Node) – the node whose next greater successor is desired

Returns:

Given a tree node, returns the next greater node in the tree. If the given node is the greatest node in the tree, returns None.

Return type:

Node | None

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> t.insert(23)
>>> t
50
├──25
│  └──22
│     ├──13
│     └──23
└──75
   └──66
>>> n = t[23]
>>> t.get_next_node(n).value
25
>>> n = t[50]
>>> t.get_next_node(n).value
66
>>> n = t[75]
>>> t.get_next_node(n) is None
True
get_nodes_in_range_inclusive(lower: Comparable, upper: Comparable) Generator[Node, None, None][source]
Parameters:
  • lower (Comparable) – the lower bound of the desired range.

  • upper (Comparable) – the upper bound of the desired range.

Returns:

Generates a sequence of nodes in the desired range.

Return type:

Generator[Node, None, None]

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> t.insert(23)
>>> t
50
├──25
│  └──22
│     ├──13
│     └──23
└──75
   └──66
>>> for node in t.get_nodes_in_range_inclusive(21, 74):
...     print(node.value)
22
23
25
50
66
get_root() Node | None[source]
Returns:

The root of the BST

Return type:

Node | None

height() int[source]

Returns the height (i.e. max depth) of the tree

Return type:

int

insert(value: Comparable) None[source]

Insert something into the tree in \(O(log_2 n)\) time.

Parameters:

value (Comparable) – the value to be inserted.

Return type:

None

>>> t = BinarySearchTree()
>>> t.insert(10)
>>> t.insert(20)
>>> t.insert(5)
>>> len(t)
3
>>> t.get_root().value
10
iterate_inorder()[source]
Returns:

A Generator that yield the tree’s items in a preorder traversal sequence.

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> t.insert(24)
>>> t
50
├──25
│  └──22
│     ├──13
│     └──24
└──75
   └──66
>>> for value in t.iterate_inorder():
...     print(value)
13
22
24
25
50
66
75
iterate_leaves()[source]
Returns:

A Generator that yields only the leaf nodes in the tree.

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> for value in t.iterate_leaves():
...     print(value)
13
66
iterate_nodes_by_depth(depth: int) Generator[Node, None, None][source]
Parameters:

depth (int) – the desired depth

Returns:

A Generator that yields nodes at the prescribed depth in the tree.

Return type:

Generator[Node, None, None]

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> for value in t.iterate_nodes_by_depth(2):
...     print(value)
22
66
>>> for value in t.iterate_nodes_by_depth(3):
...     print(value)
13
iterate_postorder()[source]
Returns:

A Generator that yield the tree’s items in a preorder traversal sequence.

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> for value in t.iterate_postorder():
...     print(value)
13
22
25
66
75
50
iterate_preorder()[source]
Returns:

A Generator that yields the tree’s items in a preorder traversal sequence.

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(66)
>>> t.insert(22)
>>> t.insert(13)
>>> for value in t.iterate_preorder():
...     print(value)
50
25
22
13
75
66
parent_path(node: Node) List[Node | None][source]

Get a node’s parent path in \(O(log_2 n)\) time.

Parameters:

node (Node) – the node whose parent path should be returned.

Returns:

a list of nodes representing the path from the tree’s root to the given node.

Return type:

List[Node | None]

Note

If the node does not exist in the tree, the last element on the path will be None but the path will indicate the ancestor path of that node were it to be inserted.

>>> t = BinarySearchTree()
>>> t.insert(50)
>>> t.insert(75)
>>> t.insert(25)
>>> t.insert(12)
>>> t.insert(33)
>>> t.insert(4)
>>> t.insert(88)
>>> t
50
├──25
│  ├──12
│  │  └──4
│  └──33
└──75
   └──88
>>> n = t[4]
>>> for x in t.parent_path(n):
...     print(x.value)
50
25
12
4
>>> del t[4]
>>> for x in t.parent_path(n):
...     if x is not None:
...         print(x.value)
...     else:
...         print(x)
50
25
12
None
repr_traverse(padding: str, pointer: str, node: Node | None, has_right_sibling: bool) str[source]
Parameters:
  • padding (str) –

  • pointer (str) –

  • node (Node | None) –

  • has_right_sibling (bool) –

Return type:

str

class pyutils.collectionz.bst.Node(value: Comparable)[source]

Bases: object

A BST node. Just a left and right reference along with a value. Note that value can be anything as long as it is Comparable with other instances of itself.

Parameters:

value (Comparable) – a reference to the value of the node. Must be Comparable to other values.

pyutils.collectionz.interval_tree module

This is an augmented interval tree for storing ranges and identifying overlaps as described by: https://en.wikipedia.org/wiki/Interval_tree.

class pyutils.collectionz.interval_tree.AugmentedIntervalTree[source]

Bases: BinarySearchTree

find_all_overlaps(to_find: NumericRange) Generator[NumericRange, None, None][source]

Yields ranges previously added to the tree that overlaps with to_find argument.

Parameters:

to_find (NumericRange) – the interval with which to find all overlaps.

Returns:

A (potentially empty) sequence of all ranges in the tree that overlap with the argument.

Return type:

Generator[NumericRange, None, None]

>>> tree = AugmentedIntervalTree()
>>> tree.insert(NumericRange(20, 24))
>>> tree.insert(NumericRange(18, 22))
>>> tree.insert(NumericRange(14, 16))
>>> tree.insert(NumericRange(1, 30))
>>> tree.insert(NumericRange(25, 30))
>>> tree.insert(NumericRange(29, 33))
>>> tree.insert(NumericRange(5, 12))
>>> tree.insert(NumericRange(1, 6))
>>> tree.insert(NumericRange(13, 18))
>>> tree.insert(NumericRange(16, 28))
>>> tree.insert(NumericRange(21, 27))
>>> for x in tree.find_all_overlaps(NumericRange(19, 21)):
...     print(x)
[20..24]
[18..22]
[1..30]
[16..28]
[21..27]
>>> del tree[NumericRange(1, 30)]
>>> for x in tree.find_all_overlaps(NumericRange(19, 21)):
...     print(x)
[20..24]
[18..22]
[16..28]
[21..27]
find_one_overlap(to_find: NumericRange) NumericRange | None[source]

Identify and return one overlapping node from the tree.

Parameters:

to_find (NumericRange) – the interval with which to find an overlap.

Returns:

An overlapping range from the tree or None if no such ranges exist in the tree at present.

Return type:

NumericRange | None

>>> tree = AugmentedIntervalTree()
>>> tree.insert(NumericRange(20, 24))
>>> tree.insert(NumericRange(18, 22))
>>> tree.insert(NumericRange(14, 16))
>>> tree.insert(NumericRange(1, 30))
>>> tree.insert(NumericRange(25, 30))
>>> tree.insert(NumericRange(29, 33))
>>> tree.insert(NumericRange(5, 12))
>>> tree.insert(NumericRange(1, 6))
>>> tree.insert(NumericRange(13, 18))
>>> tree.insert(NumericRange(16, 28))
>>> tree.insert(NumericRange(21, 27))
>>> tree.find_one_overlap(NumericRange(6, 7))
[1..30]
class pyutils.collectionz.interval_tree.NumericRange(low: int | float, high: int | float)[source]

Bases: Comparable

Essentially a tuple of numbers denoting a range with some added helper methods on it.

Creates a NumericRange.

Parameters:
  • low (Numeric) – the lowest point in the range (inclusive).

  • high (Numeric) – the highest point in the range (inclusive).

Warning

If low > high this code swaps the parameters and keeps the range rather than raising.

overlaps_with(other: NumericRange) bool[source]
Returns:

True if this NumericRange overlaps with other, else False.

Parameters:

other (NumericRange) –

Return type:

bool

pyutils.collectionz.shared_dict module

The shared_dict.SharedDict class is a normal python dictionary that can be accessed safely in parallel from multiple threads or processes without (external) locking by using Multiprocessing.SharedMemory. It uses internal locking and rewrites the shared memory region as it is changed so it is slower than a normal dict. It also does not grow dynamically; the creator of the shared_dict must declare a maximum size.

The MIT License (MIT)

Copyright (c) 2020 LuizaLabs

Additions/Modifications Copyright (c) 2022 Scott Gasch

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This class is based on https://github.com/luizalabs/shared-memory-dict. For details about what is preserved from the original and what was changed by Scott, see NOTICE at the root of this module.

class pyutils.collectionz.shared_dict.PickleSerializer[source]

Bases: object

A serializer that uses pickling. Used to read/write bytes in the shared memory region and interpret them as a dict.

dumps(obj: Dict[Hashable, Any]) bytes[source]
Parameters:

obj (Dict[Hashable, Any]) –

Return type:

bytes

loads(data: bytes) Dict[Hashable, Any][source]
Parameters:

data (bytes) –

Return type:

Dict[Hashable, Any]

class pyutils.collectionz.shared_dict.SharedDict(name: str | None = None, size_bytes: int | None = None)[source]

Bases: Closable

This class emulates the dict container but uses a Multiprocessing.SharedMemory region to back the dict such that it can be read and written by multiple independent processes at the same time. Because it constantly de/re-serializes the dict, it is much slower than a normal dict.

Example usage… one process should set up the shared memory:

from pyutils.collectionz.shared_dict import SharedDict

shared_memory_id = 'SharedDictIdentifier'
shared_memory_size_bytes = 4096
shared_memory = SharedDict(shared_memory_id, shared_memory_size_bytes)

Other processes can then attach to the shared memory by referencing its name. Don’t try to pass the SharedDict itself to a child process. Rather, just pass its name string. You can create child processes any way that Python supports. The wordle example uses the parallelize framework with SharedDict but a simple subprocess.run, exec_utils, ProcessExecutor, whatever:

from pyutils import exec_utils

processes = []
for i in range(10):
    processes.append(
        exec_utils.cmd_in_background(
            f'myhelper.py --number {i} --shared_memory={shared_memory_id}'
        )
    )

In the child process, attach the already created SharedDict using its name. A size is not necessary when attaching to an already created shared memory region – it cannot be resized after creation. The name must be the same exact name that was used to create it originally:

from pyutils.collectionz.shared_dict import SharedDict

shared_memory_id = config.config['shared_memory']
shared_memory = SharedDict(shared_memory_id)

The children processes (and parent process, also) can now just use the shared memory like a normal dict:

if shared_memory[work_id] is None:
    result = do_expensive_work(work_id)
    shared_memory[work_id] = result

Note

It’s pretty slow to mutate data in the shared memory. First, it needs to acquire an exclusive lock. Second, it essentially pickles an entire dict into the shared memory region. So this is not a data structure that is going to win awards for speed. But it is a very convenient way to have a shared cache, for example. See the wordle example for a real life program using SharedDict this way. It basically saves the result of large computations in a SharedDict thereby allowing all threads to avoid recomputing that same expensive computation. In this scenario the slowness of the dict writes are more than paid for by the avoidence of duplicated, expensive work.

Finally, someone (likely the main process) should call the cleanup() method when the shared memory region is no longer needed:

shared_memory.cleanup()

See also the shared_dict_test.py for an example of using this class.

Creates or attaches a shared dictionary back by a SharedMemory buffer. For create semantics, a unique name (string) and a max dictionary size (expressed in bytes) must be provided. For attach semantics size is ignored.

Warning

Size is ignored on attach operations. The size of the shared memory region cannot be changed once it has been created.

The first process that creates the SharedDict is responsible for (optionally) naming it and deciding the max size (in bytes) that it may be. It does this via args to the c’tor.

Subsequent processes may safely the size arg.

Parameters:
  • name (str | None) – the name of the shared dict, only required for initial caller

  • size_bytes (int | None) – the maximum size of data storable in the shared dict, only required for the first caller.

LOCK = <RLock(None, 0)>
NULL_BYTE = b'\x00'
cleanup() None[source]

Unlink the shared dict and memory behind it. Only the last process should invoke this. Not called automatically.

Return type:

None

clear() None[source]

Clears the shared dict.

Return type:

None

close() None[source]

Unmap the shared dict and memory behind it from this process. Called by automatically __del__().

Return type:

None

copy() Dict[Hashable, Any][source]
Returns:

A shallow copy of the shared dict.

Return type:

Dict[Hashable, Any]

get(key: str, default: Any | None = None) Any[source]
Parameters:
  • key (str) – the key to lookup

  • default (Any | None) – the value returned if key is not present

Returns:

The value associated with key or a default.

Return type:

Any

get_name()[source]
Returns:

The name of the shared memory buffer backing the dict.

items() ItemsView[Hashable, Any][source]
Return type:

ItemsView[Hashable, Any]

keys() KeysView[Hashable][source]
Return type:

KeysView[Hashable]

pop(key: Hashable, default: Any | None = None) Any[source]

Remove and return the value associated with key or a default

Parameters:
  • key (Hashable) –

  • default (Any | None) –

Return type:

Any

popitem() Tuple[Hashable, Any][source]

Remove and return the last added item.

Return type:

Tuple[Hashable, Any]

setdefault(key: Hashable, default: Any | None = None)[source]
Parameters:
  • key (Hashable) –

  • default (Any | None) –

update(other=(), /, **kwds)[source]
values() ValuesView[Any][source]
Return type:

ValuesView[Any]

pyutils.collectionz.trie module

This module contains the implementation of a Trie tree (or prefix tree). See: https://en.wikipedia.org/wiki/Trie.

It can be used with arbitrary sequences as keys and stores its values in a tree with paths determined by the sequence determined by each keys’ sequences. Thus, it can determine whether a given value is contained in the tree via a simple traversal in \(O(n)\) where n is the number of steps in the item’s sequence and can also check whether a key-prefix is present in the tree in \(O(n)\) time.

Given a node in the BST, it is easy to determine all items that are stored beneath that node. See examples below.

class pyutils.collectionz.trie.Trie[source]

Bases: object

This is a Trie class, see: https://en.wikipedia.org/wiki/Trie.

It attempts to follow Pythonic container patterns. See doctests for examples.

Create an empty trie.

contains_prefix(item: Sequence[Any])[source]

Check whether a prefix is in the Trie. The prefix may or may not be a full item.

Parameters:

item (Sequence[Any]) – the item describing the prefix to be checked.

Returns:

True if the prefix described by item is present, False otherwise.

>>> t = Trie()
>>> t.insert('tricycle')
>>> t.contains_prefix('tri')
True
>>> t.contains_prefix('tricycle')
True
>>> t.contains_prefix('triad')
False
delete_recursively(node: Dict[Any, Any], item: Sequence[Any]) bool[source]

Deletes an item from the trie by walking the path from root to where it ends.

Parameters:
  • node (Dict[Any, Any]) – root under which to search for item

  • item (Sequence[Any]) – item whose node is the root of the recursive deletion operation

Returns:

True if the item was not the prefix of another item such that there is nothing below item remaining anymore post delete and False if the deleted item was a proper prefix of another item in the tree such that there is still data below item remaining in the tree.

Return type:

bool

generate_recursively(node, path: Sequence[Any])[source]
Returns:

A generator that yields the trie’s items one at a time.

Parameters:

path (Sequence[Any]) –

>>> t = Trie()
>>> t.insert('test')
>>> t.insert('tickle')
>>> for item in t.generate_recursively(t.root, ''):
...     print(item)
test
tickle
insert(item: Sequence[Any]) None[source]

Insert an item into the trie. Items are represented by a Sequence where each item in the sequence describes a set in the path to the item.

Parameters:

item (Sequence[Any]) – the item to be inserted.

Return type:

None

>>> t = Trie()
>>> t.insert('test')
>>> t.__contains__('test')
True
repr_brief(node: Dict[Any, Any], delimiter: str)[source]

A friendly string representation of the contents of the Trie.

Parameters:
  • node (Dict[Any, Any]) – the root of the trie to represent.

  • delimiter (str) – character or string to stuff between edges.

Returns:

A brief string representation of the trie. See example.

>>> t = Trie()
>>> t.insert([10, 0, 0, 1])
>>> t.insert([10, 0, 0, 2])
>>> t.insert([10, 10, 10, 1])
>>> t.insert([10, 10, 10, 2])
>>> t.repr_brief(t.root, '.')
'10.[0.0.[1,2],10.10.[1,2]]'
successors(item: Sequence[Any])[source]
Returns:

A list of the successors of an item.

Parameters:

item (Sequence[Any]) –

>>> t = Trie()
>>> t.insert('what')
>>> t.insert('who')
>>> t.insert('when')
>>> t.successors('wh')
['a', 'o', 'e']
>>> u = Trie()
>>> u.insert(['this', 'is', 'a', 'test'])
>>> u.insert(['this', 'is', 'a', 'robbery'])
>>> u.insert(['this', 'is', 'a', 'walrus'])
>>> u.successors(['this', 'is', 'a'])
['test', 'robbery', 'walrus']

Module contents