pyutils.search package
Submodules
pyutils.search.logical_search module
This module creates and searches a corpus of documents. The corpus and index are held in memory. The query language supports AND, OR, NOT, and parentheses for flexible search semantics.
- class pyutils.search.logical_search.Corpus[source]
Bases:
object
A collection of searchable documents. The caller can add documents to it (or edit existing docs) via add_doc(), retrieve a document given its docid via get_doc(), and perform various lookups of documents. The most interesting lookup is implemented in query().
>>> c = Corpus()
>>> c.add_doc(Document(
...                    docid=1,
...                    tags=set(['urgent', 'important']),
...                    properties=[
...                        ('author', 'Scott'),
...                        ('subject', 'your anniversary')
...                    ],
...                    reference=None,
...                   )
...          )
>>> c.add_doc(Document(
...                    docid=2,
...                    tags=set(['important']),
...                    properties=[
...                        ('author', 'Joe'),
...                        ('subject', 'your performance at work')
...                    ],
...                    reference=None,
...                   )
...          )
>>> c.add_doc(Document(
...                    docid=3,
...                    tags=set(['urgent']),
...                    properties=[
...                        ('author', 'Scott'),
...                        ('subject', 'car turning in front of you')
...                    ],
...                    reference=None,
...                   )
...          )
>>> c.query('author:Scott and important')
{1}
>>> c.query('*')
{1, 2, 3}
>>> c.query('*:*')
{1, 2, 3}
>>> c.query('*:Scott')
{1, 3}
- add_doc(doc: Document) None [source]
Add a new Document to the Corpus. Each Document must have a distinct docid that will serve as its primary identifier. If the same Document is added multiple times, only the most recent addition is indexed. If two distinct documents with the same docid are added, the latter clobbers the former in the indexes. See python_modules.id_generator.get() or python_modules.string_utils.generate_uuid() for potential sources of docids.
Each Document may have an optional set of tags which can be used later in expressions to the query method. These are simple text labels.
Each Document may have an optional list of key->value tuples which can be used later in expressions to the query method.
Document includes a user-defined “reference” field which is never interpreted by this module. This is meant to allow easy mapping between Documents in this corpus and external objects they may represent.
- Parameters:
doc (Document) – the document to add or edit
- Return type:
None
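The replacement behavior described for add_doc can be illustrated with a minimal in-memory inverted index. This is only a sketch under stated assumptions: MiniDoc and MiniCorpus are hypothetical stand-ins invented here, and the index layout is a guess, not the actual pyutils implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set, Tuple


@dataclass
class MiniDoc:
    """Hypothetical stand-in for Document, with the same four fields."""

    docid: str = ""
    tags: Set[str] = field(default_factory=set)
    properties: List[Tuple[str, str]] = field(default_factory=list)
    reference: Optional[Any] = None


class MiniCorpus:
    """Hypothetical stand-in for Corpus; not the pyutils implementation."""

    def __init__(self) -> None:
        self.docs: Dict[str, MiniDoc] = {}
        self.by_tag: Dict[str, Set[str]] = defaultdict(set)
        self.by_property: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add_doc(self, doc: MiniDoc) -> None:
        # If the docid is already known, drop the old index entries first so
        # that only the most recent addition stays searchable.
        old = self.docs.get(doc.docid)
        if old is not None:
            for tag in old.tags:
                self.by_tag[tag].discard(old.docid)
            for key, value in old.properties:
                self.by_property[(key, value)].discard(old.docid)
        # Store the document and index its tags and properties.
        self.docs[doc.docid] = doc
        for tag in doc.tags:
            self.by_tag[tag].add(doc.docid)
        for key, value in doc.properties:
            self.by_property[(key, value)].add(doc.docid)


corpus = MiniCorpus()
corpus.add_doc(MiniDoc(docid="1", tags={"urgent"}, properties=[("author", "Scott")]))
# Same docid again: the second document replaces the first in the indexes.
corpus.add_doc(MiniDoc(docid="1", tags={"important"}, properties=[("author", "Joe")]))
```

After the second call, docid "1" is reachable only through the tag and property of the later document.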
- get_doc(docid: str) Document | None [source]
Given a docid, retrieve the previously added Document.
- Parameters:
docid (str) – the docid to retrieve
- Returns:
The Document with docid or None to indicate no match.
- Return type:
Document | None
- get_docids_by_exact_tag(tag: str) Set[str] [source]
Return the set of docids that have a particular tag.
- Parameters:
tag (str) – the tag for which to search
- Returns:
A set, possibly empty, of docids that have the provided tag.
- Return type:
Set[str]
- get_docids_by_property(key: str, value: str) Set[str] [source]
Return the set of docids that have a particular property with a particular value.
- Parameters:
key (str) – the key to search for
value (str) – the value that key must have in order to match a doc.
- Returns:
A set, possibly empty, of docids whose property key has the given value.
- Return type:
Set[str]
- get_docids_by_searching_tags(tag: str) Set[str] [source]
Return the set of docids that have a tag containing the given string.
- Parameters:
tag (str) – the tag pattern for which to search
- Returns:
A set containing docids with tags that match the pattern provided. For example, if the argument is “foo”, the tags “football”, “foobar”, and “food” all match.
- Return type:
Set[str]
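The “foo” example above suggests substring matching against tag names. The sketch below assumes exactly that; the tag_index layout and the docids_by_searching_tags helper are hypothetical and are not taken from the pyutils source.

```python
from typing import Dict, Set

# Hypothetical tag index: tag -> docids carrying that tag.
tag_index: Dict[str, Set[str]] = {
    "football": {"1"},
    "foobar": {"2"},
    "food": {"2", "3"},
    "urgent": {"4"},
}


def docids_by_searching_tags(index: Dict[str, Set[str]], pattern: str) -> Set[str]:
    """Union the docids of every tag that contains `pattern` as a substring."""
    matches: Set[str] = set()
    for tag, docids in index.items():
        if pattern in tag:
            matches |= docids
    return matches


foo_matches = docids_by_searching_tags(tag_index, "foo")
no_matches = docids_by_searching_tags(tag_index, "zzz")
```

Here "foo" matches the first three tags but not "urgent", and a pattern that matches no tag yields an empty set.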
- get_docids_with_property(key: str) Set[str] [source]
Return the set of docids that have a particular property no matter what that property’s value.
- Parameters:
key (str) – the key to search for.
- Returns:
A set, possibly empty, of docids that contain the key, no matter what its value.
- Return type:
Set[str]
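The difference between get_docids_by_property (key and value must both match) and get_docids_with_property (only the key must be present) can be sketched with a nested dictionary. The index layout and both helper functions are assumptions made for illustration, not the library's internals.

```python
from typing import Dict, Set

# Hypothetical property index: key -> value -> docids.
property_index: Dict[str, Dict[str, Set[str]]] = {
    "author": {"Scott": {"1", "3"}, "Joe": {"2"}},
    "subject": {"your anniversary": {"1"}},
}


def docids_by_property(
    index: Dict[str, Dict[str, Set[str]]], key: str, value: str
) -> Set[str]:
    """Docids whose property `key` has exactly `value`."""
    return set(index.get(key, {}).get(value, set()))


def docids_with_property(index: Dict[str, Dict[str, Set[str]]], key: str) -> Set[str]:
    """Docids that carry property `key`, whatever its value."""
    found: Set[str] = set()
    for docids in index.get(key, {}).values():
        found |= docids
    return found


scott_docs = docids_by_property(property_index, "author", "Scott")
authored_docs = docids_with_property(property_index, "author")
```

Both helpers return an empty set, rather than raising, when the key or value is unknown, which matches the "may be empty" wording of the return descriptions.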
- invert_docid_set(original: Set[str]) Set[str] [source]
Invert a set of docids.
- Parameters:
original (Set[str]) –
- Return type:
Set[str]
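The documentation does not say what the inversion is relative to. A plausible reading, and only an assumption here, is the complement within the set of all docids known to the corpus. The invert helper below is hypothetical and takes that universe explicitly.

```python
from typing import Set


def invert(all_docids: Set[str], original: Set[str]) -> Set[str]:
    """Return every known docid that is NOT in `original` (assumed semantics)."""
    return all_docids - original


known = {"1", "2", "3"}
inverted = invert(known, {"1", "3"})
```

Under this reading, inverting the empty set returns every known docid, and inverting the full set returns the empty set.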
- query(query: str) Set[str] | None [source]
Query the corpus for documents that match a logical expression.
- Parameters:
query (str) –
the logical query expressed using a simple language that understands conjunction (and operator), disjunction (or operator) and inversion (not operator) as well as parentheses. Here are some legal sample queries:
tag1 and tag2 and not tag3
(tag1 or tag2) and (tag3 or tag4)
(tag1 and key2:value2) or (tag2 and key1:value1)
key:*
tag1 and key:*
- Returns:
A (potentially empty) set of docids for the matching (previously added) documents or None on error.
- Return type:
Set[str] | None
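One way to read the query operators is as set algebra over docids: and as intersection, or as union, and not as complement. The sketch below evaluates a few queries by hand on data mirroring the three documents from the Corpus example (with string docids). The dictionaries are hypothetical and no parsing is performed; this illustrates the semantics, not how query() is implemented.

```python
from typing import Dict, Set, Tuple

all_docids: Set[str] = {"1", "2", "3"}

# Hypothetical indexes mirroring the three example documents.
tags: Dict[str, Set[str]] = {
    "urgent": {"1", "3"},
    "important": {"1", "2"},
}
props: Dict[Tuple[str, str], Set[str]] = {
    ("author", "Scott"): {"1", "3"},
    ("author", "Joe"): {"2"},
}

# author:Scott and important  ->  intersection
q_and = props[("author", "Scott")] & tags["important"]

# urgent or important  ->  union
q_or = tags["urgent"] | tags["important"]

# not urgent  ->  complement within all docids
q_not = all_docids - tags["urgent"]

# (urgent or important) and not author:Joe
q_mixed = (tags["urgent"] | tags["important"]) & (all_docids - props[("author", "Joe")])
```

The first result agrees with the `c.query('author:Scott and important')` example, which returns only document 1.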
- class pyutils.search.logical_search.Document(docid: str = '', tags: ~typing.Set[str] = <factory>, properties: ~typing.List[~typing.Tuple[str, str]] = <factory>, reference: ~typing.Any | None = None)[source]
Bases:
object
A class representing a searchable document.
- Parameters:
docid (str) –
tags (Set[str]) –
properties (List[Tuple[str, str]]) –
reference (Any | None) –
- docid: str = ''
A unique identifier for each document; it must be provided by the caller. See python_modules.id_generator.get() or python_modules.string_utils.generate_uuid() for potential sources.
- properties: List[Tuple[str, str]]
A list of key->value strings for this document. May be empty. Properties are more flexible tags that have both a label and a value. e.g. “category:mystery” or “author:smith”.
- reference: Any | None = None
An optional reference to something else for convenience; interpreted only by caller code, ignored here.
- tags: Set[str]
A set of tag strings for this document. May be empty. Tags are simply text labels that are associated with a document and may be used to search for it later.
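The `<factory>` defaults in the class signature are how dataclass fields declared with a default factory are usually rendered. Assuming Document is such a dataclass, each instance gets its own fresh tags set and properties list. DocSketch below is a hypothetical stand-in used to show that behavior.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Set, Tuple


@dataclass
class DocSketch:
    """Hypothetical stand-in mirroring the documented Document fields."""

    docid: str = ""
    tags: Set[str] = field(default_factory=set)
    properties: List[Tuple[str, str]] = field(default_factory=list)
    reference: Optional[Any] = None


first = DocSketch(docid="a")
second = DocSketch(docid="b")

# Mutating one document's tags or properties does not leak into another,
# because each instance received its own container from the factory.
first.tags.add("urgent")
first.properties.append(("category", "mystery"))
```

The reference field is left as None here; per the description above, caller code alone would give it any meaning.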
- class pyutils.search.logical_search.Node(corpus: Corpus, op: Operation, operands: Sequence[Node | str])[source]
Bases:
object
A query AST node.
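The constructor signature (an operation plus a sequence of operands that are either nodes or strings) suggests a recursive tree evaluation. The following is a minimal sketch of that idea and not the library's code: Op, MiniNode, and the evaluate method are hypothetical names, and only plain tag operands are handled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Sequence, Set, Union


class Op(Enum):
    """Hypothetical operation set; the real Operation type is not shown here."""

    AND = "and"
    OR = "or"
    NOT = "not"


@dataclass
class MiniNode:
    """Hypothetical AST node: an operation applied to sub-nodes or tag names."""

    op: Op
    operands: Sequence[Union["MiniNode", str]]

    def evaluate(self, tags: Dict[str, Set[str]], universe: Set[str]) -> Set[str]:
        # Resolve each operand to a docid set: recurse into child nodes,
        # look up plain strings as tags (unknown tags match nothing).
        results = [
            operand.evaluate(tags, universe)
            if isinstance(operand, MiniNode)
            else set(tags.get(operand, set()))
            for operand in self.operands
        ]
        if self.op is Op.NOT:
            return universe - results[0]
        if self.op is Op.AND:
            return set.intersection(*results)
        return set.union(*results)


tag_index = {"urgent": {"1", "3"}, "important": {"1", "2"}}
universe = {"1", "2", "3"}

# Tree for: urgent and not important
tree = MiniNode(Op.AND, ["urgent", MiniNode(Op.NOT, ["important"])])
result = tree.evaluate(tag_index, universe)
```

Leaves are plain strings and interior nodes combine the sets produced by their children, so nested parentheses in a query correspond to nested nodes.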