compshs.utils package

Subpackages

compshs.utils.tests package

Submodules

compshs.utils.check module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

compshs.utils.check.check_exist_column_name(connection: Connection, table_name: str, column_name: str) → bool: Check whether a column exists in a table.

compshs.utils.check.check_exist_table_name(connection: Connection, table_name: str) → bool: Check whether a table exists in a database.
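As an illustration of what such existence checks can look like, the sketch below assumes a standard-library `sqlite3` connection. The `Connection` type actually used by compshs is not stated here, and the names `table_exists_sketch` and `column_exists_sketch` are hypothetical, not part of the package.

```python
import sqlite3


def table_exists_sketch(connection: sqlite3.Connection, table_name: str) -> bool:
    """Return True if sqlite_master lists a table with this name (SQLite only)."""
    row = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


def column_exists_sketch(
    connection: sqlite3.Connection, table_name: str, column_name: str
) -> bool:
    """Return True if pragma_table_info reports the column for this table."""
    row = connection.execute(
        "SELECT 1 FROM pragma_table_info(?) WHERE name = ?",
        (table_name, column_name),
    ).fetchone()
    return row is not None


# Usage on an in-memory database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, body TEXT)")
print(table_exists_sketch(conn, "docs"))
print(column_exists_sketch(conn, "docs", "title"))
```

Both queries pass names as bound parameters, so the names being looked up are never interpolated into the SQL text.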

compshs.utils.check.check_sql_identifier(identifier: str) → str

Ensure that an SQL identifier (table or column name) is safe to use in queries.

Parameters: identifier (str) – Identifier name to check (column or table name).
Returns: Identifier if valid.
Return type: str
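A minimal sketch of this kind of validation follows. It assumes that "safe" means a letter or underscore followed by letters, digits, or underscores, and that invalid input raises `ValueError`. Neither the rule nor the exception type is confirmed by this page, and `check_sql_identifier_sketch` is a hypothetical name.

```python
import re

# Assumed rule: start with a letter or underscore, then letters, digits, underscores.
_IDENTIFIER_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_sql_identifier_sketch(identifier: str) -> str:
    """Return the identifier unchanged if it matches the assumed pattern."""
    if not isinstance(identifier, str) or not _IDENTIFIER_PATTERN.fullmatch(identifier):
        raise ValueError(f"Unsafe SQL identifier: {identifier!r}")
    return identifier


# Usage: a validated name can then be interpolated into a query string.
table = check_sql_identifier_sketch("documents")
query = f"SELECT COUNT(*) FROM {table}"
print(query)
```

Identifiers cannot be passed as bound parameters in most SQL drivers, which is why an allow-list check of this kind is typically applied before string interpolation.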

compshs.utils.check.check_sql_identifiers(identifiers: Tuple[str, ...]) → Tuple[str, ...]

Ensure that a tuple of SQL identifiers (table or column names) is safe to use in queries.

Parameters: identifiers (Tuple[str, ...]) – Identifier names to check (column or table names).
Returns: Tuple of identifiers if valid.
Return type: Tuple[str, ...]

compshs.utils.check.load_lang(lang: str = 'en_core_web_sm')

Load a (trained) spaCy pipeline.

Parameters: lang (str) – spaCy pipeline name (default is the English pipeline 'en_core_web_sm').
Returns: Trained spaCy pipeline, otherwise a blank minimal pipeline.

compshs.utils.metrics module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

compshs.utils.metrics.average_pairwise_similarity(values_source, values_target) → float

Average pairwise similarity between two arrays of values.

Given two arrays of values \(I,J\), the average pairwise similarity, denoted \(psim(I,J)\), is computed as:

\[psim(I,J)=\dfrac{\sum_{i\in I}\sum_{j \in J}sim(i,j)}{|I||J|}\]

where \(sim(i,j)\) denotes the similarity between values \(i\) and \(j\).

Parameters

values_source – Array of values.
values_target – Array of values.

Returns

Average pairwise similarity.

Return type

float
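The formula above can be sketched in plain Python as follows. The choice of cosine similarity for \(sim(i,j)\) is an assumption made for illustration, since this page does not say which similarity the package uses, and the `_sketch` names are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def average_pairwise_similarity_sketch(
    values_source: Sequence[Vector],
    values_target: Sequence[Vector],
    sim: Callable[[Vector, Vector], float] = cosine_similarity,
) -> float:
    """Sum sim(i, j) over all pairs and divide by |I| * |J|."""
    if not values_source or not values_target:
        raise ValueError("Both inputs must be non-empty.")
    total = sum(sim(i, j) for i in values_source for j in values_target)
    return total / (len(values_source) * len(values_target))


# Two source vectors against one target: similarities are 1 and 0.
I = [[1.0, 0.0], [0.0, 1.0]]
J = [[1.0, 0.0]]
print(average_pairwise_similarity_sketch(I, J))  # 0.5
```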

compshs.utils.metrics.coherence(corpus: list, word_sets: dict) → float

Coherence over topics defined by word_sets.

Parameters

corpus (list) – Corpus of documents.
word_sets (dict) – Topic index as keys, word sets as values.

Returns

Overall topic coherence.

Return type

float

compshs.utils.metrics.diversity(top_words: dict) → float

Diversity over topics.

Parameters: top_words (dict) – Topic index as keys, top-word sets as values.
Returns: Diversity.
Return type: float
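This page does not define how diversity is computed. One common definition in the topic-modeling literature is the fraction of unique words among all top words across topics; the sketch below assumes that definition, and `diversity_sketch` is a hypothetical name.

```python
from typing import Dict, Iterable


def diversity_sketch(top_words: Dict[int, Iterable[str]]) -> float:
    """Fraction of distinct words among all top words (assumed definition)."""
    all_words = [word for words in top_words.values() for word in words]
    if not all_words:
        return 0.0  # convention for empty input in this sketch
    return len(set(all_words)) / len(all_words)


# "b" appears in both topics, so 3 unique words out of 4 in total.
print(diversity_sketch({0: ["a", "b"], 1: ["b", "c"]}))  # 0.75
```

Under this definition, a value of 1 means no word is shared between topics, and lower values indicate more overlap.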

compshs.utils.metrics.topic_coherence(topic_words, dtm, vocab_index): Topic coherence as the average normalized pointwise mutual information (NPMI) over the characteristic words of the topic.
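To make the NPMI averaging concrete, here is a minimal sketch. It assumes document-level co-occurrence and takes documents as sets of tokens instead of the `dtm` and `vocab_index` arguments of the real function; the handling of edge cases and the `_sketch` names are also assumptions. For two words, NPMI is log(p(a,b) / (p(a) p(b))) divided by -log p(a,b).

```python
import math
from itertools import combinations
from typing import Sequence, Set


def npmi_sketch(word_a: str, word_b: str, documents: Sequence[Set[str]]) -> float:
    """NPMI of two words from document-level co-occurrence counts."""
    n_docs = len(documents)
    count_a = sum(1 for doc in documents if word_a in doc)
    count_b = sum(1 for doc in documents if word_b in doc)
    count_ab = sum(1 for doc in documents if word_a in doc and word_b in doc)
    if count_ab == 0:
        return -1.0  # limiting value when the words never co-occur
    if count_ab == n_docs:
        return 1.0  # formula is 0/0 here; treated as 1 in this sketch
    p_a, p_b, p_ab = count_a / n_docs, count_b / n_docs, count_ab / n_docs
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence_sketch(topic_words: Sequence[str], documents: Sequence[Set[str]]) -> float:
    """Average NPMI over all unordered pairs of topic words."""
    pairs = list(combinations(topic_words, 2))
    if not pairs:
        return 0.0  # convention for fewer than two words in this sketch
    return sum(npmi_sketch(a, b, documents) for a, b in pairs) / len(pairs)


docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
print(topic_coherence_sketch(["a", "b"], docs))  # words always co-occur
print(topic_coherence_sketch(["a", "c"], docs))  # words never co-occur
```

NPMI ranges from -1 (never co-occurring) through 0 (independent) to 1 (always co-occurring).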

compshs.utils.rank module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

compshs.utils.rank.extract_top_words(viz_data, n_topics: int, lambdas: array, k: int) → dict

Extract the top words for each topic in viz_data, using the relevance metric to select them.

Parameters

viz_data – Output from pyLDAvis library.
n_topics (int) – Number of topics.
lambdas (np.array) – Array of lambda values for the relevance formula.
k (int) – Top-k words are selected.

Returns

Dictionary with topic number as key and top words as values.

Return type

dict
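The relevance metric used by LDAvis scores a word w for a topic t as λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The sketch below ranks words with that formula for a single λ. It works on plain dictionaries instead of pyLDAvis output, does not show how the real function combines several values in `lambdas`, and uses hypothetical `_sketch` names.

```python
import math
from typing import Dict, List


def relevance_sketch(p_word_given_topic: float, p_word: float, lam: float) -> float:
    """Relevance: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_word_given_topic) + (1.0 - lam) * math.log(
        p_word_given_topic / p_word
    )


def top_relevant_words_sketch(
    topic_word_probs: Dict[str, float],
    word_probs: Dict[str, float],
    lam: float,
    k: int,
) -> List[str]:
    """Return the k words with the highest relevance for one topic."""
    scores = {
        word: relevance_sketch(p, word_probs[word], lam)
        for word, p in topic_word_probs.items()
        if p > 0.0  # log is undefined for zero probability
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Illustrative probabilities (made up for this example).
p_w_given_t = {"the": 0.5, "gene": 0.3, "dna": 0.2}
p_w = {"the": 0.5, "gene": 0.05, "dna": 0.02}
print(top_relevant_words_sketch(p_w_given_t, p_w, lam=1.0, k=2))  # ['the', 'gene']
print(top_relevant_words_sketch(p_w_given_t, p_w, lam=0.0, k=2))  # ['dna', 'gene']
```

With λ = 1 words are ranked by their probability within the topic; with λ = 0 they are ranked by lift, which favors words that are rare overall but frequent in the topic.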

compshs.utils.rank.top_k(values: ndarray, k: int = 1) → ndarray

Return the indices of the k highest values.

Parameters

values (np.ndarray) – Array of values.
k (int) – Number of elements to return (default = 1).

Returns

Array of k indices.

Return type

np.ndarray
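A dependency-free sketch of the same idea is shown below. It uses a plain list in place of `np.ndarray` and returns the indices ordered from highest to lowest value; whether the real function orders its result this way is an assumption, and `top_k_sketch` is a hypothetical name.

```python
import heapq
from typing import List, Sequence


def top_k_sketch(values: Sequence[float], k: int = 1) -> List[int]:
    """Indices of the k largest values, ordered from highest to lowest."""
    return heapq.nlargest(k, range(len(values)), key=values.__getitem__)


scores = [0.1, 0.7, 0.3, 0.9]
print(top_k_sketch(scores, k=2))  # [3, 1]
print(top_k_sketch(scores))       # [3]
```

For a float NumPy array, a comparable result can be obtained with `np.argsort(-values)[:k]`.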

Module contents

utils module