compshs.utils package

Subpackages

Submodules

compshs.utils.check module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

compshs.utils.check.check_exist_column_name(connection: Connection, table_name: str, column_name: str) → bool

Check whether a column exists in a table.

compshs.utils.check.check_exist_table_name(connection: Connection, table_name: str) → bool

Check whether a table exists in a database.
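As a rough illustration of what the two existence checks do, here is a minimal sketch. It assumes the connection is a `sqlite3.Connection`, which this page does not state, and the names `table_exists` and `column_exists` are hypothetical stand-ins, not the compshs implementation.

```python
import sqlite3


def table_exists(connection: sqlite3.Connection, table_name: str) -> bool:
    """Hypothetical stand-in for check_exist_table_name (SQLite only)."""
    # sqlite_master lists every table; the name is bound as a parameter.
    row = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


def column_exists(
    connection: sqlite3.Connection, table_name: str, column_name: str
) -> bool:
    """Hypothetical stand-in for check_exist_column_name (SQLite only)."""
    if not table_exists(connection, table_name):
        return False
    # pragma_table_info is a table-valued function, so its argument can be bound.
    rows = connection.execute(
        "SELECT name FROM pragma_table_info(?)", (table_name,)
    ).fetchall()
    return any(name == column_name for (name,) in rows)


# Usage on an in-memory database with an illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER, body TEXT)")
has_table = table_exists(conn, "documents")
has_column = column_exists(conn, "documents", "body")
```

Other database backends expose their catalogs differently, so the queries above would need to change if the connection is not SQLite.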

compshs.utils.check.check_sql_identifier(identifier: str) → str

Ensure that an SQL identifier (table or column name) is safe to use in queries.

Parameters

identifier (str) – Identifier name to check (column or table name).

Returns

Identifier if valid.

Return type

str

compshs.utils.check.check_sql_identifiers(identifiers: Tuple[str, ...]) → Tuple[str, ...]

Ensure that a tuple of SQL identifiers (table or column names) is safe to use in queries.

Parameters

identifiers (Tuple[str, ...]) – Tuple of identifier names to check (column or table names).

Returns

Tuple of identifiers if valid.

Return type

Tuple[str, ...]
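The kind of validation the two identifier checks describe can be sketched as follows. The accepted pattern (ASCII letters, digits and underscores, not starting with a digit) and the `ValueError` raised on failure are assumptions; this page does not say which rules or exception compshs uses, and the function names are hypothetical.

```python
import re
from typing import Tuple

# Assumed rule: letters, digits and underscores only, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(identifier: str) -> str:
    """Return the identifier unchanged if it matches the assumed safe pattern."""
    if not isinstance(identifier, str) or _IDENTIFIER.fullmatch(identifier) is None:
        raise ValueError(f"Unsafe SQL identifier: {identifier!r}")
    return identifier


def validate_identifiers(identifiers: Tuple[str, ...]) -> Tuple[str, ...]:
    """Validate each identifier in turn and return them as a tuple."""
    return tuple(validate_identifier(name) for name in identifiers)


checked = validate_identifiers(("documents", "doc_id"))
```

Validating names this way matters because table and column names cannot be bound as query parameters, so they end up interpolated into the SQL string.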

compshs.utils.check.load_lang(lang: str = 'en_core_web_sm')

Load a (trained) spaCy pipeline.

Parameters

lang (str) – spaCy pipeline name (default is the English pipeline 'en_core_web_sm').

Returns

Trained spaCy pipeline if it can be loaded, otherwise a blank minimal pipeline.

compshs.utils.metrics module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

compshs.utils.metrics.average_pairwise_similarity(values_source, values_target) → float

Average pairwise similarity between two arrays of values.

Given two arrays of values \(I\) and \(J\), the average pairwise similarity, denoted \(psim(I,J)\), is computed as:

\[psim(I,J)=\dfrac{\sum_{i\in I}\sum_{j \in J}sim(i,j)}{|I||J|}\]

Parameters
  • values_source – Array of values.

  • values_target – Array of values.

Returns

Average pairwise similarity.

Return type

float
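The formula above can be sketched directly in Python. The similarity function `sim` is not specified on this page, so this sketch takes it as an explicit argument (a hypothetical parameter that the documented function does not have) and uses Jaccard similarity between sets purely as an illustrative choice.

```python
from itertools import product
from typing import Callable, Sequence


def average_pairwise_similarity_sketch(
    values_source: Sequence, values_target: Sequence, sim: Callable
) -> float:
    """Mean of sim(i, j) over all pairs (i, j) in I x J."""
    if len(values_source) == 0 or len(values_target) == 0:
        raise ValueError("Both inputs must be non-empty.")
    total = sum(sim(i, j) for i, j in product(values_source, values_target))
    return total / (len(values_source) * len(values_target))


def jaccard(a: set, b: set) -> float:
    """Illustrative similarity only; the similarity used by compshs is not stated."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


source = [{"a", "b"}, {"b", "c"}]
target = [{"b"}]
# Both pairs have Jaccard similarity 1/2, so the average is 0.5.
psim = average_pairwise_similarity_sketch(source, target, jaccard)
```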

compshs.utils.metrics.coherence(corpus: list, word_sets: dict) → float

Coherence over topics defined by word_sets.

Parameters
  • corpus (list) – Corpus of documents.

  • word_sets (dict) – Topic index as keys, word sets as values.

Returns

Overall topic coherence.

Return type

float

compshs.utils.metrics.diversity(top_words: dict) → float

Diversity over topics.

Parameters

top_words (dict) – Topic index as keys, top-word sets as values.

Returns

Diversity.

Return type

float
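This page does not define the diversity score. A common definition of topic diversity is the fraction of unique words among the top words of all topics, and the sketch below assumes that definition. It may differ from what compshs computes, and `diversity_sketch` is a hypothetical name.

```python
def diversity_sketch(top_words: dict) -> float:
    """Fraction of unique words among the top words of all topics (assumed definition)."""
    all_words = [word for words in top_words.values() for word in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


# "dog" appears in both topics: 3 unique words out of 4 listed.
example_top_words = {0: ["cat", "dog"], 1: ["dog", "bird"]}
score = diversity_sketch(example_top_words)
```

Under this definition a score of 1 means no word is shared between topics, and lower values indicate redundant topics.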

compshs.utils.metrics.topic_coherence(topic_words, dtm, vocab_index)

Topic coherence as average NPMI over characteristic words of the topic.
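To make the NPMI average concrete, here is a minimal sketch. For a word pair, NPMI is log(p(w1, w2) / (p(w1) p(w2))) divided by -log p(w1, w2), with probabilities estimated from document co-occurrence. For simplicity the sketch takes tokenized documents instead of the `dtm` and `vocab_index` arguments, whose exact formats are not described here; the boundary conventions (-1 for pairs that never co-occur, 1 for pairs present in every document) are assumptions as well.

```python
import math
from itertools import combinations


def npmi_coherence_sketch(topic_words, documents) -> float:
    """Average NPMI over all pairs of topic words (document-level co-occurrence)."""
    if not documents:
        raise ValueError("At least one document is required.")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words) -> float:
        # Fraction of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # assumed convention: never co-occur
        elif p12 == 1:
            scores.append(1.0)   # assumed convention: co-occur everywhere
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


docs = [["cat", "dog"], ["cat", "dog"], ["bird"], ["fish"]]
coherent = npmi_coherence_sketch(["cat", "dog"], docs)     # words always together
incoherent = npmi_coherence_sketch(["cat", "bird"], docs)  # words never together
```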

compshs.utils.rank module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

compshs.utils.rank.extract_top_words(viz_data, n_topics: int, lambdas: array, k: int) → dict

Extract the top words for each topic in viz_data.

Uses the relevance metric to select the top words.

Parameters
  • viz_data – Output from pyLDAvis library.

  • n_topics (int) – Number of topics.

  • lambdas (np.array) – Array of lambda values for relevance formula.

  • k (int) – Top-k words are selected.

Returns

Dictionary with topic number as key and top words as values.

Return type

dict
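The relevance metric used by pyLDAvis scores a word w for a topic t as λ log p(w|t) + (1 − λ) log(p(w|t) / p(w)). The sketch below applies it for a single λ to plain dictionaries of probabilities. How `extract_top_words` reads the pyLDAvis output and how it combines several values from `lambdas` is not described here, so the input format and function names are hypothetical.

```python
import math


def relevance(p_w_given_t: float, p_w: float, lam: float) -> float:
    """Relevance of a word for a topic, for one lambda in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


def top_words_by_relevance(
    topic_term_probs: dict, term_probs: dict, lam: float, k: int
) -> list:
    """Top-k words of one topic, ranked by relevance (hypothetical input format)."""
    scored = {
        word: relevance(p, term_probs[word], lam)
        for word, p in topic_term_probs.items()
        if p > 0  # log is undefined at zero
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Illustrative probabilities only.
p_topic = {"the": 0.5, "genome": 0.3, "cell": 0.2}    # p(w | t)
p_corpus = {"the": 0.6, "genome": 0.05, "cell": 0.1}  # p(w)
by_frequency = top_words_by_relevance(p_topic, p_corpus, lam=1.0, k=2)
by_lift = top_words_by_relevance(p_topic, p_corpus, lam=0.0, k=2)
```

With λ = 1 words are ranked by their probability within the topic, while λ = 0 ranks them by lift, which favors words specific to the topic over globally frequent ones.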

compshs.utils.rank.top_k(values: ndarray, k: int = 1) → ndarray

Return the indices of the k highest values.

Parameters
  • values (np.ndarray) – Array of values.

  • k (int) – Number of elements to return (default = 1).

Returns

Array of k indices.

Return type

np.ndarray
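One way to obtain such indices with NumPy is sketched below. The order of the returned indices (highest value first) is an assumption, since it is not specified above, and `top_k_sketch` is a hypothetical name rather than the compshs function.

```python
import numpy as np


def top_k_sketch(values: np.ndarray, k: int = 1) -> np.ndarray:
    """Indices of the k highest values, highest first (assumed ordering)."""
    values = np.asarray(values, dtype=float)
    # argsort is ascending, so sort the negated values to get descending order.
    return np.argsort(-values, kind="stable")[:k]


scores = np.array([0.1, 0.9, 0.5])
best_two = top_k_sketch(scores, k=2)
```

For large arrays, `np.argpartition` avoids a full sort, at the cost of returning the k indices in no particular order.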

Module contents

utils module