Utils

Metrics

compshs.utils.diversity(top_words: dict) → float

Diversity over topics.

Parameters

top_words (dict) – Topic index as keys, top-word sets as values.

Returns

Diversity.

Return type

float
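
The exact formula behind diversity is not stated here. As a minimal sketch, assuming the common definition "share of unique words among all top words across topics", the computation could look like the following. The function name diversity_sketch and the sample topics are hypothetical and not part of compshs.

```python
def diversity_sketch(top_words: dict) -> float:
    """Share of unique words among all top words (assumed definition)."""
    # Flatten the per-topic word sets into one list, keeping repeats across topics.
    all_words = [word for words in top_words.values() for word in words]
    if not all_words:
        return 0.0
    # 1.0 means no word is shared between topics; lower values mean more overlap.
    return len(set(all_words)) / len(all_words)


# Hypothetical input: topic index as keys, top-word sets as values.
sample_top_words = {0: {"cat", "dog"}, 1: {"dog", "fish"}}
sample_diversity = diversity_sketch(sample_top_words)
```

Under this assumed definition, a word shared by several topics lowers the score, so a value close to 1 suggests well-separated topics.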

compshs.utils.coherence(corpus: list, word_sets: dict) → float

Coherence over topics defined by word_sets.

Parameters
  • corpus (list) – Corpus of documents.

  • word_sets (dict) – Topic index as keys, word sets as values.

Returns

Overall topic coherence.

Return type

float

compshs.utils.average_pairwise_similarity(values_source, values_target) → float

Average pairwise similarity between two arrays of values.

Given two arrays of values \(I\) and \(J\), the average pairwise similarity, denoted \(psim(I,J)\), is computed as:

\[psim(I,J)=\dfrac{\sum_{i\in I}\sum_{j \in J}sim(i,j)}{|I||J|}\]

where \(sim(i,j)\) is the similarity between elements \(i\) and \(j\).

Parameters
  • values_source – Array of values.

  • values_target – Array of values.

Returns

Average pairwise similarity.

Return type

float
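
The formula above can be sketched directly in Python. The similarity function sim is not specified in this reference, so the sketch takes it as an explicit argument, which is an assumption and differs from the documented two-argument signature. The name average_pairwise_similarity_sketch and the exact-match similarity are hypothetical.

```python
def average_pairwise_similarity_sketch(values_source, values_target, sim):
    """Mean of sim(i, j) over all pairs (i, j) from the two collections."""
    source = list(values_source)
    target = list(values_target)
    if not source or not target:
        raise ValueError("both collections must be non-empty")
    # Double sum over I x J, then divide by |I| * |J|.
    total = sum(sim(i, j) for i in source for j in target)
    return total / (len(source) * len(target))


def exact_match(a, b) -> float:
    """Hypothetical similarity: 1.0 if the values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


# 6 pairs in total, only (2, 2) matches, so the result is 1/6.
psim_value = average_pairwise_similarity_sketch([1, 2], [2, 3, 4], exact_match)
```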

Rank

compshs.utils.top_k(values: ndarray, k: int = 1) → ndarray

Returns indices of the k highest values.

Parameters
  • values (np.ndarray) – Array of values.

  • k (int) – Number of elements to return (default = 1).

Returns

Array of k indices.

Return type

np.ndarray
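
As an illustration of what "indices of the k highest values" means, here is a minimal NumPy sketch. It assumes the indices are returned from highest to lowest value, which this reference does not confirm. The name top_k_sketch is hypothetical.

```python
import numpy as np


def top_k_sketch(values: np.ndarray, k: int = 1) -> np.ndarray:
    """Indices of the k highest values, highest first (assumed ordering)."""
    values = np.asarray(values)
    # argsort is ascending, so reverse it and keep the first k indices.
    return np.argsort(values)[::-1][:k]


scores = np.array([0.1, 0.7, 0.3, 0.9])
best_two = top_k_sketch(scores, k=2)  # indices of 0.9 and 0.7
```

For large arrays, np.argpartition avoids a full sort, at the cost of not ordering the selected indices.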

compshs.utils.extract_top_words(viz_data, n_topics: int, lambdas: array, k: int) → dict

Extract the top words for each topic in viz_data.

The relevance metric is used to select the top words.

Parameters
  • viz_data – Output from pyLDAvis library.

  • n_topics (int) – Number of topics.

  • lambdas (np.array) – Array of lambda values for the relevance formula.

  • k (int) – Top-k words are selected.

Returns

Dictionary with topic number as key and top words as values.

Return type

dict
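
The relevance formula is not spelled out here. A minimal sketch, assuming the relevance measure used by LDAvis, \(\lambda \log p(w|t) + (1-\lambda)\log\frac{p(w|t)}{p(w)}\), and assuming one lambda value per topic, is shown below. It works on plain arrays rather than on actual pyLDAvis output. The function name, its arguments, and the toy probabilities are hypothetical.

```python
import numpy as np


def relevance_top_words_sketch(vocab, topic_word, word_marginal, lambdas, k):
    """Top-k words per topic ranked by relevance (assumed formula).

    vocab: list of words, length n_words.
    topic_word: array (n_topics, n_words) of p(w | t).
    word_marginal: array (n_words,) of p(w).
    lambdas: array (n_topics,), one weight per topic (assumption).
    """
    log_prob = np.log(topic_word)
    log_lift = np.log(topic_word / word_marginal)
    top_words = {}
    for t in range(topic_word.shape[0]):
        # lambda = 1 ranks by probability only, lambda = 0 by lift only.
        relevance = lambdas[t] * log_prob[t] + (1.0 - lambdas[t]) * log_lift[t]
        best = np.argsort(relevance)[::-1][:k]
        top_words[t] = [vocab[i] for i in best]
    return top_words


toy_vocab = ["a", "b", "c"]
toy_topic_word = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
toy_marginal = np.array([0.4, 0.25, 0.35])
toy_top = relevance_top_words_sketch(
    toy_vocab, toy_topic_word, toy_marginal, np.array([1.0, 0.0]), k=2
)
```

The result has the same shape as the documented return value: topic number as key and a list of top words as value.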