compshs.semantics package

Subpackages

Submodules

compshs.semantics.base module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.semantics.base.BaseSemanticShift[source]

Bases: object

Base class for Semantic Shift Detection.

Note: Semantic shift detection requires contextual word embeddings with additional attributes, e.g. time information.

compute_metric_for_timepair(embeddings, timedate_source, timedate_target, attribute_x, attribute_y, keywords_strategy, embeddings_strategy, metric_name) list[source]

Generic computation for metrics between embeddings over time.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedate_source (str) – Source timedate

  • timedate_target (str) – Target timedate

  • attribute_x (str) – First attribute

  • attribute_y (str) – Second attribute

  • keywords_strategy – Strategy for selecting keyword intersection at different timedates.

  • embeddings_strategy – Strategy for selecting embeddings at different timedates.

  • metric_name (str) – Metric name.

Returns

List of dictionaries with metric information for all keywords.

Return type

list

get_common_attributes(embeddings: list, timedate_x: str, timedate_y: str) set[source]

Compute the set of common attributes between subcorpora at two different timesteps.

get_common_keywords(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str) set[source]

Compute the set of common keywords between subcorpora at two timesteps and for two attributes.
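As a rough illustration of the keyword intersection, here is a minimal sketch. It assumes embeddings have already been nested as `grouped[timedate][attribute][keyword]`, and that "common" means present in all four (timedate, attribute) combinations; both points are assumptions, and `common_keywords` is a hypothetical helper rather than the library's method.

```python
def common_keywords(grouped, attribute_x, attribute_y, timedate_x, timedate_y):
    """Keywords found for both attributes at both timedates (assumed semantics)."""
    cells = [
        grouped.get(timedate, {}).get(attribute, {})
        for timedate in (timedate_x, timedate_y)
        for attribute in (attribute_x, attribute_y)
    ]
    # A missing timedate or attribute yields an empty cell, hence an empty result.
    return set.intersection(*(set(cell) for cell in cells))


# Toy data: the values stand in for lists of contextual embeddings.
grouped = {
    "2020": {"left": {"tax": [], "climate": []}, "right": {"tax": [], "border": []}},
    "2021": {"left": {"tax": [], "climate": []}, "right": {"tax": [], "climate": []}},
}
shared = common_keywords(grouped, "left", "right", "2020", "2021")
```

Only `"tax"` survives here, because `"climate"` is missing for `"right"` in 2020.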

get_group(keyword, groups, group_names)[source]

group_embeddings(embeddings: list, timedates: ndarray, attributes: ndarray) dict[source]

Group dictionaries of contextual embeddings by time, attribute, and keyword.

Parameters
  • embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedates (np.ndarray) – Array of time values.

  • attributes (np.ndarray) – Array of attribute values.

Returns

Single dictionary of keyword embeddings indexed by timedate, attribute, keyword.

Return type

dict
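The nested indexing can be pictured with a minimal sketch. It assumes, hypothetically, that each document's embeddings form a dict mapping a keyword to a list of vectors; the actual ContextualEmbedding.transform() output format is not described in this section, and `group_by_time_attribute_keyword` is an illustrative name only.

```python
from collections import defaultdict


def group_by_time_attribute_keyword(embeddings, timedates, attributes):
    """Nest per-document embeddings as grouped[timedate][attribute][keyword]."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc, timedate, attribute in zip(embeddings, timedates, attributes):
        for keyword, vectors in doc.items():
            # Pool all occurrences of a keyword within one (timedate, attribute) cell.
            grouped[timedate][attribute][keyword].extend(vectors)
    return grouped


# One entry per document; timedates and attributes are aligned with the documents.
docs = [
    {"tax": [[0.1, 0.9]], "climate": [[0.8, 0.2]]},
    {"tax": [[0.2, 0.8]]},
    {"tax": [[0.7, 0.3]]},
]
timedates = ["2020", "2020", "2021"]
attributes = ["left", "left", "right"]

grouped = group_by_time_attribute_keyword(docs, timedates, attributes)
n_tax_2020_left = len(grouped["2020"]["left"]["tax"])  # two documents contribute
```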

group_output_fixed(similarities: DataFrame, groups: list, group_names: list) Tuple[DataFrame, list][source]

Group approaching-based semantic detection output for fixed mode.

Parameters
  • similarities (pd.DataFrame) – DataFrame containing approaching-based semantic similarities (see shift.py classes).

  • groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.

  • group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

Tuple of grouped similarities for approaching-based semantic detection methods and ordered keywords list.

Return type

Tuple[pd.DataFrame, list]

group_output_sequential(similarities: DataFrame, metric: str, groups: list, group_names: list) DataFrame[source]

Group approaching-based semantic detection output for sequential mode.

Parameters
  • similarities (pd.DataFrame) – DataFrame containing approaching-based semantic similarities (see shift.py classes).

  • metric (str) – Metric name.

  • groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.

  • group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

Grouped similarities for approaching-based semantic detection methods.

Return type

pd.DataFrame
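The effect of `groups` and `group_names` can be sketched on a plain dict of per-keyword scores. This is a simplified stand-in for the DataFrame-based output: `group_scores` is a hypothetical helper, and keeping ungrouped keywords under their own name is an assumption.

```python
from statistics import mean


def group_scores(scores, groups, group_names):
    """Average per-keyword scores within each named group of keywords."""
    keyword_to_group = {
        keyword: name
        for keywords, name in zip(groups, group_names)
        for keyword in keywords
    }
    buckets = {}
    for keyword, value in scores.items():
        # Assumption: a keyword outside every group is kept under its own name.
        buckets.setdefault(keyword_to_group.get(keyword, keyword), []).append(value)
    return {name: mean(values) for name, values in buckets.items()}


scores = {"tax": 0.25, "budget": 0.75, "climate": -0.5}
grouped_scores = group_scores(scores, groups=[["tax", "budget"]], group_names=["economy"])
```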

compshs.semantics.concept_induction module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.semantics.concept_induction.ConceptInduction(local_params: dict, global_params: dict, distance: str = 'cosine')[source]

Bases: object

Concept Induction (as a generalization of Word Sense Induction) framework from Liétard et al. (2024).

Original implementation is available at https://github.com/blietard/concept-induction.

Note: This class implements the bi-level approach.

Parameters
  • local_params (dict) –

    Dictionary of local parameters. Keys must contain:
    • mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for local clustering (str, default = 'average').

    • nu: Value of local \(\nu\) hyperparameter (int)

  • global_params (dict) –

    Dictionary of global parameters. Keys must contain:
    • mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for global clustering (str, default = 'average').

    • nu: Value of global \(\nu\) hyperparameter (int)

  • distance (str) – Metric used for computing distance between clustered instances (default = 'cosine').

References

Liétard, B., Denis, P., & Keller, M. (2024). To word senses and beyond: Inducing concepts with contextualized language models. arXiv preprint arXiv:2406.20054.

average_senses(sense_clusters: dict) tuple[source]

Average embeddings in local sense clusters.

Parameters

sense_clusters (dict) – Dictionary of sense clusters.

Returns

Array of average senses in clusters, and the corresponding list of original keywords.

Return type

tuple
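A minimal sketch of the averaging step follows. It assumes that `sense_clusters` maps each keyword to a list of clusters, each cluster being a list of vectors; that layout and the name `average_senses_sketch` are assumptions for illustration.

```python
def average_senses_sketch(sense_clusters):
    """Return one centroid per sense cluster and the keyword each centroid came from."""
    senses, keywords_origin = [], []
    for keyword, clusters in sense_clusters.items():
        for cluster in clusters:
            dim = len(cluster[0])
            # Component-wise mean of the contextual embeddings in this cluster.
            centroid = [sum(vec[i] for vec in cluster) / len(cluster) for i in range(dim)]
            senses.append(centroid)
            keywords_origin.append(keyword)
    return senses, keywords_origin


sense_clusters = {
    "bank": [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]]],  # two senses
    "river": [[[0.0, 4.0], [0.0, 6.0]]],               # one sense
}
senses, keywords_origin = average_senses_sketch(sense_clusters)
```

The two returned sequences are aligned index by index, which is what allows a later global clustering over senses to be traced back to keywords.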

get_concept_clusters(senses, keywords_origin: list, nu: int, distance: str = 'cosine', mode: str = 'average')[source]

Compute globally estimated concept clusters.

Formally, a concept \(c_k\) is a cluster of senses, and \(C=\{c_k\}_{1 \leq k \leq p}\) is a partition of \(O\) in \(p\) concept clusters.

Parameters
  • senses

  • keywords_origin (list) –

  • nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.

  • distance (str) – Distance metric computed between senses (default = 'cosine').

  • mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').

Returns

A list of concepts as a soft clustering over original keywords.

Return type

list
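To see why the result is a soft clustering over keywords, consider the sketch below. It assumes the global clustering yields one integer label per sense, aligned with `keywords_origin`; the helper name and the list-of-lists return format are hypothetical.

```python
from collections import defaultdict


def concepts_from_labels(labels, keywords_origin):
    """Map each concept (cluster label) to the keywords whose senses it contains."""
    concepts = defaultdict(set)
    for label, keyword in zip(labels, keywords_origin):
        concepts[label].add(keyword)
    return [sorted(concepts[label]) for label in sorted(concepts)]


# "bank" has two senses, which land in two different concept clusters.
labels = [0, 1, 1, 0]
keywords_origin = ["bank", "bank", "river", "money"]
concepts = concepts_from_labels(labels, keywords_origin)
```

Each sense belongs to exactly one concept, but a keyword with several senses can appear in several concepts.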

get_sense_clusters(keyword_embeddings: dict, nu: int, distance: str = 'cosine', mode: str = 'average') dict[source]

Compute locally estimated sense clusters (using sklearn AgglomerativeClustering algorithm).

Formally, from the set of contextual occurrences of keyword \(w\), denoted \(O^w\), it computes a partition \(S^w = \{ s_j^{w} \}_{1 \leq j \leq n_w}\).

The set of all sense clusters of all keywords is the union of all partitions, \(\bigcup_{w \in W} S^{w}\).

Parameters
  • keyword_embeddings (dict) – Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.

  • nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.

  • distance (str) – Distance metric computed between keyword embeddings (default = 'cosine').

  • mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').

Returns

Dictionary of sense clusters, with keywords as keys and lists of contextual embeddings as values.

Return type

dict
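The threshold \(\tau\) can be worked through by hand on a toy example. The sketch below assumes cosine distance over all unordered pairs of occurrences and a population standard deviation; how the resulting value is passed to the clustering algorithm is not shown here, and both function names are illustrative.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_threshold(vectors, nu):
    """tau = avg(d) - nu * std(d), with d the pairwise distances between vectors."""
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return mean(distances) - nu * pstdev(distances)


occurrences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tau_nu0 = distance_threshold(occurrences, nu=0)  # plain average distance
tau_nu1 = distance_threshold(occurrences, nu=1)  # stricter (smaller) threshold
```

A larger \(\nu\) lowers \(\tau\), so under this reading fewer instances are merged and more, finer clusters result.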

group_embeddings(embeddings: list) dict[source]

Group a list of contextual embeddings by keywords. In the wording of Liétard et al., these correspond to the contextual occurrences of keywords.

Parameters

embeddings (list) – Pre-computed contextual word embeddings stored in a list of \(n\) elements, with \(n\) the number of documents in corpus.

Returns

Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.

Return type

dict

transform(embeddings, group_by_keywords: bool = True)[source]

Perform concept induction given precomputed contextual embeddings.

Parameters
  • embeddings – Pre-computed contextual word embeddings.

  • group_by_keywords (bool) – Set to True if embeddings are in the format of ContextualEmbedding.transform() output (default).

Returns

A list of concepts as a soft clustering over original keywords.

Return type

list

compshs.semantics.shift module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.semantics.shift.AsApp[source]

Bases: BaseSemanticShift

Asymmetric Approaching between word embeddings.

\[asApp(w, a) = psim(I_{w,a,t+1}, I_{w,a^{\prime},t}) - psim(I_{w,a,t}, I_{w,a^{\prime},t})\]

where \(I_{w,a,t}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(t\).

A positive value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of a word \(w\) that moves closer to the one initially used by attribute \(a^{\prime}\) (time \(t\)).

A negative value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of a word \(w\) that moves away from the one initially used by attribute \(a^{\prime}\) (time \(t\)).
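The sign convention can be checked on toy vectors. The sketch below assumes that \(psim\) is the average pairwise cosine similarity between two sets of embeddings; `cosine`, `psim` and `as_app` are illustrative helpers, not the class's methods.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(xs, ys):
    """Average pairwise cosine similarity between two sets of vectors (assumed definition)."""
    return sum(cosine(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def as_app(a_next, a_prev, other_prev):
    """asApp(w, a) = psim(I_{w,a,t+1}, I_{w,a',t}) - psim(I_{w,a,t}, I_{w,a',t})."""
    return psim(a_next, other_prev) - psim(a_prev, other_prev)


other_prev = [[0.0, 1.0]]  # attribute a' at time t
# Attribute a moves onto the direction a' used at time t: positive value.
approaching = as_app(a_next=[[0.0, 2.0]], a_prev=[[1.0, 0.0]], other_prev=other_prev)
# Attribute a moves away from that direction: negative value.
diverging = as_app(a_next=[[1.0, 0.0]], a_prev=[[1.0, 1.0]], other_prev=other_prev)
```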

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

asapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]

Returns embeddings (embeddings for both attributes at current and previous timedates, embeddings for both attributes at previous timedate).

asapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]

Compute intersection between sets of keywords at different timedates.

compute_asapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) list[source]

Compute asapp metric for a timepair.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedate_source (str) – Source timedate

  • timedate_target (str) – Target timedate

  • attribute_x (str) – First attribute

  • attribute_y (str) – Second attribute

Returns

List of dictionaries with metric information for all keywords.

Return type

list

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]

Compute asymmetric approaching for all available keywords and attributes in contextual embeddings.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedates (np.ndarray) – Array of time values.

  • attributes (np.ndarray) – Array of attribute values, which define the subcorpora.

  • group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).

  • mode (str) – 'sequential' or 'fixed'.

  • time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.

  • group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.

  • groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.

  • group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of asymmetric approaching similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.shift.DS[source]

Bases: BaseSemanticShift

Driving Strength between word embeddings.

\[DS(w,a) = \dfrac{asApp(w,a)}{|asApp(w,a)|+|asApp(w,a^{\prime})|}\]

Driving Strength is an asymmetric, time-aware, normalised measure indicating how much of the total approaching between two subcorpora is done by one side.
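The normalisation can be sketched directly from the formula, taking the two \(asApp\) values as inputs. Returning 0.0 when both values are zero is an assumption made here to avoid a division by zero; the source does not say how that case is handled.

```python
def driving_strength(as_app_a, as_app_other):
    """DS(w, a) = asApp(w, a) / (|asApp(w, a)| + |asApp(w, a')|)."""
    denominator = abs(as_app_a) + abs(as_app_other)
    if denominator == 0:
        return 0.0  # assumption: no movement on either side means no driving side
    return as_app_a / denominator


ds_a = driving_strength(0.75, 0.25)      # a does most of the approaching
ds_other = driving_strength(0.25, 0.75)  # the same pair seen from a'
ds_away = driving_strength(-0.5, 0.5)    # a moves away while a' approaches
```

Values lie in \([-1, 1]\), and the absolute values for the two sides of a pair sum to 1 whenever the denominator is non-zero.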

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, time_pair: Optional[tuple] = None, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]

Compute driving strength metric for all available keywords and attributes in contextual embeddings.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedates (np.ndarray) – Array of time values.

  • attributes (np.ndarray) – Array of attribute values, which define the subcorpora.

  • group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).

  • time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time).

  • groups (list) – Use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.

  • group_names (list) – Use this parameter to rename groups of keywords.

Returns

DataFrame of driving strength similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.shift.SApp[source]

Bases: BaseSemanticShift

Symmetric approaching between word embeddings.

\[sApp(w) = psim(I_{w,a,k+1}, I_{w,a^{\prime},k+1}) - psim(I_{w,a,k}, I_{w,a^{\prime},k})\]

where \(I_{w,a,k}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(k\).

A positive value for \(sApp(w)\) indicates that the two subcorpora's representations of the word moved closer together over time. Conversely, a negative value indicates that the word representations diverged over time.
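A small numeric check of the formula is sketched below, again assuming that \(psim\) is the average pairwise cosine similarity between two sets of embeddings. The helper names are illustrative and do not mirror the class's internals.

```python
import math


def psim(xs, ys):
    """Average pairwise cosine similarity (assumed definition); vectors must be non-zero."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return sum(cosine(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def s_app(x_prev, y_prev, x_next, y_next):
    """sApp(w) = psim(I_{w,a,k+1}, I_{w,a',k+1}) - psim(I_{w,a,k}, I_{w,a',k})."""
    return psim(x_next, y_next) - psim(x_prev, y_prev)


# Orthogonal usages at timestep k, identical usages at timestep k+1.
converged = s_app(
    x_prev=[[1.0, 0.0]], y_prev=[[0.0, 1.0]],
    x_next=[[1.0, 1.0]], y_next=[[1.0, 1.0]],
)
# Swapping the roles of the two attributes gives the same value.
swapped = s_app(
    x_prev=[[0.0, 1.0]], y_prev=[[1.0, 0.0]],
    x_next=[[1.0, 1.0]], y_next=[[1.0, 1.0]],
)
```

Unlike \(asApp\), the measure compares both attributes at the same timestep, so it does not say which side moved.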

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

compute_sapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) list[source]

Compute sapp metric for a timepair.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedate_source (str) – Source timedate

  • timedate_target (str) – Target timedate

  • attribute_x (str) – First attribute

  • attribute_y (str) – Second attribute

Returns

List of dictionaries with metric information for all keywords.

Return type

list

sapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]

Returns embeddings (embeddings for both attributes at current timedate, embeddings for both attributes at previous timedate).

sapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]

Compute intersection between sets of keywords at different timedates.

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]

Compute symmetric approaching for all available keywords and attributes in contextual embeddings.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedates (np.ndarray) – Array of time values.

  • attributes (np.ndarray) – Array of attribute values, which define the subcorpora.

  • group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).

  • mode (str) – 'sequential' or 'fixed'.

  • time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.

  • group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.

  • groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.

  • group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of symmetric approaching similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.shift.SSTA[source]

Bases: BaseSemanticShift

Time-aware Self Similarity between word embeddings.

\[SS_{TA}(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1})\]

where:

  • \(w\) is a word

  • \(a\) is an attribute

  • \(k\) is a timestep
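For a fixed word and attribute, the measure can be sketched over consecutive timesteps. The sketch assumes that \(psim\) is the average pairwise cosine similarity and that timesteps sort chronologically as strings; the function names and the input layout are hypothetical.

```python
import math


def psim(xs, ys):
    """Average pairwise cosine similarity (assumed definition); vectors must be non-zero."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return sum(cosine(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def self_similarity_over_time(embeddings_by_time):
    """SS_TA(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1}) for each consecutive pair of timesteps."""
    times = sorted(embeddings_by_time)
    return {
        (k, k_next): psim(embeddings_by_time[k], embeddings_by_time[k_next])
        for k, k_next in zip(times, times[1:])
    }


# Embeddings of one word for one attribute, keyed by timestep.
embeddings_by_time = {
    "2020": [[1.0, 0.0]],
    "2021": [[1.0, 0.0], [0.0, 1.0]],
    "2022": [[0.0, 1.0]],
}
ssta = self_similarity_over_time(embeddings_by_time)
```

A low value between two timesteps suggests that the attribute's own usage of the word changed over that interval.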

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

attribute_exist_at_timedates(embeddings, attribute, timedate_source, timedate_target) bool[source]

True if attribute exists in contextual embeddings at two reference timedates.

keyword_exist_at_timedates(embeddings, keyword, attribute, timedate_source, timedate_target) bool[source]

True if keyword exists in contextual embeddings at two reference timedates.

transform(embeddings, timedates, attributes, keywords, group_embeddings: bool = True, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]

Compute time-aware self similarity.

Parameters
  • embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.

  • timedates (np.ndarray) – Array of time values.

  • attributes (np.ndarray) – Array of attribute values, which define the subcorpora.

  • keywords (list) – List of keywords.

  • group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).

  • group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.

  • groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.

  • group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of time-aware similarities between embeddings at different timedates.

Return type

pd.DataFrame

Module contents

semantics module