compshs.semantics package

Subpackages

compshs.semantics.tests package

Submodules

compshs.semantics.base module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.semantics.base.BaseSemanticShift[source]

Bases: object

Base class for Semantic Shift Detection.

Note: Semantic shift detection requires contextual word embeddings with additional attributes, e.g. time information.

compute_metric_for_timepair(embeddings, timedate_source, timedate_target, attribute_x, attribute_y, keywords_strategy, embeddings_strategy, metric_name) → list[source]

Generic computation for metrics between embeddings over time.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute
keywords_strategy – Strategy for selecting keyword intersection at different timedates.
embeddings_strategy – Strategy for selecting embeddings at different timedates.
metric_name (str) – Metric name.

Returns

List of dictionaries with metric information for all keywords.

Return type

list

get_common_attributes(embeddings: list, timedate_x: str, timedate_y: str) → set[source]: Compute the set of common attributes between subcorpora at two different timesteps.

get_common_keywords(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str) → set[source]: Compute the set of common keywords between subcorpora at two timesteps and for two attributes.
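The keyword intersection can be illustrated with a minimal sketch. It assumes the embeddings have already been grouped into a nested dictionary indexed by timedate, attribute, and keyword, and that a keyword counts as common only if it appears for both attributes at both timedates. The function name common_keywords and the sample data are hypothetical, not part of the library.

```python
def common_keywords(grouped, attribute_x, attribute_y, timedate_x, timedate_y):
    """Keywords present for both attributes at both timedates (illustrative only)."""
    keyword_sets = [
        set(grouped.get(timedate, {}).get(attribute, {}))
        for timedate in (timedate_x, timedate_y)
        for attribute in (attribute_x, attribute_y)
    ]
    return set.intersection(*keyword_sets)


# Hypothetical grouped embeddings: timedate -> attribute -> keyword -> vectors.
grouped = {
    "2020": {
        "left": {"tax": [[1.0, 0.0]], "climate": [[0.0, 1.0]]},
        "right": {"tax": [[0.9, 0.1]]},
    },
    "2021": {
        "left": {"tax": [[0.8, 0.2]]},
        "right": {"tax": [[0.7, 0.3]], "climate": [[0.1, 0.9]]},
    },
}

shared = common_keywords(grouped, "left", "right", "2020", "2021")
print(shared)  # {'tax'}
```

Only "tax" survives because "climate" is missing for at least one attribute at each timedate.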

get_group(keyword, groups, group_names)[source]

group_embeddings(embeddings: list, timedates: ndarray, attributes: ndarray) → dict[source]

Group dictionaries of contextual embeddings by time, attribute, and keyword.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values.

Returns

Single dictionary of keyword embeddings indexed by timedate, attribute, keyword.

Return type

dict
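A minimal sketch of this three-level grouping follows. It assumes each document is a dictionary mapping a keyword to a list of vectors, and that timedates[i] and attributes[i] describe document i. The real ContextualEmbedding.transform() output may be laid out differently, and the function name below is hypothetical.

```python
from collections import defaultdict


def group_by_time_attribute_keyword(embeddings, timedates, attributes):
    """Nest vectors as grouped[timedate][attribute][keyword] (illustrative only)."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc, timedate, attribute in zip(embeddings, timedates, attributes):
        for keyword, vectors in doc.items():
            grouped[timedate][attribute][keyword].extend(vectors)
    return grouped


# Hypothetical per-document embeddings (one dict per document).
docs = [
    {"tax": [[1.0, 0.0]]},
    {"tax": [[0.9, 0.1]], "climate": [[0.0, 1.0]]},
    {"tax": [[0.5, 0.5]]},
]
timedates = ["2020", "2020", "2021"]
attributes = ["left", "left", "right"]

grouped = group_by_time_attribute_keyword(docs, timedates, attributes)
print(len(grouped["2020"]["left"]["tax"]))  # 2
```

The two 2020 documents with attribute "left" both mention "tax", so their vectors end up in the same list.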

group_output_fixed(similarities: DataFrame, groups: list, group_names: list) → Tuple[DataFrame, list][source]

Group approaching-based semantic detection output for fixed mode.

Parameters

similarities (pd.DataFrame) – DataFrame containing approaching-based semantic similarities (see shift.py classes).
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

Tuple of grouped similarities for approaching-based semantic detection methods and ordered keywords list.

Return type

Tuple[pd.DataFrame, list]

group_output_sequential(similarities: DataFrame, metric: str, groups: list, group_names: list) → DataFrame[source]

Group approaching-based semantic detection output for sequential mode.

Parameters

similarities (pd.DataFrame) – DataFrame containing approaching-based semantic similarities (see shift.py classes).
metric (str) – Metric name.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

Grouped similarities for approaching-based semantic detection methods.

Return type

pd.DataFrame
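The grouping idea (map each keyword to its group, then average) can be sketched in plain Python. This is only an illustration: the library operates on DataFrames, while the sketch uses a hypothetical dictionary of per-keyword scores. The helper names, and the choice to keep ungrouped keywords under their own name, are assumptions.

```python
from collections import defaultdict
from statistics import mean


def get_group_name(keyword, groups, group_names):
    """Return the name of the group containing keyword, if any."""
    for members, name in zip(groups, group_names):
        if keyword in members:
            return name
    return keyword  # assumption: ungrouped keywords keep their own name


def average_by_group(scores, groups, group_names):
    """Average per-keyword scores within each group (illustrative only)."""
    buckets = defaultdict(list)
    for keyword, score in scores.items():
        buckets[get_group_name(keyword, groups, group_names)].append(score)
    return {name: mean(values) for name, values in buckets.items()}


# Hypothetical per-keyword similarities.
scores = {"tax": 0.5, "budget": 0.25, "climate": -0.5}
grouped_scores = average_by_group(scores, [["tax", "budget"]], ["economy"])
print(grouped_scores)  # {'economy': 0.375, 'climate': -0.5}
```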

compshs.semantics.concept_induction module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.semantics.concept_induction.ConceptInduction(local_params: dict, global_params: dict, distance: str = 'cosine')[source]

Bases: object

Concept Induction (as a generalization of Word Sense Induction) framework from Liétard et al. (2024).

Original implementation is available at https://github.com/blietard/concept-induction.

Note: This class implements the bi-level approach.

Parameters

local_params (dict) –
Dictionary of local parameters. Keys must contain:
- mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for local clustering (str, default = 'average').
- nu: Value of the local \(\nu\) hyperparameter (int)
global_params (dict) –
Dictionary of global parameters. Keys must contain:
- mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for global clustering (str, default = 'average').
- nu: Value of the global \(\nu\) hyperparameter (int)
distance (str) – Metric used for computing distance between clustered instances (default = 'cosine').

References

Liétard, B., Denis, P., & Keller, M. (2024). To word senses and beyond: Inducing concepts with contextualized language models. arXiv preprint arXiv:2406.20054.

average_senses(sense_clusters: dict) → tuple[source]

Average embeddings in local sense clusters.

Parameters: sense_clusters (dict) – Dictionary of sense clusters.
Returns: Array of average senses in clusters and the corresponding list of original keywords.
Return type: tuple
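The averaging step can be sketched as follows. The layout of sense_clusters (keyword mapped to a list of clusters, each cluster being a list of vectors) is an assumption made for illustration, and average_sense_clusters is a hypothetical name. The library itself returns a NumPy array, whereas this sketch uses plain lists.

```python
def average_sense_clusters(sense_clusters):
    """Average the vectors of each sense cluster (illustrative only).

    Returns the list of cluster means and, aligned with it, the keyword
    each mean originates from.
    """
    averages, keywords_origin = [], []
    for keyword, clusters in sense_clusters.items():
        for cluster in clusters:
            dim = len(cluster[0])
            centroid = [sum(vec[i] for vec in cluster) / len(cluster) for i in range(dim)]
            averages.append(centroid)
            keywords_origin.append(keyword)
    return averages, keywords_origin


# Hypothetical sense clusters: "bank" has two senses, "river" has one.
sense_clusters = {
    "bank": [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]]],
    "river": [[[0.0, 4.0], [0.0, 6.0]]],
}

averages, keywords_origin = average_sense_clusters(sense_clusters)
print(averages)         # [[2.0, 0.0], [0.0, 2.0], [0.0, 5.0]]
print(keywords_origin)  # ['bank', 'bank', 'river']
```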

get_concept_clusters(senses, keywords_origin: list, nu: int, distance: str = 'cosine', mode: str = 'average')[source]

Compute globally estimated concept clusters.

Formally, a concept \(c_k\) is a cluster of senses, and \(C = \{ c_k \}_{1 \leq k \leq p}\) is a partition of \(O\) in \(p\) concept clusters.

Parameters

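senses – Sense representations to cluster, e.g. the array of average senses returned by average_senses().
keywords_origin (list) – Keyword from which each sense originates, e.g. the list of keywords returned by average_senses().
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.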
distance (str) – Distance metric computed between senses (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').

Returns

A list of concepts as a soft clustering over original keywords.

Return type

list

get_sense_clusters(keyword_embeddings: dict, nu: int, distance: str = 'cosine', mode: str = 'average') → dict[source]

Compute locally estimated sense clusters (using sklearn AgglomerativeClustering algorithm).

Formally, from the set of contextual occurrences of keyword \(w\), denoted \(O^w\), it computes a partition \(S^w = \{ s_j^{w} \}_{1 \leq j \leq n_w}\).

The set of all sense clusters of all keywords is the union of all partitions, \(\bigcup_{w \in W} S^{w}\).

Parameters

keyword_embeddings (dict) – Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between keyword embeddings (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').

Returns

Dictionary of sense clusters, with keywords as keys and lists of contextual embeddings as values.

Return type

dict
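The threshold \(\tau = avg(d) - \nu \times std(d)\) described in the parameters can be computed as in the sketch below. It assumes \(d\) is the set of pairwise cosine distances between the occurrences of one keyword and uses the population standard deviation; both choices are assumptions, as are the function names. In scikit-learn, such a value could plausibly be passed to AgglomerativeClustering through its distance_threshold parameter (with n_clusters=None), but whether the library does exactly this is not stated here.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_threshold(vectors, nu):
    """tau = avg(d) - nu * std(d) over all pairwise distances (illustrative only)."""
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return mean(distances) - nu * pstdev(distances)


# Hypothetical contextual occurrences of a single keyword.
occurrences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tau_0 = distance_threshold(occurrences, nu=0)  # plain average distance
tau_1 = distance_threshold(occurrences, nu=1)  # stricter (smaller) threshold
print(tau_0, tau_1)
```

A larger \(\nu\) lowers the threshold, so fewer merges happen and more, finer-grained clusters are produced.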

group_embeddings(embeddings: list) → dict[source]

Group a list of contextual embeddings by keyword. In the wording of Liétard et al., this corresponds to the contextual occurrences of keywords.

Parameters: embeddings (list) – Pre-computed contextual word embeddings stored in a list of \(n\) elements, with \(n\) the number of documents in corpus.
Returns: Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
Return type: dict

transform(embeddings, group_by_keywords: bool = True)[source]

Perform concept induction given precomputed contextual embeddings.

Parameters

embeddings – Pre-computed contextual word embeddings.
group_by_keywords (bool) – Set to True if embeddings are in the format of ContextualEmbedding.transform() output (default).

Returns

A list of concepts as a soft clustering over original keywords.

Return type

list

compshs.semantics.shift module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.semantics.shift.AsApp[source]

Bases: BaseSemanticShift

Asymmetric Approaching between word embeddings.

\[\begin{split}asApp(w, a) = psim(I_{w,a,t+1}, I_{w,a^{\prime},t}) - psim(I_{w,a,t}, I_{w,a^{\prime},t})\end{split}\]

where \(I_{w,a,t}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(t\).

A positive value of \(asApp\) indicates that, for attribute \(a\) in the pair \((a,a^{\prime})\), the recent (time \(t+1\)) representation of word \(w\) has moved closer to the one initially used by attribute \(a^{\prime}\) (time \(t\)).

A negative value of \(asApp\) indicates that, for attribute \(a\) in the pair \((a,a^{\prime})\), the recent (time \(t+1\)) representation of word \(w\) has moved away from the one initially used by attribute \(a^{\prime}\) (time \(t\)).
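A small worked example of the formula follows. It assumes \(psim\) is the average pairwise cosine similarity between two sets of embeddings; the exact definition used by the library is not given in this reference. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Average pairwise cosine similarity between two sets (assumed definition)."""
    sims = [cosine(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def asapp(a_prev, a_curr, other_prev):
    """asApp(w, a) = psim(I_{a,t+1}, I_{a',t}) - psim(I_{a,t}, I_{a',t})."""
    return psim(a_curr, other_prev) - psim(a_prev, other_prev)


# Toy embeddings of one word: attribute a at t and t+1, attribute a' at t.
a_prev = [[1.0, 0.0]]
a_curr = [[1.0, 1.0]]
other_prev = [[0.0, 1.0]]

score = asapp(a_prev, a_curr, other_prev)
print(score > 0)  # True: a moved toward the earlier usage of a'
```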

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

asapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]: Returns embeddings (embeddings for both attributes at current and previous timedates, embeddings for both attributes at previous timedate).

asapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]: Compute intersection between sets of keywords at different timedates.

compute_asapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) → list[source]

Compute asapp metric for a timepair.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute

Returns

List of dictionaries with metric information for all keywords.

Return type

list

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute asymmetric approaching for all available keywords and attributes in contextual embeddings.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of asymmetric approaching similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.shift.DS[source]

Bases: BaseSemanticShift

Driving Strength between word embeddings.

\[DS(w,a) = \dfrac{asApp(w,a)}{|asApp(w,a)|+|asApp(w,a^{\prime})|}\]

Driving Strength is an asymmetric, time-aware, normalised measure indicating how much of the total approaching between two subcorpora is done by one side.
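The normalisation can be illustrated with precomputed \(asApp\) values. In this sketch, returning 0.0 when both values are zero is an assumption (the formula is undefined in that case and the library's behaviour is not documented here), and driving_strength is a hypothetical name.

```python
def driving_strength(asapp_a, asapp_other):
    """DS(w, a) = asApp(w, a) / (|asApp(w, a)| + |asApp(w, a')|)."""
    total = abs(asapp_a) + abs(asapp_other)
    if total == 0:
        return 0.0  # assumption: no approaching on either side
    return asapp_a / total


# Hypothetical asApp values for attributes a and a'.
ds_a = driving_strength(0.75, -0.25)
ds_other = driving_strength(-0.25, 0.75)
print(ds_a, ds_other)  # 0.75 -0.25
```

Here attribute a accounts for three quarters of the total movement and moves toward a', while a' moves away. The absolute values of the two scores sum to 1 whenever the denominator is non-zero.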

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, time_pair: Optional[tuple] = None, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute driving strength metric for all available keywords and attributes in contextual embeddings.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time).
groups (list) – Use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – Use this parameter to rename groups of keywords.

Returns

DataFrame of driving strength similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.shift.SApp[source]

Bases: BaseSemanticShift

Symmetric approaching between word embeddings.

\[sApp(w) = psim(I_{w,a,k+1}, I_{w,a^{\prime},k+1}) - psim(I_{w,a,k}, I_{w,a^{\prime},k})\]

where \(I_{w,a,k}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(k\).

A positive value for \(sApp(w)\) indicates that the word's semantics in the two subcorpora converged over time. Conversely, a negative value indicates that the word representations diverged over time.
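A small worked example of \(sApp\) follows. It assumes \(psim\) is the average pairwise cosine similarity between two sets of embeddings, which this reference does not spell out. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Average pairwise cosine similarity between two sets (assumed definition)."""
    sims = [cosine(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def sapp(a_prev, other_prev, a_curr, other_curr):
    """sApp(w) = psim(I_{a,k+1}, I_{a',k+1}) - psim(I_{a,k}, I_{a',k})."""
    return psim(a_curr, other_curr) - psim(a_prev, other_prev)


# Toy embeddings: orthogonal usages at step k, identical usages at step k+1.
a_prev, other_prev = [[1.0, 0.0]], [[0.0, 1.0]]
a_curr, other_curr = [[1.0, 1.0]], [[1.0, 1.0]]

score = sapp(a_prev, other_prev, a_curr, other_curr)
print(score > 0)  # True: the two subcorpora converged
```

Swapping the roles of the two attributes leaves the score unchanged, which is what makes the measure symmetric.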

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

compute_sapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) → list[source]

Compute sapp metric for a timepair.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute

Returns

List of dictionaries with metric information for all keywords.

Return type

list

sapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]: Returns embeddings (embeddings for both attributes at current timedate, embeddings for both attributes at previous timedate).

sapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]: Compute intersection between sets of keywords at different timedates.

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute symmetric approaching for all available keywords and attributes in contextual embeddings.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of symmetric approaching similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.shift.SSTA[source]

Bases: BaseSemanticShift

Time-aware Self Similarity between word embeddings.

\[SS_{TA}(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1})\]

where:

\(w\) is a word

\(a\) is an attribute

\(k\) is a timestep
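The measure can be sketched for one word and one attribute across consecutive timesteps. As before, \(psim\) is assumed to be the average pairwise cosine similarity between two sets of embeddings. The function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Average pairwise cosine similarity between two sets (assumed definition)."""
    sims = [cosine(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def ssta_series(embeddings_by_time):
    """SS_TA(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1}) for each consecutive pair."""
    times = sorted(embeddings_by_time)
    return {
        (t0, t1): psim(embeddings_by_time[t0], embeddings_by_time[t1])
        for t0, t1 in zip(times, times[1:])
    }


# Toy embeddings of one word for one attribute, keyed by timedate.
embeddings_by_time = {
    "2020": [[1.0, 0.0]],
    "2021": [[1.0, 0.0], [0.0, 1.0]],
    "2022": [[0.0, 1.0]],
}

series = ssta_series(embeddings_by_time)
print(series)  # {('2020', '2021'): 0.5, ('2021', '2022'): 0.5}
```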

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

attribute_exist_at_timedates(embeddings, attribute, timedate_source, timedate_target) → bool[source]: True if attribute exists in contextual embeddings at two reference timedates.

keyword_exist_at_timedates(embeddings, keyword, attribute, timedate_source, timedate_target) → bool[source]: True if keyword exists in contextual embeddings at two reference timedates.

transform(embeddings, timedates, attributes, keywords, group_embeddings: bool = True, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute time-aware self similarity.

Parameters

embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
keywords (list) – List of keywords.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of time-aware similarities between embeddings at different timedates.

Return type

pd.DataFrame

Module contents

semantics module