compshs.semantics package
Subpackages
Submodules
compshs.semantics.base module
Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>
- class compshs.semantics.base.BaseSemanticShift
Bases: object
Base class for Semantic Shift Detection.
Note: Semantic shift detection requires contextual word embeddings with additional attributes, e.g. time information.
- compute_metric_for_timepair(embeddings, timedate_source, timedate_target, attribute_x, attribute_y, keywords_strategy, embeddings_strategy, metric_name) → list
Generic computation for metrics between embeddings over time.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute
keywords_strategy – Strategy for selecting keyword intersection at different timedates.
embeddings_strategy – Strategy for selecting embeddings at different timedates.
metric_name (str) – Metric name.
- Return type
List of dictionaries with metric information for all keywords.
- get_common_attributes(embeddings: list, timedate_x: str, timedate_y: str) → set
Compute the set of common attributes between subcorpora at two different timesteps.
- get_common_keywords(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str) → set
Compute the set of common keywords between subcorpora at two timesteps and for two attributes.
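The two intersections described above can be illustrated with plain Python sets. This is a minimal sketch, not the library's code: the nested layout `grouped[timedate][attribute][keyword]`, the function names, and the choice to intersect keywords across all four (timedate, attribute) combinations are assumptions made for illustration.

```python
# Hypothetical grouped layout: grouped[timedate][attribute][keyword] -> list of vectors.
grouped = {
    "2020": {
        "left": {"tax": [[0.1, 0.9]], "climate": [[0.7, 0.2]]},
        "right": {"tax": [[0.3, 0.8]]},
    },
    "2021": {
        "left": {"tax": [[0.2, 0.8]], "climate": [[0.6, 0.3]]},
        "right": {"tax": [[0.4, 0.7]], "climate": [[0.5, 0.5]]},
    },
}


def common_attributes(grouped, timedate_x, timedate_y):
    """Attributes that have embeddings at both timedates."""
    return set(grouped[timedate_x]) & set(grouped[timedate_y])


def common_keywords(grouped, attribute_x, attribute_y, timedate_x, timedate_y):
    """Keywords present for both attributes at both timedates (assumed rule)."""
    keyword_sets = [
        set(grouped[timedate][attribute])
        for timedate in (timedate_x, timedate_y)
        for attribute in (attribute_x, attribute_y)
    ]
    return set.intersection(*keyword_sets)


print(sorted(common_attributes(grouped, "2020", "2021")))  # ['left', 'right']
print(sorted(common_keywords(grouped, "left", "right", "2020", "2021")))  # ['tax']
```

Here "climate" is dropped because the "right" subcorpus has no occurrence of it in 2020.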
- group_embeddings(embeddings: list, timedates: ndarray, attributes: ndarray) → dict
Group dictionaries of contextual embeddings by time, attribute, and keyword.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values.
- Return type
Single dictionary of keyword embeddings indexed by timedate, attribute, keyword.
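A grouping of this kind can be sketched with nested dictionaries. The input format is an assumption: each element of `embeddings` is taken to be a dictionary mapping a keyword to the list of its contextual vectors in one document, and `timedates[i]` / `attributes[i]` are taken to describe document `i`. The actual output of `ContextualEmbedding.transform()` may differ, and the function name below is hypothetical.

```python
from collections import defaultdict


def group_by_time_attribute_keyword(embeddings, timedates, attributes):
    """Collect vectors under grouped[timedate][attribute][keyword]."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_embeddings, timedate, attribute in zip(embeddings, timedates, attributes):
        for keyword, vectors in doc_embeddings.items():
            grouped[timedate][attribute][keyword].extend(vectors)
    return grouped


# Three toy documents with 2-dimensional vectors (assumed format).
docs = [
    {"tax": [[0.1, 0.9]]},
    {"tax": [[0.2, 0.8]], "climate": [[0.7, 0.3]]},
    {"tax": [[0.5, 0.5]]},
]
timedates = ["2020", "2020", "2021"]
attributes = ["left", "left", "right"]

grouped = group_by_time_attribute_keyword(docs, timedates, attributes)
print(len(grouped["2020"]["left"]["tax"]))  # 2
```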
- group_output_fixed(similarities: DataFrame, groups: list, group_names: list) → Tuple[DataFrame, list]
Group approaching-based semantic detection output for fixed mode.
- Parameters
similarities (pd.DataFrame) – DataFrame containing approaching-based semantic similarities (see shift.py classes).
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Return type
Tuple of grouped similarities for approaching-based semantic detection methods and ordered keywords list.
- group_output_sequential(similarities: DataFrame, metric: str, groups: list, group_names: list) → DataFrame
Group approaching-based semantic detection output for sequential mode.
- Parameters
similarities (pd.DataFrame) – DataFrame containing approaching-based semantic similarities (see shift.py classes).
metric (str) – Metric name.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Return type
Grouped similarities for approaching-based semantic detection methods.
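The role of `groups` and `group_names` can be illustrated without pandas. The sketch below averages per-keyword scores inside each group, which matches the "average is used for grouping" behaviour documented for the `transform()` methods. The flat `scores` dictionary and the function name are hypothetical stand-ins for the DataFrame the library works with.

```python
from statistics import mean


def group_scores(scores, groups, group_names):
    """Average per-keyword scores within each named group of keywords."""
    grouped = {}
    for name, keywords in zip(group_names, groups):
        values = [scores[keyword] for keyword in keywords if keyword in scores]
        if values:  # skip groups with no scored keyword
            grouped[name] = mean(values)
    return grouped


# Hypothetical per-keyword similarity scores.
scores = {"tax": 0.25, "budget": 0.75, "climate": -0.5}
groups = [["tax", "budget"], ["climate"]]
group_names = ["economy", "environment"]

result = group_scores(scores, groups, group_names)
print(result)  # {'economy': 0.5, 'environment': -0.5}
```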
compshs.semantics.concept_induction module
Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>
- class compshs.semantics.concept_induction.ConceptInduction(local_params: dict, global_params: dict, distance: str = 'cosine')
Bases: object
Concept Induction (as a generalization of Word Sense Induction) framework from Liétard et al. (2024).
Original implementation is available at https://github.com/blietard/concept-induction.
Note: This class implements the bi-level approach.
- Parameters
local_params (dict) –
- Dictionary of local parameters. Keys must contain:
mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for local clustering (str, default = 'average').
nu: Value of the local \(\nu\) hyperparameter (int).
global_params (dict) –
- Dictionary of global parameters. Keys must contain:
mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for global clustering (str, default = 'average').
nu: Value of the global \(\nu\) hyperparameter (int).
distance (str) – Metric used for computing distance between clustered instances (default = 'cosine').
References
Liétard, B., Denis, P., & Keller, M. (2024). To word senses and beyond: Inducing concepts with contextualized language models. arXiv preprint arXiv:2406.20054.
- average_senses(sense_clusters: dict) → tuple
Average embeddings in local sense clusters.
- Parameters
sense_clusters (dict) – Dictionary of sense clusters.
- Returns
Array of average senses in clusters, original corresponding list of keywords.
- Return type
np.ndarray
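The averaging step can be sketched as a centroid computation per sense cluster. The layout of `sense_clusters` is an assumption here (keyword mapped to a list of clusters, each cluster being a list of vectors), and the helper name is hypothetical. The sketch returns plain lists rather than a NumPy array to stay dependency-free.

```python
def average_sense_vectors(sense_clusters):
    """Return one centroid per sense cluster plus the keyword each came from."""
    senses, keywords_origin = [], []
    for keyword, clusters in sense_clusters.items():
        for cluster in clusters:
            dim = len(cluster[0])
            centroid = [sum(vec[i] for vec in cluster) / len(cluster) for i in range(dim)]
            senses.append(centroid)
            keywords_origin.append(keyword)
    return senses, keywords_origin


# Assumed layout: keyword -> list of clusters -> list of vectors.
sense_clusters = {
    "bank": [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]]],
    "shore": [[[0.0, 4.0], [0.0, 6.0]]],
}

senses, keywords_origin = average_sense_vectors(sense_clusters)
print(senses)           # [[2.0, 0.0], [0.0, 2.0], [0.0, 5.0]]
print(keywords_origin)  # ['bank', 'bank', 'shore']
```

Keeping `keywords_origin` aligned with `senses` is what later allows concept clusters over senses to be mapped back to keywords.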
- get_concept_clusters(senses, keywords_origin: list, nu: int, distance: str = 'cosine', mode: str = 'average')
Compute globally estimated concept clusters.
Formally, a concept \(c_k\) is a cluster of senses, and \(C=\{c_k\}_{1 \leq k \leq p}\) is a partition of \(O\) in \(p\) concept clusters.
- Parameters
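senses – Averaged sense embeddings, as returned by average_senses().
keywords_origin (list) – Keyword from which each sense originates, in the same order as senses.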
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between senses (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').
- Returns
A list of concepts as a soft clustering over original keywords.
- Return type
list
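Why the result is a soft clustering over keywords can be seen from a small sketch: senses are partitioned into concepts, but a keyword with several senses can end up in several concepts. The concept labels below are made up for illustration, and the helper name is hypothetical.

```python
def concepts_as_keyword_sets(concept_labels, keywords_origin):
    """Map a hard clustering of senses to (possibly overlapping) keyword sets."""
    concepts = {}
    for label, keyword in zip(concept_labels, keywords_origin):
        concepts.setdefault(label, set()).add(keyword)
    return [concepts[label] for label in sorted(concepts)]


# One entry per sense: "bank" has two senses, assigned to different concepts.
keywords_origin = ["bank", "bank", "shore", "loan"]
concept_labels = [0, 1, 1, 0]  # hypothetical output of the global clustering

concepts = concepts_as_keyword_sets(concept_labels, keywords_origin)
print([sorted(c) for c in concepts])  # [['bank', 'loan'], ['bank', 'shore']]
```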
- get_sense_clusters(keyword_embeddings: dict, nu: int, distance: str = 'cosine', mode: str = 'average') → dict
Compute locally estimated sense clusters (using sklearn’s AgglomerativeClustering algorithm).
Formally, from the set of contextual occurrences of keyword \(w\), denoted \(O^w\), it computes a partition \(S^w = \{ s_j^{w} \}_{1 \leq j \leq n_w}\).
The set of all sense clusters of all keywords is the union of all partitions, \(\bigcup_{w \in W} S^{w}\).
- Parameters
keyword_embeddings (dict) – Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between keyword embeddings (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').
- Returns
Dictionary of sense clusters, with keywords as keys and lists of contextual embeddings as values.
- Return type
dict
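The threshold \(\tau = avg(d) - \nu \times std(d)\) used with `nu` can be worked through on a toy example. This sketch only computes \(\tau\) from pairwise cosine distances. Using the population standard deviation, and feeding \(\tau\) to the clustering as a distance cut-off, are assumptions rather than documented behaviour, and the helper names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distance_threshold(vectors, nu):
    """tau = avg(d) - nu * std(d) over all pairwise distances d."""
    d = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return mean(d) - nu * pstdev(d)


# Four occurrences of one keyword, forming two obvious senses.
occurrences = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

tau_0 = distance_threshold(occurrences, nu=0)  # plain average distance
tau_1 = distance_threshold(occurrences, nu=1)  # stricter threshold
print(round(tau_0, 4), round(tau_1, 4))
```

A larger `nu` lowers \(\tau\), so fewer merges fall under the threshold and more, finer-grained clusters would be expected.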
- group_embeddings(embeddings: list) → dict
Group a list of contextual embeddings by keyword. In the wording of Liétard et al., these correspond to the contextual occurrences of keywords.
- Parameters
embeddings (list) – Pre-computed contextual word embeddings stored in a list of \(n\) elements, with \(n\) the number of documents in corpus.
- Returns
Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
- Return type
dict
- transform(embeddings, group_by_keywords: bool = True)
Perform concept induction given precomputed contextual embeddings.
- Parameters
embeddings – Pre-computed contextual word embeddings.
group_by_keywords (bool) – Set to True if embeddings are in the format of ContextualEmbedding.transform() output (default).
- Returns
A list of concepts as a soft clustering over original keywords.
- Return type
list
compshs.semantics.shift module
Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>
- class compshs.semantics.shift.AsApp
Bases: BaseSemanticShift
Asymmetric Approaching between word embeddings.
\[asApp(w, a) = psim(I_{w,a,t+1}, I_{w,a^{\prime},t}) - psim(I_{w,a,t}, I_{w,a^{\prime},t})\]
where \(I_{w,a,t}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(t\).
A positive value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of a word \(w\) that is closer to the one initially used by attribute \(a^{\prime}\) (time \(t\)).
A negative value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of a word \(w\) that has moved away from the one initially used by attribute \(a^{\prime}\) (time \(t\)).
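The formula can be checked on a toy case. In this sketch `psim` is assumed to be the average pairwise cosine similarity between two sets of vectors; the function names and the 2-dimensional vectors are illustrative and are not taken from the library.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Average pairwise cosine similarity between two sets (assumed definition)."""
    sims = [cosine(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def as_app(a_next, a_now, other_now):
    """asApp(w, a) = psim(I_{a,t+1}, I_{a',t}) - psim(I_{a,t}, I_{a',t})."""
    return psim(a_next, other_now) - psim(a_now, other_now)


a_t = [[1.0, 0.0]]        # attribute a at time t
a_t_plus_1 = [[0.0, 1.0]]  # attribute a at time t+1
b_t = [[0.0, 1.0]]        # attribute a' at time t

score = as_app(a_t_plus_1, a_t, b_t)
print(score)  # 1.0: a has fully adopted the earlier usage of a'
```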
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
- asapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)
Returns embeddings (embeddings for both attributes at current and previous timedates, embeddings for both attributes at previous timedate).
- asapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)
Compute intersection between sets of keywords at different timedates.
- compute_asapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) → list
Compute asapp metric for a timepair.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute
- Returns
List of dictionaries with metric information for all keywords.
- Return type
list
- transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame
Compute asymmetric approaching for all available keywords and attributes in contextual embeddings.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Returns
DataFrame of asymmetric approaching similarities between embeddings at different timedates.
- Return type
pd.DataFrame()
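The difference between the two `mode` values can be pictured as a choice of time pairs. The sketch below is an interpretation based on the parameter names only: it assumes 'sequential' compares each timedate with the next one in sorted order, and 'fixed' compares the single pair given in `time_pair`. The helper is hypothetical and is not part of the library.

```python
def time_pairs(timedates, mode="sequential", time_pair=None):
    """Return the (prev_time, curr_time) pairs to compare (assumed semantics)."""
    if mode == "sequential":
        ordered = sorted(set(timedates))
        return list(zip(ordered, ordered[1:]))
    if mode == "fixed":
        if time_pair is None:
            raise ValueError("time_pair is required when mode='fixed'")
        return [tuple(time_pair)]
    raise ValueError(f"unknown mode: {mode!r}")


timedates = ["2021", "2020", "2022", "2020"]
sequential_pairs = time_pairs(timedates)
fixed_pairs = time_pairs(timedates, mode="fixed", time_pair=("2020", "2022"))

print(sequential_pairs)  # [('2020', '2021'), ('2021', '2022')]
print(fixed_pairs)       # [('2020', '2022')]
```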
- class compshs.semantics.shift.DS
Bases: BaseSemanticShift
Driving Strength between word embeddings.
\[DS(w,a) = \dfrac{asApp(w,a)}{|asApp(w,a)|+|asApp(w,a^{\prime})|}\]
Driving Strength is an asymmetric, time-aware, normalised measure indicating how much of the total approaching between two subcorpora is done by one side.
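Given the two asymmetric approaching values of a pair, the normalisation is a one-line ratio. The sketch below takes those values as plain floats. Returning 0.0 when both values are zero is an assumption made to avoid a division by zero; the source does not say how that case is handled.

```python
def driving_strength(as_app_a, as_app_other):
    """DS(w, a) = asApp(w, a) / (|asApp(w, a)| + |asApp(w, a')|)."""
    total = abs(as_app_a) + abs(as_app_other)
    if total == 0:
        return 0.0  # assumption: no movement on either side
    return as_app_a / total


# Hypothetical asApp values for attributes a and a'.
ds_a = driving_strength(0.75, 0.25)
ds_other = driving_strength(0.25, 0.75)
print(ds_a, ds_other)  # 0.75 0.25
```

In this example both sides approach each other, and attribute a accounts for three quarters of the total movement.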
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
- transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, time_pair: Optional[tuple] = None, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame
Compute driving strength metric for all available keywords and attributes in contextual embeddings.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time).
groups (list) – Use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – Use this parameter to rename groups of keywords.
- Returns
DataFrame of driving strength similarities between embeddings at different timedates.
- Return type
pd.DataFrame()
- class compshs.semantics.shift.SApp
Bases: BaseSemanticShift
Symmetric approaching between word embeddings.
\[sApp(w) = psim(I_{w,a,k+1}, I_{w,a^{\prime},k+1}) - psim(I_{w,a,k}, I_{w,a^{\prime},k})\]
where \(I_{w,a,k}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(k\).
A positive value for \(sApp(w)\) indicates that the two subcorpora have moved toward closer semantics for word \(w\) over time. Conversely, a negative value indicates that their word representations have diverged over time.
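A toy computation makes the sign convention concrete. As an assumption, `psim` is taken to be the average pairwise cosine similarity between two sets of vectors; the names and vectors below are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Average pairwise cosine similarity between two sets (assumed definition)."""
    sims = [cosine(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def s_app(a_prev, other_prev, a_next, other_next):
    """sApp(w) = psim(I_{a,k+1}, I_{a',k+1}) - psim(I_{a,k}, I_{a',k})."""
    return psim(a_next, other_next) - psim(a_prev, other_prev)


# At step k the two subcorpora use the word orthogonally; at k+1 identically.
converging = s_app([[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 1.0]], [[0.0, 1.0]])
diverging = s_app([[0.0, 1.0]], [[0.0, 1.0]], [[1.0, 0.0]], [[0.0, 1.0]])
print(converging, diverging)  # 1.0 -1.0
```

Unlike asApp, the measure is symmetric: swapping the two attributes leaves the value unchanged.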
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
- compute_sapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) → list
Compute sapp metric for a timepair.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute
- Returns
List of dictionaries with metric information for all keywords.
- Return type
list
- sapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)
Returns embeddings (embeddings for both attributes at current timedate, embeddings for both attributes at previous timedate).
- sapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)
Compute intersection between sets of keywords at different timedates.
- transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame
Compute symmetric approaching for all available keywords and attributes in contextual embeddings.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Returns
DataFrame of symmetric approaching similarities between embeddings at different timedates.
- Return type
pd.DataFrame()
- class compshs.semantics.shift.SSTA
Bases: BaseSemanticShift
Time-aware Self Similarity between word embeddings.
\[SS_{TA}(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1})\]
where:
\(w\) is a word
\(a\) is an attribute
\(k\) is a timestep
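For a single word and attribute, the measure reduces to comparing each timestep with the next. The sketch assumes that `psim` is the average pairwise cosine similarity and that timesteps sort chronologically as strings; the helper names and the dictionary layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Average pairwise cosine similarity between two sets (assumed definition)."""
    sims = [cosine(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def ss_ta(embeddings_by_time):
    """SS_TA(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1}) for consecutive timesteps."""
    times = sorted(embeddings_by_time)
    return {
        (k, k_next): psim(embeddings_by_time[k], embeddings_by_time[k_next])
        for k, k_next in zip(times, times[1:])
    }


# Embeddings of one word for one attribute, keyed by timestep (assumed layout).
embeddings_by_time = {
    "2020": [[1.0, 0.0]],
    "2021": [[1.0, 0.0]],
    "2022": [[0.0, 1.0]],
}

self_similarity = ss_ta(embeddings_by_time)
print(self_similarity)  # {('2020', '2021'): 1.0, ('2021', '2022'): 0.0}
```

A drop in self similarity between two steps, as between 2021 and 2022 here, points to a change in how the subcorpus uses the word.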
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
- attribute_exist_at_timedates(embeddings, attribute, timedate_source, timedate_target) → bool
True if attribute exists in contextual embeddings at two reference timedates.
- keyword_exist_at_timedates(embeddings, keyword, attribute, timedate_source, timedate_target) → bool
True if keyword exists in contextual embeddings at two reference timedates.
- transform(embeddings, timedates, attributes, keywords, group_embeddings: bool = True, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame
Compute time-aware self similarity.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
keywords (list) – List of keywords.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Returns
DataFrame of time-aware similarities between embeddings at different timedates.
- Return type
pd.DataFrame()
Module contents
semantics module