Semantics

Tools for semantic analysis.

Concept induction

class compshs.semantics.ConceptInduction(local_params: dict, global_params: dict, distance: str = 'cosine')[source]

Concept Induction framework (a generalization of Word Sense Induction) from Liétard et al. (2024).

Original implementation is available at https://github.com/blietard/concept-induction.

Note: This class implements the bi-level approach.

Parameters

local_params (dict) –
Dictionary of local parameters. Keys must contain:
- mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for local clustering (str, default = 'average').
- nu: Value of the local \(\nu\) hyperparameter (int).
global_params (dict) –
Dictionary of global parameters. Keys must contain:
- mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for global clustering (str, default = 'average').
- nu: Value of the global \(\nu\) hyperparameter (int).
distance (str) – Metric used for computing the distance between clustered instances (default = 'cosine').
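
A minimal sketch of the two parameter dictionaries, assuming only the keys documented above. The `check_params` helper is hypothetical (it is not part of compshs), and the `nu` values are arbitrary.

```python
# Illustrative parameter dictionaries; the nu values are arbitrary examples.
local_params = {"mode": "average", "nu": 1}
global_params = {"mode": "average", "nu": 2}

REQUIRED_KEYS = {"mode", "nu"}


def check_params(params: dict) -> bool:
    """Hypothetical helper: True if every documented key is present."""
    return REQUIRED_KEYS.issubset(params)


params_ok = check_params(local_params) and check_params(global_params)

# With compshs installed, the dictionaries would be passed as documented:
# from compshs.semantics import ConceptInduction
# ci = ConceptInduction(local_params, global_params, distance="cosine")
```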

References

Liétard, B., Denis, P., & Keller, M. (2024). To word senses and beyond: Inducing concepts with contextualized language models. arXiv preprint arXiv:2406.20054.

average_senses(sense_clusters: dict) → tuple[source]

Average embeddings in local sense clusters.

Parameters: sense_clusters (dict) – Dictionary of sense clusters.
Returns: Array of averaged sense embeddings (one per cluster) and the list of keywords each averaged sense originates from.
Return type: tuple
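
The averaging step can be sketched in plain Python. This is an illustration only, not the library's implementation: it assumes that `sense_clusters` maps each keyword to a list of clusters and that each cluster is a list of embedding vectors. The name `average_senses_sketch` is hypothetical.

```python
from statistics import fmean


def average_senses_sketch(sense_clusters: dict) -> tuple:
    """Average each cluster's vectors and record the keyword it came from."""
    senses, keywords_origin = [], []
    for keyword, clusters in sense_clusters.items():
        for cluster in clusters:
            # Component-wise mean over the vectors of one sense cluster.
            senses.append([fmean(dim) for dim in zip(*cluster)])
            keywords_origin.append(keyword)
    return senses, keywords_origin


# Toy input (assumed layout): "bank" has two sense clusters, "shore" has one.
toy = {
    "bank": [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]]],
    "shore": [[[0.0, 4.0], [0.0, 6.0]]],
}
senses, origin = average_senses_sketch(toy)
```

Each averaged sense keeps a pointer to its keyword, which is what lets the global step map concept clusters back to keywords.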

get_concept_clusters(senses, keywords_origin: list, nu: int, distance: str = 'cosine', mode: str = 'average')[source]

Compute globally estimated concept clusters.

Formally, a concept \(c_k\) is a cluster of senses, and \(C=\{c_k\}_{1 \leq k \leq p}\) is a partition of \(O\) in \(p\) concept clusters.

Parameters

senses –
keywords_origin (list) –
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between senses (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').

Returns

A list of concepts as a soft clustering over original keywords.

Return type

list
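
To illustrate why the result is a soft clustering over keywords, the sketch below maps sense-level cluster labels back to the keywords the senses came from. The function name and the label format are assumptions made for illustration. A keyword with several senses can end up in several concepts.

```python
def concepts_from_labels(labels: list, keywords_origin: list) -> list:
    """Turn one cluster label per sense into a list of keyword groups."""
    concepts = {}
    for label, keyword in zip(labels, keywords_origin):
        concepts.setdefault(label, set()).add(keyword)
    # One sorted keyword list per concept, ordered by cluster label.
    return [sorted(members) for _, members in sorted(concepts.items())]


# Three senses: two come from "bank", one from "shore".
labels = [0, 1, 1]
keywords_origin = ["bank", "bank", "shore"]
concepts = concepts_from_labels(labels, keywords_origin)
```

Here "bank" appears in both concepts because its two senses were assigned to different clusters.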

get_sense_clusters(keyword_embeddings: dict, nu: int, distance: str = 'cosine', mode: str = 'average') → dict[source]

Compute locally estimated sense clusters (using sklearn AgglomerativeClustering algorithm).

Formally, from the set of contextual occurrences of keyword \(w\), denoted \(O^w\), it computes a partition \(S^w = \{ s_j^{w} \}_{1 \leq j \leq n_w}\).

The set of all sense clusters of all keywords is the union of all partitions, \(\bigcup_{w \in W} S^{w}\).

Parameters

keyword_embeddings (dict) – Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between keyword embeddings (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').

Returns

Dictionary of sense clusters, with keywords as keys and lists of contextual embeddings as values.

Return type

dict
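
The threshold \(\tau = avg(d) - \nu \times std(d)\) can be sketched in plain Python. This is a hedged illustration, not the library's code: it assumes that \(d\) is the set of pairwise cosine distances between the occurrences of one keyword and that the population standard deviation is used. The helper names are hypothetical.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distance_threshold(vectors, nu):
    """tau = avg(d) - nu * std(d) over all pairwise distances d."""
    d = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    # Assumption: population standard deviation (pstdev), not sample.
    return fmean(d) - nu * pstdev(d)


# Three toy occurrences of one keyword; two of them are identical.
occurrences = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
tau = distance_threshold(occurrences, nu=1)
```

A larger `nu` lowers the threshold. If \(\tau\) is used as the merge cut-off of the agglomerative clustering (for example through sklearn's `distance_threshold` argument, which is an assumption about the implementation), a lower threshold produces more and smaller clusters.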

group_embeddings(embeddings: list) → dict[source]

Group a list of contextual embeddings by keyword. In the wording of Liétard et al., the result corresponds to the contextual occurrences of each keyword.

Parameters: embeddings (list) – Pre-computed contextual word embeddings stored in a list of \(n\) elements, with \(n\) the number of documents in corpus.
Returns: Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
Return type: dict
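
A minimal sketch of the grouping step. The input layout is an assumption, since the format of ContextualEmbedding.transform() is not described here: each document is taken to be a dictionary mapping a keyword to the list of its contextual vectors in that document. The name `group_by_keyword` is hypothetical.

```python
from collections import defaultdict


def group_by_keyword(embeddings: list) -> dict:
    """Merge per-document embeddings into one list of vectors per keyword."""
    grouped = defaultdict(list)
    for document in embeddings:
        for keyword, vectors in document.items():
            grouped[keyword].extend(vectors)
    return dict(grouped)


# Two toy documents (assumed layout): keyword -> list of vectors.
docs = [
    {"bank": [[0.1, 0.9]], "river": [[0.8, 0.2]]},
    {"bank": [[0.7, 0.3], [0.2, 0.8]]},
]
grouped = group_by_keyword(docs)
```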

transform(embeddings, group_by_keywords: bool = True)[source]

Perform concept induction given precomputed contextual embeddings.

Parameters

embeddings – Pre-computed contextual word embeddings.
group_by_keywords (bool) – Set to True if embeddings are in the format of ContextualEmbedding.transform() output (default).

Returns

A list of concepts as a soft clustering over original keywords.

Return type

list

Semantic shift detection

class compshs.semantics.SSTA[source]

Time-aware Self Similarity between word embeddings.

\[SS_{TA}(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1})\]

where:

\(w\) is a word

\(a\) is an attribute

\(k\) is a timestep
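
The formula can be illustrated with a small sketch. It assumes that \(psim\) is the average pairwise cosine similarity between two sets of embeddings, which this page does not state explicitly. The function names are hypothetical.

```python
import math
from itertools import product
from statistics import fmean


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def psim(set_a, set_b):
    """Assumed definition: mean cosine similarity over all cross-set pairs."""
    return fmean([cosine_similarity(u, v) for u, v in product(set_a, set_b)])


# Embeddings of word w with attribute a at timesteps k and k+1 (toy values).
I_k = [[1.0, 0.0], [0.0, 1.0]]
I_k_plus_1 = [[1.0, 0.0]]
ss_ta = psim(I_k, I_k_plus_1)
```

A value close to 1 would indicate that the word is used similarly at both timesteps within the same subcorpus.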

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

attribute_exist_at_timedates(embeddings, attribute, timedate_source, timedate_target) → bool[source]: True if attribute exists in contextual embeddings at two reference timedates.

keyword_exist_at_timedates(embeddings, keyword, attribute, timedate_source, timedate_target) → bool[source]: True if keyword exists in contextual embeddings at two reference timedates.

transform(embeddings, timedates, attributes, keywords, group_embeddings: bool = True, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute time-aware self similarity.

Parameters

embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
keywords (list) – List of keywords.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of time-aware similarities between embeddings at different timedates.

Return type

pd.DataFrame
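
The `groups` and `group_names` parameters can be illustrated with plain dictionaries. This sketch only mirrors the documented behaviour (scores of grouped keywords are averaged). The function name and the score layout are assumptions, and the real method returns a DataFrame.

```python
from statistics import fmean


def group_scores(scores: dict, groups: list, group_names: list) -> dict:
    """Average per-keyword scores inside each named group of keywords."""
    grouped = {}
    for name, members in zip(group_names, groups):
        present = [scores[k] for k in members if k in scores]
        if present:  # skip groups with no scored keyword
            grouped[name] = fmean(present)
    return grouped


# Toy per-keyword scores and a list of lists of keywords, as documented.
scores = {"tax": 0.5, "budget": 0.25, "climate": 0.75}
groups = [["tax", "budget"], ["climate"]]
group_names = ["economy", "environment"]
grouped_scores = group_scores(scores, groups, group_names)
```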

class compshs.semantics.SApp[source]

Symmetric approaching between word embeddings.

\[sApp(w) = psim(I_{w,a,k+1}, I_{w,a^{\prime},k+1}) - psim(I_{w,a,k}, I_{w,a^{\prime},k})\]

where \(I_{w,a,k}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(k\).

A positive value for \(sApp(w)\) indicates that the representations of the word in the two subcorpora moved closer together over time. Conversely, a negative value indicates that the word representations diverged over time.
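
As a worked example of the sign convention, the sketch below computes \(sApp\) from two precomputed \(psim\) values. The numbers are toy values and the function name is hypothetical.

```python
def sapp(psim_next: float, psim_prev: float) -> float:
    """sApp = psim at timestep k+1 minus psim at timestep k."""
    return psim_next - psim_prev


# Similarity between the two subcorpora rises from 0.5 to 0.75: approaching.
converging = sapp(psim_next=0.75, psim_prev=0.5)
# Similarity drops from 0.5 to 0.25: diverging.
diverging = sapp(psim_next=0.25, psim_prev=0.5)
```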

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

compute_sapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) → list[source]

Compute the sApp metric for a pair of timedates.

Parameters

embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute

Returns

List of dictionaries with metric information for all keywords.

Return type

list

sapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]: Return the embeddings of both attributes at the current timedate and the embeddings of both attributes at the previous timedate.

sapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]: Compute intersection between sets of keywords at different timedates.

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute symmetric approaching for all available keywords and attributes in contextual embeddings.

Parameters

embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of symmetric approaching similarities between embeddings at different timedates.

Return type

pd.DataFrame
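
The two values of `mode` can be pictured as two ways of choosing timedate pairs. The sketch below is an interpretation, not the library's code: it assumes that 'sequential' compares each timedate with the next one in sorted order and that 'fixed' compares only the given `time_pair`. The function name is hypothetical.

```python
def time_pairs(timedates, mode="sequential", time_pair=None):
    """Return the (previous, current) timedate pairs to compare."""
    if mode == "fixed":
        if time_pair is None:
            raise ValueError("time_pair is required when mode='fixed'")
        return [tuple(time_pair)]
    if mode == "sequential":
        # Assumption: consecutive pairs of the sorted, de-duplicated timedates.
        ordered = sorted(set(timedates))
        return list(zip(ordered, ordered[1:]))
    raise ValueError(f"unknown mode: {mode!r}")


timedates = ["2021", "2019", "2020", "2019"]
sequential_pairs = time_pairs(timedates)
fixed_pairs = time_pairs(timedates, mode="fixed", time_pair=("2019", "2021"))
```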

class compshs.semantics.AsApp[source]

Asymmetric Approaching between word embeddings.

\[asApp(w, a) = psim(I_{w,a,t+1}, I_{w,a^{\prime},t}) - psim(I_{w,a,t}, I_{w,a^{\prime},t})\]

where \(I_{w,a,t}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(t\).

A positive value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of word \(w\) that is closer to the one initially used by attribute \(a^{\prime}\) (time \(t\)).

A negative value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of word \(w\) that has moved away from the one initially used by attribute \(a^{\prime}\) (time \(t\)).
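
The asymmetry can be seen by computing \(asApp\) once per side of the pair from precomputed \(psim\) values, following the formula above. All numbers are toy values and the names are hypothetical.

```python
def asapp(psim_moved: float, psim_baseline: float) -> float:
    """asApp = psim(later side, earlier other side) - psim(both earlier)."""
    return psim_moved - psim_baseline


# Toy psim values for word w and the attribute pair (a, a').
psim_a_next_vs_b_now = 0.75  # psim(I_{w,a,t+1}, I_{w,a',t})
psim_b_next_vs_a_now = 0.5   # psim(I_{w,a',t+1}, I_{w,a,t})
psim_now = 0.5               # psim(I_{w,a,t}, I_{w,a',t})

asapp_a = asapp(psim_a_next_vs_b_now, psim_now)  # a moved towards a'
asapp_b = asapp(psim_b_next_vs_a_now, psim_now)  # a' did not move towards a
```

Unlike \(sApp\), the two directions can differ: here only attribute \(a\) approaches the earlier usage of \(a^{\prime}\).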

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

asapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]: Returns embeddings (embeddings for both attributes at current and previous timedates, embeddings for both attributes at previous timedate).

asapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]: Compute intersection between sets of keywords at different timedates.

compute_asapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) → list[source]

Compute the asApp metric for a pair of timedates.

Parameters

embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute

Returns

List of dictionaries with metric information for all keywords.

Return type

list

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute asymmetric approaching for all available keywords and attributes in contextual embeddings.

Parameters

embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.

Returns

DataFrame of asymmetric approaching similarities between embeddings at different timedates.

Return type

pd.DataFrame

class compshs.semantics.DS[source]

Driving Strength between word embeddings.

\[DS(w,a) = \dfrac{asApp(w,a)}{|asApp(w,a)|+|asApp(w,a^{\prime})|}\]

Driving Strength is an asymmetric, time-aware, normalised measure indicating how much of the total approaching between two subcorpora is done by one side.
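
A short sketch of the normalisation, using precomputed \(asApp\) values. The handling of a zero denominator is an assumption (the formula is undefined in that case), and the function name is hypothetical.

```python
import math


def driving_strength(asapp_a: float, asapp_b: float) -> float:
    """DS(w, a) = asApp(w, a) / (|asApp(w, a)| + |asApp(w, a')|)."""
    total = abs(asapp_a) + abs(asapp_b)
    if total == 0.0:
        # Assumption: the formula is undefined here, so NaN is returned.
        return math.nan
    return asapp_a / total


# Attribute a accounts for three quarters of the total approaching.
ds_a = driving_strength(0.75, 0.25)
ds_b = driving_strength(0.25, 0.75)
```

Because of the absolute values in the denominator, the result always lies between -1 and 1 when it is defined.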

References

Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.

transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, time_pair: Optional[tuple] = None, groups: Optional[list] = None, group_names: Optional[list] = None) → DataFrame[source]

Compute driving strength metric for all available keywords and attributes in contextual embeddings.

Parameters

embeddings (list) – List of Dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values defining the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time).
groups (list) – Use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – Use this parameter to rename groups of keywords.

Returns

DataFrame of driving strength values between embeddings at different timedates.

Return type

pd.DataFrame