Semantics
Tools for semantic analysis.
Concept induction
- class compshs.semantics.ConceptInduction(local_params: dict, global_params: dict, distance: str = 'cosine')[source]
Concept Induction (a generalization of Word Sense Induction) framework from Liétard et al. (2024).
Original implementation is available at https://github.com/blietard/concept-induction.
Note: This class implements the bi-level approach.
- Parameters
local_params (dict) –
- Dictionary of local parameters. Keys must contain:
mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for local clustering (str, default = 'average').
nu: Value of the local \(\nu\) hyperparameter (int).
global_params (dict) –
- Dictionary of global parameters. Keys must contain:
mode: Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm for global clustering (str, default = 'average').
nu: Value of the global \(\nu\) hyperparameter (int).
distance (str) – Metric used for computing distance between clustered instances (default = 'cosine').
References
Liétard, B., Denis, P., & Keller, M. (2024). To word senses and beyond: Inducing concepts with contextualized language models. arXiv preprint arXiv:2406.20054.
- average_senses(sense_clusters: dict) tuple[source]
Average embeddings in local sense clusters.
- Parameters
sense_clusters (dict) – Dictionary of sense clusters.
- Returns
Array of average sense embeddings (one per cluster) and the corresponding list of origin keywords.
- Return type
tuple (np.ndarray, list)
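The averaging step can be pictured with a minimal sketch. The input layout (a dict mapping each keyword to a list of clusters, each cluster being a list of vectors) and the name average_senses_sketch are assumptions made for this illustration, not the library's confirmed format.

```python
def average_senses_sketch(sense_clusters):
    """Return (centroids, keywords_origin) for a dict of sense clusters.

    Assumed layout: {keyword: [cluster, ...]}, where each cluster is a
    list of equal-length vectors (lists of floats).
    """
    centroids, keywords_origin = [], []
    for keyword, clusters in sense_clusters.items():
        for cluster in clusters:
            dim = len(cluster[0])
            # Component-wise mean of all vectors in the cluster.
            centroid = [sum(vec[i] for vec in cluster) / len(cluster) for i in range(dim)]
            centroids.append(centroid)
            # Remember which keyword this sense centroid came from.
            keywords_origin.append(keyword)
    return centroids, keywords_origin


example = {
    "bank": [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]]],
    "river": [[[0.0, 4.0], [0.0, 6.0]]],
}
centroids, origin = average_senses_sketch(example)
```

Keeping the origin list aligned with the centroids is what lets a later global step map sense clusters back to keywords.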
- get_concept_clusters(senses, keywords_origin: list, nu: int, distance: str = 'cosine', mode: str = 'average')[source]
Compute globally estimated concept clusters.
Formally, a concept \(c_k\) is a cluster of senses, and \(C=\{c_k\}_{1 \leq k \leq p}\) is a partition of \(O\) in \(p\) concept clusters.
- Parameters
senses –
keywords_origin (list) –
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between senses (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').
- Returns
A list of concepts as a soft clustering over original keywords.
- Return type
list
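To illustrate why the result is a soft clustering over keywords, here is a small sketch that maps concept labels assigned to senses back to their origin keywords. The labels list and the helper name concepts_from_labels are hypothetical; the actual output format of get_concept_clusters may differ.

```python
from collections import defaultdict


def concepts_from_labels(labels, keywords_origin):
    """Group keywords by the concept label assigned to each of their senses."""
    concepts = defaultdict(set)
    for label, keyword in zip(labels, keywords_origin):
        concepts[label].add(keyword)
    # Sort labels and keywords for a deterministic output order.
    return [sorted(concepts[label]) for label in sorted(concepts)]


# Four senses: "bank" has two senses that land in different concepts.
labels = [0, 1, 0, 1]
keywords_origin = ["bank", "bank", "shore", "finance"]
concepts = concepts_from_labels(labels, keywords_origin)
```

Because a keyword can have several senses, it may appear in more than one concept, as "bank" does here.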
- get_sense_clusters(keyword_embeddings: dict, nu: int, distance: str = 'cosine', mode: str = 'average') dict[source]
Compute locally estimated sense clusters (using sklearn’s AgglomerativeClustering algorithm).
Formally, from the set of contextual occurrences of keyword \(w\), denoted \(O^w\), it computes a partition \(S^w = \{ s_j^{w} \}_{1 \leq j \leq n_w}\).
The set of all sense clusters of all keywords is the union of all partitions, \(\bigcup_{w \in W} S^{w}\).
- Parameters
keyword_embeddings (dict) – Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
nu (int) – Hyperparameter used in \(\tau = avg(d) - \nu \times std(d)\), with \(d\) the distribution of distances between clustered instances.
distance (str) – Distance metric computed between keyword embeddings (default = 'cosine').
mode (str) – Which linkage criterion to use in sklearn’s AgglomerativeClustering() algorithm (default = 'average').
- Returns
Dictionary of sense clusters, with keywords as keys and lists of contextual embeddings as values.
- Return type
dict
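The threshold \(\tau = avg(d) - \nu \times std(d)\) can be worked through with a short sketch. It assumes that \(d\) collects all pairwise cosine distances and that std is the population standard deviation; both are assumptions, and the helper names are hypothetical. How the library feeds \(\tau\) to the clustering step is not shown here.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_threshold(vectors, nu):
    """tau = avg(d) - nu * std(d) over all pairwise cosine distances."""
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    # Assumption: population standard deviation.
    return mean(distances) - nu * pstdev(distances)


occurrences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tau_0 = distance_threshold(occurrences, nu=0)  # plain average distance
tau_1 = distance_threshold(occurrences, nu=1)  # one std below the average
```

A larger \(\nu\) lowers \(\tau\), which would make a distance-based cut-off stricter and yield more, smaller clusters.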
- group_embeddings(embeddings: list) dict[source]
Group a list of contextual embeddings by keywords. In the wording of Liétard et al., this corresponds to the contextual occurrences of keywords.
- Parameters
embeddings (list) – Pre-computed contextual word embeddings stored in a list of \(n\) elements, with \(n\) the number of documents in corpus.
- Returns
Dictionary of embeddings, with keyword as keys and list of contextual embeddings as values.
- Return type
dict
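A minimal sketch of the grouping step follows. It assumes each document is represented by a dict mapping keywords to the list of their contextual embeddings in that document; this layout and the name group_by_keyword are illustrative assumptions, not the confirmed ContextualEmbedding.transform() format.

```python
from collections import defaultdict


def group_by_keyword(embeddings):
    """Merge per-document {keyword: [vectors]} dicts into one dict."""
    grouped = defaultdict(list)
    for document in embeddings:
        for keyword, vectors in document.items():
            # Collect every contextual occurrence of the keyword.
            grouped[keyword].extend(vectors)
    return dict(grouped)


docs = [
    {"bank": [[0.1, 0.2]], "river": [[0.3, 0.4]]},
    {"bank": [[0.5, 0.6], [0.7, 0.8]]},
]
grouped = group_by_keyword(docs)
```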
- transform(embeddings, group_by_keywords: bool = True)[source]
Perform concept induction given precomputed contextual embeddings.
- Parameters
embeddings – Pre-computed contextual word embeddings.
group_by_keywords (bool) – Set to True if embeddings are in the format of ContextualEmbedding.transform() output (default).
- Returns
A list of concepts as a soft clustering over original keywords.
- Return type
list
Semantic shift detection
- class compshs.semantics.SSTA[source]
Time-aware Self Similarity between word embeddings.
\[SS_{TA}(w, a, k) = psim(I_{w,a,k}, I_{w,a,k+1})\]
where:
\(w\) is a word
\(a\) is an attribute
\(k\) is a timestep
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
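The \(SS_{TA}\) formula can be sketched as follows, assuming that psim is the mean cosine similarity over all cross pairs of two embedding sets. That definition, the dictionary layout, and the helper names are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Mean cosine similarity over all cross pairs (assumed definition)."""
    sims = [cosine_similarity(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


# emb[(w, a, k)] -> contextual embeddings of word w, attribute a, timestep k
# (hypothetical layout).
emb = {
    ("tax", "left", 0): [[1.0, 0.0]],
    ("tax", "left", 1): [[1.0, 0.0], [0.0, 1.0]],
}
ss_ta = psim(emb[("tax", "left", 0)], emb[("tax", "left", 1)])
```

In this toy case one of the two later usages matches the earlier one and the other is orthogonal to it, so the self similarity is halfway between 0 and 1.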
- attribute_exist_at_timedates(embeddings, attribute, timedate_source, timedate_target) bool[source]
True if attribute exists in contextual embeddings at two reference timedates.
- keyword_exist_at_timedates(embeddings, keyword, attribute, timedate_source, timedate_target) bool[source]
True if keyword exists in contextual embeddings at two reference timedates.
- transform(embeddings, timedates, attributes, keywords, group_embeddings: bool = True, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]
Compute time-aware self similarity.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
keywords (list) – List of keywords.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Returns
DataFrame of time-aware similarities between embeddings at different timedates.
- Return type
pd.DataFrame
- class compshs.semantics.SApp[source]
Symmetric approaching between word embeddings.
\[sApp(w) = psim(I_{w,a,k+1}, I_{w,a^{\prime},k+1}) - psim(I_{w,a,k}, I_{w,a^{\prime},k})\]
where \(I_{w,a,k}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(k\).
A positive value of \(sApp(w)\) indicates that the two subcorpora moved towards a closer semantics for the word over time. Conversely, a negative value indicates that the word representations diverged over time.
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
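As a worked illustration of the \(sApp\) formula, the sketch below assumes psim is the mean pairwise cosine similarity between two embedding sets; the helper names and the toy data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Mean cosine similarity over all cross pairs (assumed definition)."""
    sims = [cosine_similarity(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def s_app(a_prev, b_prev, a_curr, b_curr):
    """sApp = psim at timestep k+1 minus psim at timestep k."""
    return psim(a_curr, b_curr) - psim(a_prev, b_prev)


# Two subcorpora use the word orthogonally at k and identically at k+1.
a_k, b_k = [[1.0, 0.0]], [[0.0, 1.0]]
a_k1, b_k1 = [[1.0, 0.0]], [[1.0, 0.0]]
score = s_app(a_k, b_k, a_k1, b_k1)
```

The score is positive here because the two sides converged; swapping the two timesteps would flip its sign.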
- compute_sapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) list[source]
Compute sapp metric for a timepair.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute
- Returns
List of dictionaries with metric information for all keywords.
- Return type
list
- sapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]
Return two sets of embeddings: the embeddings for both attributes at the current timedate, and the embeddings for both attributes at the previous timedate.
- sapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]
Compute intersection between sets of keywords at different timedates.
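The intersection step could look like the sketch below. The nested layout grouped[timedate][attribute] and the choice to intersect across both attributes as well as both timedates are assumptions; the name shared_keywords is hypothetical.

```python
def shared_keywords(grouped, attribute_x, attribute_y, timedate_x, timedate_y):
    """Keywords present for both attributes at both timedates.

    Assumed layout: grouped[timedate][attribute] -> {keyword: embeddings}.
    """
    keyword_sets = [
        set(grouped[timedate][attribute])
        for timedate in (timedate_x, timedate_y)
        for attribute in (attribute_x, attribute_y)
    ]
    return set.intersection(*keyword_sets)


grouped = {
    "2019": {"left": {"tax": [], "work": []}, "right": {"tax": [], "work": []}},
    "2020": {"left": {"tax": []}, "right": {"tax": [], "care": []}},
}
shared = shared_keywords(grouped, "left", "right", "2019", "2020")
```

Restricting the metric to shared keywords ensures every embedding set in the formula is non-empty.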
- transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]
Compute symmetric approaching for all available keywords and attributes in contextual embeddings.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Returns
DataFrame of symmetric approaching similarities between embeddings at different timedates.
- Return type
pd.DataFrame
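One plausible reading of the mode and time_pair parameters is sketched below. It assumes that 'sequential' compares each timedate with the next one in sorted order and that 'fixed' compares only the supplied pair; this interpretation and the name build_time_pairs are assumptions, not documented behaviour.

```python
def build_time_pairs(timedates, mode="sequential", time_pair=None):
    """Return the (prev_time, curr_time) pairs to compare.

    Assumption: 'sequential' pairs each distinct timedate with the next
    one in sorted order; 'fixed' uses only the supplied time_pair.
    """
    if mode == "fixed":
        if time_pair is None:
            raise ValueError("time_pair is required when mode='fixed'")
        return [tuple(time_pair)]
    if mode == "sequential":
        ordered = sorted(set(timedates))
        return list(zip(ordered, ordered[1:]))
    raise ValueError(f"unknown mode: {mode!r}")


timedates = ["2021", "2019", "2020", "2020"]
sequential_pairs = build_time_pairs(timedates)
fixed_pairs = build_time_pairs(timedates, mode="fixed", time_pair=("2019", "2021"))
```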
- class compshs.semantics.AsApp[source]
Asymmetric Approaching between word embeddings.
\[asApp(w, a) = psim(I_{w,a,t+1}, I_{w,a^{\prime},t}) - psim(I_{w,a,t}, I_{w,a^{\prime},t})\]
where \(I_{w,a,t}\) is the set of contextual embeddings of word \(w\), with attribute \(a\), at timestep \(t\).
A positive value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of a word \(w\) that is closer to the one initially used by attribute \(a^{\prime}\) (time \(t\)).
A negative value of \(asApp\) indicates that attribute \(a\) in the pair \((a,a^{\prime})\) has a recent (time \(t+1\)) representation of a word \(w\) that has moved away from the one initially used by attribute \(a^{\prime}\) (time \(t\)).
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
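The asymmetry of \(asApp\) shows up in a small worked sketch. It assumes psim is the mean pairwise cosine similarity between two embedding sets; the dictionary layout and helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def psim(set_x, set_y):
    """Mean cosine similarity over all cross pairs (assumed definition)."""
    sims = [cosine_similarity(u, v) for u in set_x for v in set_y]
    return sum(sims) / len(sims)


def as_app(emb, a, a_prime):
    """asApp(w, a): how far a at t+1 moved towards a' at t."""
    return psim(emb[(a, "t+1")], emb[(a_prime, "t")]) - psim(emb[(a, "t")], emb[(a_prime, "t")])


# Attribute "a" adopts the usage of "b"; "b" does not change.
emb = {
    ("a", "t"): [[1.0, 0.0]],
    ("a", "t+1"): [[0.0, 1.0]],
    ("b", "t"): [[0.0, 1.0]],
    ("b", "t+1"): [[0.0, 1.0]],
}
a_towards_b = as_app(emb, "a", "b")
b_towards_a = as_app(emb, "b", "a")
```

Only the side that changed its usage receives a non-zero score, which is the distinction the symmetric measure cannot make.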
- asapp_embeddings_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str, keyword: str)[source]
Return two sets of embeddings: the embeddings for the two attributes at the current and previous timedates respectively, and the embeddings for both attributes at the previous timedate.
- asapp_keywords_strategy(embeddings: list, attribute_x: str, attribute_y: str, timedate_x: str, timedate_y: str)[source]
Compute intersection between sets of keywords at different timedates.
- compute_asapp_for_timepair(embeddings: list, timedate_source: str, timedate_target: str, attribute_x: str, attribute_y: str) list[source]
Compute asapp metric for a timepair.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedate_source (str) – Source timedate
timedate_target (str) – Target timedate
attribute_x (str) – First attribute
attribute_y (str) – Second attribute
- Returns
List of dictionaries with metric information for all keywords.
- Return type
list
- transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, mode: str = 'sequential', time_pair: Optional[tuple] = None, group_output: bool = False, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]
Compute asymmetric approaching for all available keywords and attributes in contextual embeddings.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
mode (str) – 'sequential' or 'fixed'.
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time), if mode = 'fixed'.
group_output (bool) – If True, group output dataframe using groups. Average is used for grouping.
groups (list) – When group_output is set to True, use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – When group_output is set to True, use this parameter to rename groups of keywords.
- Returns
DataFrame of asymmetric approaching similarities between embeddings at different timedates.
- Return type
pd.DataFrame
- class compshs.semantics.DS[source]
Driving Strength between word embeddings.
\[DS(w,a) = \dfrac{asApp(w,a)}{|asApp(w,a)|+|asApp(w,a^{\prime})|}\]
Driving Strength is an asymmetric, time-aware, normalised measure indicating how much of the total approaching between two subcorpora is done by one side.
References
Soler, A. G., Labeau, M., & Clavel, C. (2023). Measuring lexico-semantic alignment in debates with contextualized word representations. In Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023) (pp. 50-63). Association for Computational Linguistics.
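Given two \(asApp\) values, the normalisation in \(DS\) is simple arithmetic, sketched below. Returning 0.0 when both values are zero is an assumption of this example, as the formula is undefined in that case; the function name is hypothetical.

```python
def driving_strength(asapp_a, asapp_a_prime):
    """DS(w, a) = asApp(w, a) / (|asApp(w, a)| + |asApp(w, a')|)."""
    total = abs(asapp_a) + abs(asapp_a_prime)
    if total == 0:
        # Assumption: no movement on either side yields a strength of 0.
        return 0.0
    return asapp_a / total


# Side a approached by 0.75 while side a' moved away by 0.25.
ds_a = driving_strength(0.75, -0.25)
ds_a_prime = driving_strength(-0.25, 0.75)
```

The absolute values of the two scores sum to 1, so each one reads as that side's share of the total movement, with the sign giving its direction.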
- transform(embeddings: list, timedates: Optional[ndarray] = None, attributes: Optional[ndarray] = None, group_embeddings: bool = True, time_pair: Optional[tuple] = None, groups: Optional[list] = None, group_names: Optional[list] = None) DataFrame[source]
Compute driving strength metric for all available keywords and attributes in contextual embeddings.
- Parameters
embeddings (list) – List of dictionaries of embeddings in the form of ContextualEmbedding.transform() output.
timedates (np.ndarray) – Array of time values.
attributes (np.ndarray) – Array of attribute values, which define the subcorpora.
group_embeddings (bool) – If True, group embeddings by timedate, attribute, keyword (default).
time_pair (tuple) – Tuple of timedates in str format (prev_time, curr_time).
groups (list) – Use this parameter to group multiple keywords together. Expected format is a list of lists of keywords.
group_names (list) – Use this parameter to rename groups of keywords.
- Returns
DataFrame of driving strength similarities between embeddings at different timedates.
- Return type
pd.DataFrame