compshs.embedding package

Subpackages

compshs.embedding.tests package

Submodules

compshs.embedding.contextual_embedding module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.embedding.contextual_embedding.ContextualEmbedding(transformer, model_name, tokenizer, sentence_tokenizer)[source]

Bases: object

Contextual embedding.

Parameters

transformer – Class object for transformer model.
model_name – Name of the model to load.
tokenizer – Class object for tokenizer.
sentence_tokenizer – Class object for sentence tokenizer.

clean_text(text)[source]

Clean text before contextual embedding.

clean_word(word)[source]

encode_chunk(model, chunk, text, doc_id)[source]

Compute word embeddings for a chunk of text.

Parameters

model – Embedding model.
chunk – Chunk of tokenized text.
text – Original textual context.
doc_id – Id of the document.

Returns

List of dictionaries containing:

doc id: id of the original document
word: word
position: position of word within original context
embedding: tensor representing the embedding of the word
context: textual context used for word embedding

Return type

list

extract_keyword_sentences(text, keywords, sentence_tokenizer)[source]

Extract context for keywords.

Parameters

text (str) – Document from which contexts are extracted.
keywords (list of str) – List of keywords.
sentence_tokenizer – Sentence tokenizer.

Returns

Dictionary of contexts:

keys: keywords
values: List of textual contexts in which the corresponding keyword appears.

Return type

dict
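As a rough illustration of the return structure described above, the sketch below maps each keyword to the sentences in which it occurs. It is not the package's implementation: `simple_sentence_tokenizer` and `extract_contexts` are hypothetical names, the sentence tokenizer is assumed to be a plain callable from a string to a list of sentences, and matching is assumed to be case-insensitive on whole words.

```python
import re


def simple_sentence_tokenizer(text):
    """Hypothetical stand-in: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_contexts(text, keywords, sentence_tokenizer):
    """Map each keyword to the list of sentences that contain it."""
    sentences = sentence_tokenizer(text)
    contexts = {}
    for keyword in keywords:
        # Assumption: case-insensitive, whole-word matching.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        contexts[keyword] = [s for s in sentences if pattern.search(s)]
    return contexts


text = "The bank raised rates. We sat on the river bank. Rates may fall."
contexts = extract_contexts(text, ["bank", "rates"], simple_sentence_tokenizer)
```

A keyword that never occurs in the text maps to an empty list in this sketch; whether the actual method keeps or drops such keys is not stated in this reference.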

group_subword_embeddings(offset_mapping, embeddings, text, doc_id) → list[source]

Average the embeddings of words that were split into several subword tokens.

Parameters

offset_mapping (tensor) – Indicates the start and end positions of each token within the original text.
embeddings – Matrix of word embeddings
text (str) – Original textual context.
doc_id (int) – Id of the document in which word embeddings were computed.

Returns

List of dictionaries containing:

doc id: id of the original document
word: word
position: position of word within original context
embedding: tensor representing the embedding of the word
context: textual context used for word embedding

Return type

list
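The averaging step can be sketched in plain Python, under several assumptions that this reference does not confirm: tokens whose character spans touch (one starts where the previous one ends) belong to the same word, special tokens carry an empty span, `position` is the index of the word within the context, and the dictionary key names are illustrative. `group_subwords` is a hypothetical helper, and lists of floats stand in for tensors.

```python
def group_subwords(offset_mapping, embeddings, text, doc_id):
    """Average token embeddings that belong to the same word."""
    groups = []  # each item: [start, end, list of token vectors]
    for (start, end), vector in zip(offset_mapping, embeddings):
        if start == end:
            # Assumption: special tokens such as [CLS] have an empty span.
            continue
        if groups and start == groups[-1][1]:
            # Token starts where the previous one ended: same word.
            groups[-1][1] = end
            groups[-1][2].append(vector)
        else:
            groups.append([start, end, [vector]])

    results = []
    for position, (start, end, vectors) in enumerate(groups):
        # Element-wise mean over the subword vectors of this word.
        mean = [sum(dim) / len(vectors) for dim in zip(*vectors)]
        results.append({
            "doc_id": doc_id,          # key names are assumptions
            "word": text[start:end],
            "position": position,
            "embedding": mean,
            "context": text,
        })
    return results


text = "unhappy cat"
# Spans for the tokens: [CLS], "un", "##happy", "cat", [SEP]
offsets = [(0, 0), (0, 2), (2, 7), (8, 11), (0, 0)]
vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [4.0, 4.0], [9.0, 9.0]]
words = group_subwords(offsets, vectors, text, doc_id=0)
```

One limitation of this adjacency rule is that punctuation attached to a word (for example a trailing comma) would be merged into it; a real implementation may handle that case differently.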

process_doc(document, doc_id, keywords) → dict[source]

Processing a document consists of:

cleaning the document
splitting it into chunks (if necessary)
extracting the contexts of each keyword
embedding each keyword according to each of its contexts

Parameters

document (str) – Textual document.
doc_id (int) – Document id.
keywords (list) – List of keywords to search and embed.

Returns

Dictionary of embeddings with the following structure:

keyword_0 (str):
    context_0 (str):
        word_0 (str): embedding (tensor)
        word_1 (str): embedding (tensor)
        …
    context_1 (str):
        word_0 (str): embedding (tensor)
        word_1 (str): embedding (tensor)
        …
    …
keyword_1 (str):
    context_0 (str):
        word_0 (str): embedding (tensor)
        word_1 (str): embedding (tensor)
        …
    context_1 (str):
        word_0 (str): embedding (tensor)
        word_1 (str): embedding (tensor)
        …
    …

Return type

dict
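To make the nested return structure concrete, the sketch below walks a hand-built dictionary of the shape keyword → context → word → embedding. The helper `iter_embeddings` is hypothetical, the data is invented for illustration, and plain lists stand in for tensors.

```python
def iter_embeddings(doc_embeddings):
    """Yield (keyword, context, word, embedding) from the nested dictionary."""
    for keyword, contexts in doc_embeddings.items():
        for context, words in contexts.items():
            for word, embedding in words.items():
                yield keyword, context, word, embedding


# Hand-built toy result; values are illustrative, not real model output.
doc_embeddings = {
    "bank": {
        "We sat on the river bank.": {"river": [0.1, 0.2], "bank": [0.3, 0.4]},
        "The bank raised rates.": {"bank": [0.5, 0.6]},
    },
}
rows = list(iter_embeddings(doc_embeddings))
```

Flattening the result this way can be convenient when the embeddings are to be stored in a table with one row per word occurrence.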

tokenize_and_chunk(text, tokenizer, max_length=512, stride=256)[source]

Tokenize a text.

When the number of tokens exceeds max_length, the result is split into chunks.

Parameters

text (str) – Original text.
tokenizer – Tokenizer.
max_length (int) – Maximal number of tokens.
stride (int) – Size of the stride between token chunks. Helps preserve contextual information when the tokens are split.

Returns

List of \(n\) dictionaries of tokens, with \(n\) the number of chunks. Each dictionary contains:

input_ids: tensor of token ids in the model vocabulary
attention_mask: tensor of values indicating whether a token is real (1) or comes from padding (0)
offset_mapping: tensor indicating the start and end positions of each token within the original text

Return type

list
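The chunking behaviour can be sketched as a sliding window over a list of token ids. This is a simplified illustration rather than the package's code: `chunk_token_ids` is a hypothetical helper, it ignores special tokens and padding, and it assumes that `stride` is the step between the starts of two successive chunks. With the defaults (max_length=512, stride=256), consecutive chunks then share half of their tokens.

```python
def chunk_token_ids(token_ids, max_length=512, stride=256):
    """Split token ids into windows of at most max_length tokens."""
    if max_length <= 0 or stride <= 0:
        raise ValueError("max_length and stride must be positive")
    if len(token_ids) <= max_length:
        # Short inputs fit in a single chunk.
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + max_length]))
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride  # assumption: stride is the step between chunk starts
    return chunks


chunks = chunk_token_ids(list(range(10)), max_length=4, stride=2)
```

Because each window starts `stride` tokens after the previous one, a word near the end of one chunk also appears, with more right-hand context, near the middle of the next.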

transform(corpus, keywords) → list[source]

Perform contextual embedding of keywords within a corpus of documents.

Parameters

corpus (list) – List of documents.
keywords (list) – List of keywords to search and embed.

Returns

A list of \(n\) embedding dictionaries, with \(n\) the number of documents in corpus.

Return type

list

Module contents

embedding module