compshs.embedding package
Subpackages
Submodules
compshs.embedding.contextual_embedding module
Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>
- class compshs.embedding.contextual_embedding.ContextualEmbedding(transformer, model_name, tokenizer, sentence_tokenizer)[source]
Bases: object
Contextual embedding.
- Parameters
transformer – Class object for the transformer model.
model_name – Name of the model to load.
tokenizer – Class object for tokenizer.
sentence_tokenizer – Class object for sentence tokenizer.
- encode_chunk(model, chunk, text, doc_id)[source]
Compute word embeddings for a chunk of text.
- Parameters
model – Embedding model.
chunk – Chunk of tokenized text.
text – Original textual context.
doc_id – Id of the document.
- Returns
- List of dictionaries containing:
doc id: id of the original document
word: word
position: position of word within original context
embedding: tensor representing the embedding of the word
context: textual context used for word embedding
- Return type
list
- extract_keyword_sentences(text, keywords, sentence_tokenizer)[source]
Extract context for keywords.
- Parameters
text (str) – Document from which contexts are extracted.
keywords (list of str) – List of keywords.
sentence_tokenizer – Sentence tokenizer.
- Returns
- Dictionary of contexts:
keys: keywords
values: List of textual contexts in which the corresponding keyword appears.
- Return type
dict
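The mapping from keywords to contexts can be illustrated with a minimal sketch. It is not the package's implementation: the helper names `split_sentences` and `extract_keyword_sentences_sketch` are hypothetical, the regex splitter is a naive stand-in for a real sentence tokenizer, and case-insensitive whole-word matching is an assumption.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (hypothetical stand-in for a real sentence tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_keyword_sentences_sketch(text, keywords, sentence_tokenizer=split_sentences):
    """Map each keyword to the sentences of `text` in which it appears."""
    sentences = sentence_tokenizer(text)
    contexts = {}
    for keyword in keywords:
        # Assumption: keywords are matched as whole words, ignoring case.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        contexts[keyword] = [s for s in sentences if pattern.search(s)]
    return contexts


text = "The bank raised rates. We walked along the river bank. Rates fell later."
contexts = extract_keyword_sentences_sketch(text, ["bank", "rates"])
```

Each keyword ends up with one list entry per sentence that mentions it, so a keyword that never occurs maps to an empty list in this sketch.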
- group_subword_embeddings(offset_mapping, embeddings, text, doc_id) list[source]
Average the embeddings of words that were split into several subword tokens.
- Parameters
offset_mapping (tensor) – Indicates the start and end positions of each token within the original text.
embeddings – Matrix of word embeddings.
text (str) – Original textual context.
doc_id (int) – Id of the document in which word embeddings were computed.
- Returns
- List of dictionaries containing:
doc id: id of the original document
word: word
position: position of word within original context
embedding: tensor representing the embedding of the word
context: textual context used for word embedding
- Return type
list
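One way to picture the averaging step is the sketch below. It makes several assumptions that the documentation does not confirm: tokens whose character spans touch (the end of one equals the start of the next) are treated as pieces of the same word, tokens with an empty span are treated as special tokens and skipped, `position` is taken to be the start character offset, and the dictionary key names are illustrative. Plain Python lists stand in for tensors.

```python
def group_subword_embeddings_sketch(offset_mapping, embeddings, text, doc_id):
    """Average the vectors of adjacent subword tokens into one vector per word."""
    groups = []  # each entry: [start, end, list of token vectors]
    for (start, end), vector in zip(offset_mapping, embeddings):
        if start == end:
            # Assumption: an empty span marks a special or padding token.
            continue
        if groups and groups[-1][1] == start:
            # The token starts where the previous one ended: same word.
            groups[-1][1] = end
            groups[-1][2].append(vector)
        else:
            groups.append([start, end, [vector]])

    results = []
    for start, end, vectors in groups:
        mean = [sum(values) / len(vectors) for values in zip(*vectors)]
        results.append({
            "doc_id": doc_id,       # hypothetical key name
            "word": text[start:end],
            "position": start,      # assumption: start character offset
            "embedding": mean,
            "context": text,
        })
    return results


text = "unbelievable news"
# Spans for: [CLS], "un", "##believ", "##able", "news", [SEP]
offsets = [(0, 0), (0, 2), (2, 8), (8, 12), (13, 17), (0, 0)]
vectors = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [0.0, 0.0]]
words = group_subword_embeddings_sketch(offsets, vectors, text, doc_id=0)
```

Note that this adjacency rule also merges punctuation attached to a word (for example a trailing period) into that word, so a real implementation would likely need an extra rule for punctuation.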
- process_doc(document, doc_id, keywords) dict[source]
- Processing a document consists of the following steps:
cleaning the document
splitting it into chunks (if necessary)
extracting the contexts of each keyword
embedding each keyword within each of its contexts
- Parameters
document (str) – Textual document.
doc_id (int) – Document id.
keywords (list) – List of keywords to search for and embed.
- Returns
Dictionary of embeddings with the following structure:
- keyword_0 (STR):
  - context_0 (STR):
    word_0 (STR): embedding (tensor)
    word_1 (STR): embedding (tensor)
    …
  - context_1 (STR):
    word_0 (STR): embedding (tensor)
    word_1 (STR): embedding (tensor)
    …
  …
- keyword_1 (STR):
  - context_0 (STR):
    word_0 (STR): embedding (tensor)
    word_1 (STR): embedding (tensor)
    …
  - context_1 (STR):
    word_0 (STR): embedding (tensor)
    word_1 (STR): embedding (tensor)
    …
  …
…
- Return type
dict
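To make the three nesting levels (keyword, context, word) concrete, here is a small sketch that walks a dictionary of that shape. The dictionary is hand-written for illustration: the sentences and numbers are made-up placeholders, short lists stand in for tensors, and `flatten_embeddings` is a hypothetical helper that is not part of the package.

```python
# Hand-written example of the nested structure: keyword -> context -> word -> embedding.
doc_embeddings = {
    "bank": {
        "The bank raised rates.": {"bank": [0.1, 0.2]},
        "We walked along the river bank.": {"bank": [0.7, 0.4]},
    },
    "rates": {
        "The bank raised rates.": {"rates": [0.3, 0.9]},
    },
}


def flatten_embeddings(doc_embeddings):
    """Turn the nested dictionary into a flat list of (keyword, context, word, embedding)."""
    rows = []
    for keyword, contexts in doc_embeddings.items():
        for context, words in contexts.items():
            for word, embedding in words.items():
                rows.append((keyword, context, word, embedding))
    return rows


rows = flatten_embeddings(doc_embeddings)
```

A flat list like this can be convenient when comparing the embeddings of one keyword across its different contexts.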
- tokenize_and_chunk(text, tokenizer, max_length=512, stride=256)[source]
Tokenize a text.
When the number of tokens exceeds max_length, the result is split into chunks.
- Parameters
text (str) – Original text.
tokenizer – Tokenizer.
max_length (int) – Maximal number of tokens.
stride (int) – Size of stride between token chunks. Useful in order not to lose contextual information when splitting tokens.
- Returns
List of \(n\) dictionaries of tokens, with \(n\) the number of chunks. Each dictionary contains:
input_ids: tensor of ids of tokens in model vocabulary
attention_mask: tensor of values indicating if a token is real (1) or comes from padding (0)
offset_mapping: tensor indicating the start and end positions of each token within the original text.
- Return type
list
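The chunking logic can be sketched on a plain list of token ids. This is an illustration under stated assumptions, not the method's actual code: `chunk_token_ids` is a hypothetical helper, and `stride` is interpreted here as the distance between the starts of consecutive chunks. With the defaults (max_length=512, stride=256) this means consecutive chunks share 256 tokens.

```python
def chunk_token_ids(token_ids, max_length=512, stride=256):
    """Split a list of token ids into overlapping windows of at most max_length."""
    if max_length <= 0 or stride <= 0 or stride > max_length:
        raise ValueError("expected 0 < stride <= max_length")
    if len(token_ids) <= max_length:
        return [list(token_ids)]

    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + max_length]))
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride  # assumption: stride is the step between chunk starts
    return chunks


chunks = chunk_token_ids(list(range(10)), max_length=4, stride=2)
```

Because each window starts before the previous one ends, a word near a chunk boundary still appears with some of its surrounding tokens in at least one chunk.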
- transform(corpus, keywords) list[source]
Perform contextual embedding of keywords within a corpus of documents.
- Parameters
corpus (list) – List of documents.
keywords (list) – List of keywords to search for and embed.
- Returns
A list of \(n\) embedding dictionaries, with \(n\) the number of documents in corpus.
- Return type
list
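The shape of the returned list (one embedding dictionary per document) can be sketched as a loop over the corpus. Everything here is hypothetical: `transform_sketch` and `fake_process_doc` are illustrative names, the stub only checks whether a keyword occurs in the document, and using the position in the corpus as the document id is an assumption.

```python
def transform_sketch(corpus, keywords, process_doc):
    """Apply a per-document function to every document and collect the results."""
    # Assumption: the document id is the index of the document in the corpus.
    return [process_doc(document, doc_id, keywords)
            for doc_id, document in enumerate(corpus)]


def fake_process_doc(document, doc_id, keywords):
    """Hypothetical stub: returns a placeholder vector for each keyword found."""
    return {
        keyword: {document: {keyword: [float(doc_id)]}}
        for keyword in keywords
        if keyword in document.lower()
    }


corpus = ["The bank raised rates.", "A quiet river."]
results = transform_sketch(corpus, ["bank", "river"], fake_process_doc)
```

The result has the same length and order as the corpus, so `results[i]` holds the embeddings found in `corpus[i]`.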
Module contents
embedding module