compshs.embedding package

Subpackages

Submodules

compshs.embedding.contextual_embedding module

Created in 2025 @author: Simon Delarue <simon.delarue@telecom-paris.fr>

class compshs.embedding.contextual_embedding.ContextualEmbedding(transformer, model_name, tokenizer, sentence_tokenizer)[source]

Bases: object

Contextual embedding.

Parameters
  • transformer – Class object for transformer model.

  • model_name – Name of the model to load.

  • tokenizer – Class object for tokenizer.

  • sentence_tokenizer – Class object for sentence tokenizer.

clean_text(text)[source]

Text cleaning for contextual embedding use.

clean_word(word)[source]

encode_chunk(model, chunk, text, doc_id)[source]

Compute word embeddings for a chunk of text.

Parameters
  • model – Embedding model.

  • chunk – Chunk of tokenized text.

  • text – Original textual context.

  • doc_id – Id of the document.

Returns

List of dictionaries containing:
  • doc id: id of the original document

  • word: word

  • position: position of word within original context

  • embedding: tensor representing the embedding of the word

  • context: textual context used for word embedding

Return type

list

extract_keyword_sentences(text, keywords, sentence_tokenizer)[source]

Extract context for keywords.

Parameters
  • text (str) – Document from which contexts are extracted.

  • keywords (list of str) – List of keywords.

  • sentence_tokenizer – Sentence tokenizer.

Returns

Dictionary of contexts:
  • keys: keywords

  • values: List of textual contexts in which the corresponding keyword appears.

Return type

dict
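
To illustrate the returned structure, here is a minimal sketch, not the package's implementation. It assumes that a context is a sentence containing the keyword as a whole word (case-insensitive). The names `simple_sentence_tokenizer` and `extract_keyword_sentences_sketch` are hypothetical, and the regex splitter only stands in for a real sentence tokenizer.

```python
import re


def simple_sentence_tokenizer(text):
    """Hypothetical stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_keyword_sentences_sketch(text, keywords, sentence_tokenizer):
    """Map each keyword to the sentences in which it appears as a whole word."""
    sentences = sentence_tokenizer(text)
    contexts = {}
    for keyword in keywords:
        # Assumption: whole-word, case-insensitive matching.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        contexts[keyword] = [s for s in sentences if pattern.search(s)]
    return contexts


text = "The bank raised rates. We walked along the river bank. Rates fell again."
contexts = extract_keyword_sentences_sketch(
    text, ["bank", "rates"], simple_sentence_tokenizer
)
```

A keyword that appears in no sentence maps to an empty list in this sketch; whether the package keeps or drops such keys is not stated in the documentation.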

group_subword_embeddings(offset_mapping, embeddings, text, doc_id) → list[source]

Average the embeddings of words that were split into several subword tokens.

Parameters
  • offset_mapping (tensor) – Indicates the start and end positions of each token within the original text.

  • embeddings – Matrix of embeddings, one row per token.

  • text (str) – Original textual context.

  • doc_id (int) – Id of the document in which word embeddings were computed.

Returns

List of dictionaries containing:
  • doc id: id of the original document

  • word: word

  • position: position of word within original context

  • embedding: tensor representing the embedding of the word

  • context: textual context used for word embedding

Return type

list
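
The averaging step can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's code: words are taken to be whitespace-delimited, `position` is taken to be the word index within the context, plain lists stand in for tensors, and the dictionary keys (`doc_id`, `word`, ...) are guesses based on the field names listed above.

```python
import re


def group_subword_embeddings_sketch(offset_mapping, embeddings, text, doc_id):
    """Average token vectors that fall inside the same whitespace-delimited word."""
    results = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        start, end = match.span()
        # Keep tokens whose character span lies inside this word.
        # Tokens with an empty span (e.g. special tokens) are skipped.
        vectors = [
            vec
            for (tok_start, tok_end), vec in zip(offset_mapping, embeddings)
            if tok_start < tok_end and start <= tok_start and tok_end <= end
        ]
        if not vectors:
            continue
        dim = len(vectors[0])
        mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        results.append({
            "doc_id": doc_id,       # assumed key name
            "word": match.group(),
            "position": position,   # assumed to be the word index
            "embedding": mean,
            "context": text,
        })
    return results


text = "unbelievable news"
# "unbelievable" is split into two subword tokens: "un" (0, 2) and "believable" (2, 12).
offsets = [(0, 0), (0, 2), (2, 12), (13, 17), (0, 0)]
vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [4.0, 4.0], [9.0, 9.0]]
words = group_subword_embeddings_sketch(offsets, vectors, text, doc_id=0)
```

Here the two subword vectors of "unbelievable" are averaged into a single vector, while "news" keeps its single token vector unchanged.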

process_doc(document, doc_id, keywords) → dict[source]

Processing a document consists of:
  • cleaning the document

  • splitting it into chunks (if necessary)

  • extracting the contexts of each keyword

  • embedding each keyword according to each specific context

Parameters
  • document (str) – Textual document.

  • keywords (list) – List of keywords to search and embed.

  • doc_id (int) – Document id.

Returns

Dictionary of embeddings with the following structure:

keyword_0 (STR):
    context_0 (STR):
        word_0 (STR): embedding (tensor)
        word_1 (STR): embedding (tensor)
        …
    context_1 (STR):
        word_0 (STR): embedding (tensor)
        word_1 (STR): embedding (tensor)
        …
keyword_1 (STR):
    context_0 (STR):
        word_0 (STR): embedding (tensor)
        word_1 (STR): embedding (tensor)
        …
    context_1 (STR):
        word_0 (STR): embedding (tensor)
        word_1 (STR): embedding (tensor)
        …

Return type

dict
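
To make the nesting concrete, the sketch below builds a small dictionary with the keyword → context → word shape described for the return value and flattens it into rows. The helper `flatten_embeddings` is hypothetical, plain lists stand in for tensors, and all keywords, contexts and values are made up.

```python
def flatten_embeddings(doc_embeddings):
    """Turn {keyword: {context: {word: embedding}}} into a list of flat rows."""
    rows = []
    for keyword, contexts in doc_embeddings.items():
        for context, words in contexts.items():
            for word, embedding in words.items():
                rows.append((keyword, context, word, embedding))
    return rows


# Made-up example; lists replace tensors.
doc_embeddings = {
    "bank": {
        "The bank raised rates.": {"bank": [0.1, 0.2], "rates": [0.3, 0.4]},
        "We sat on the river bank.": {"bank": [0.5, 0.6]},
    },
}
rows = flatten_embeddings(doc_embeddings)
```

A flat list of rows like this is often easier to filter or load into a table than the nested dictionary.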

tokenize_and_chunk(text, tokenizer, max_length=512, stride=256)[source]

Tokenize a text.

When the number of tokens exceeds max_length, the result is split into chunks.

Parameters
  • text (str) – Original text.

  • tokenizer – Tokenizer.

  • max_length (int) – Maximum number of tokens per chunk.

  • stride (int) – Size of the stride between token chunks. Helps to avoid losing contextual information when the tokens are split.

Returns

List of \(n\) dictionaries of tokens, with \(n\) the number of chunks. Each dictionary contains:

  • input_ids: tensor of ids of tokens in model vocabulary

  • attention_mask: tensor of values indicating if a token is real (1) or comes from padding (0)

  • offset_mapping: tensor indicating the start and end positions of each token within the original text.

Return type

list
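
The chunking idea can be sketched on a plain list of token ids. This is an illustration only, not the package's implementation: the helper `chunk_token_ids` is hypothetical, and it assumes that `stride` is the step between the starts of consecutive chunks, so that consecutive chunks overlap by `max_length - stride` tokens.

```python
def chunk_token_ids(token_ids, max_length=512, stride=256):
    """Split a list of token ids into windows of at most max_length tokens."""
    if max_length <= 0 or stride <= 0:
        raise ValueError("max_length and stride must be positive")
    if len(token_ids) <= max_length:
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + max_length]))
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride  # assumption: stride is the step between chunk starts
    return chunks


# 1000 fake token ids produce three overlapping chunks with the default settings.
chunks = chunk_token_ids(list(range(1000)))
```

With the defaults, each chunk shares its first 256 tokens with the end of the previous chunk, so a word near a chunk boundary still appears with surrounding context in at least one chunk.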

transform(corpus, keywords) → list[source]

Perform contextual embedding of keywords within a corpus of documents.

Parameters
  • corpus (list) – List of documents.

  • keywords (list) – List of keywords to search and embed.

Returns

A list of \(n\) embedding dictionaries, with \(n\) the number of documents in corpus.

Return type

list

Module contents

embedding module