Text

Tools for textual analysis.

Preprocessing

class compshs.text.Preprocess(lang: str = 'en_core_web_sm', exclude_stop_words: bool = True, exclude_punctuation: bool = True, exclude_numbers: bool = True, lemmatize: bool = True, batch_size: int = 10, chunk_size: int = 500000)

Preprocessing of a corpus of documents.

Parameters

lang (str) – spaCy language model name (default = 'en_core_web_sm').
exclude_stop_words (bool) – If True, exclude stopwords (default).
exclude_punctuation (bool) – If True, exclude punctuation (default).
exclude_numbers (bool) – If True, exclude numbers (default).
lemmatize (bool) – If True, lemmatize tokens (default).
batch_size (int) – Number of documents to process in each batch (default = 10).
chunk_size (int) – Maximum length of a piece of text. Beyond this length, the document is divided into chunks (default = 500000).
nlp – spaCy model built from the lang parameter.
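
The effect of chunk_size can be pictured with the minimal sketch below. It assumes that an overlong document is cut at whitespace so that no chunk exceeds chunk_size characters. The helper name split_into_chunks is hypothetical, and the splitting rule actually used by Preprocess may differ.

```python
def split_into_chunks(text: str, chunk_size: int = 500000) -> list:
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper: cuts at the last space inside the current
    window when there is one, otherwise cuts exactly at the limit.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while len(text) - start > chunk_size:
        end = start + chunk_size
        # Search for a space strictly after start and no later than end.
        cut = text.rfind(" ", start + 1, end + 1)
        if cut == -1:
            cut = end  # no space found: hard cut at the limit
        chunks.append(text[start:cut])
        start = cut
    chunks.append(text[start:])
    return chunks
```

Joining the returned chunks gives back the original text, so no characters are lost by the split.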

fit(): Fit algorithm to the data.

transform(corpus: list) → list

Preprocess corpus:

remove stopwords
remove punctuation
remove numbers
extract lemmatized tokens
convert tokens to lowercase

Parameters: corpus (list) – List of documents.
Returns: List of preprocessed documents.
Return type: list
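
The steps listed above can be approximated without spaCy, as in the sketch below. It is a simplified stand-in, not the library's implementation: the stop-word list is a tiny illustrative one, lemmatization is left out because it needs a language model, and returning each document as a space-joined string is an assumption. The name simple_preprocess is hypothetical.

```python
import re

# Tiny illustrative stop-word list; spaCy models ship a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def simple_preprocess(corpus: list) -> list:
    """Lowercase each document and drop stop words, punctuation and numbers.

    Lemmatization is intentionally omitted (assumption: it would be
    handled by the spaCy model in the real class).
    """
    processed = []
    for doc in corpus:
        # \w+ keeps runs of word characters, which discards punctuation.
        tokens = re.findall(r"\w+", doc.lower())
        kept = [
            tok for tok in tokens
            if tok not in STOP_WORDS and not tok.isdigit()
        ]
        processed.append(" ".join(kept))
    return processed
```

For instance, simple_preprocess(["The 3 cats are on the mat!"]) keeps only the tokens "cats" and "mat".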

Frequency

class compshs.text.FrequencyCounter(vectorizer_name: str = 'tf', ngram_range: Tuple = (1, 1), analyzer: str = 'word', max_df: Union[float, int] = 1.0, min_df: Union[float, int] = 1)

Counter of frequencies over corpus.

Parameters

vectorizer_name (str) –
Vectorizer name.
- 'token', Count of token occurrences over documents in corpus.
- 'tfidf', Tf-idf count over documents in corpus.
ngram_range (Tuple) –
analyzer (str) –
max_df (float or int) –
min_df (float or int) –
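
To make the difference between raw token counts and tf-idf weights concrete, the sketch below reweights a small count matrix. It uses the textbook form tf × log(n_documents / df); this is an assumption for illustration, since the exact smoothing and normalization applied by the 'tfidf' option are not documented here. The function name tfidf_weights is hypothetical.

```python
import math


def tfidf_weights(counts: list) -> list:
    """Turn a document-term count matrix into tf-idf weights.

    Textbook variant: weight = count * log(n_documents / df).
    Libraries often add smoothing or row normalization on top.
    """
    n_documents = len(counts)
    n_tokens = len(counts[0]) if counts else 0
    # Document frequency: number of documents containing each token.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_tokens)]
    idf = [math.log(n_documents / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_tokens)] for row in counts]
```

A token that occurs in every document gets weight 0 under this variant, which is why tf-idf downplays ubiquitous words.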

fit(corpus: list) → FrequencyCounter

Fit algorithm to the corpus.

Parameters: corpus (list) – List of (preprocessed) documents.
Returns: self
Return type: FrequencyCounter

fit_transform(corpus: list, *args, **kwargs)

Fit and transform data.

Parameters: corpus (list) – List of (preprocessed) documents.
Returns: Tuple of token names and frequencies matrix.
Return type: tuple

get_token_names() → ndarray

Get token names.

Returns: Array of token names.
Return type: np.ndarray

transform(corpus: list) → Tuple

Compute frequencies over (preprocessed) corpus.

Parameters: corpus (list) – List of (preprocessed) documents.
Returns: Tuple of token names and frequencies matrix.
Return type: tuple
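
The shape of the returned tuple can be illustrated with a pure-Python sketch that counts whitespace-separated tokens. The function count_tokens is hypothetical. It assumes unigrams only and a sorted vocabulary, and it returns plain lists where the class presumably returns arrays.

```python
from collections import Counter


def count_tokens(corpus: list) -> tuple:
    """Return (token_names, matrix) of token counts per document.

    matrix[i][j] is the number of times token_names[j] occurs in corpus[i].
    """
    tokenized = [doc.split() for doc in corpus]
    # A sorted vocabulary gives a stable column order.
    token_names = sorted({tok for doc in tokenized for tok in doc})
    column = {tok: j for j, tok in enumerate(token_names)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(token_names)
        for tok, n in Counter(doc).items():
            row[column[tok]] = n
        matrix.append(row)
    return token_names, matrix
```

Each row of the matrix corresponds to one document and each column to one entry of the token names.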

Feature selection

class compshs.text.FeatureSelection

Feature selection.

get_df_from_corpus(corpus: list, attributes: list) → DataFrame

Convert a list of documents with attribute information into a pandas DataFrame object.

Parameters

corpus (list) – List of documents.
attributes (list) – List of attributes, with the same length as corpus.

Returns

DataFrame with corpus information.

Return type

pd.DataFrame
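
The pairing of documents with attributes can be sketched without pandas as a list of records, one per document. The keys 'document' and 'attribute' are hypothetical; the column names used by get_df_from_corpus are not documented here.

```python
def corpus_to_records(corpus: list, attributes: list) -> list:
    """Pair each document with its attribute, one record per document.

    The keys 'document' and 'attribute' are hypothetical column names.
    """
    if len(corpus) != len(attributes):
        raise ValueError("corpus and attributes must have the same length")
    return [
        {"document": doc, "attribute": attr}
        for doc, attr in zip(corpus, attributes)
    ]
```

A list of records like this can be passed directly to the pandas DataFrame constructor to obtain a table with one row per document.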

spacy_doc_from_txt(txt: str, input_type: str = 'words') → Doc

Create a spaCy Doc from text content.

Note: Words of length ≤ 2 are filtered out.

Parameters

txt (str) – Text content to convert.
input_type (str) –
- 'words': txt contains only words separated by whitespaces. Useful in case of preprocessed text.
- 'sentences': txt contains sentences separated by commas. Useful in case of raw text.

Return type

spaCy Doc.
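
The two input types and the length filter can be sketched in plain Python as follows. The function split_and_filter is hypothetical: it returns lists of words rather than a spaCy Doc, and it follows the description above literally, splitting 'sentences' input on commas.

```python
def split_and_filter(txt: str, input_type: str = "words") -> list:
    """Split txt according to input_type and drop words of length <= 2.

    Returns one list of words per segment; empty segments are skipped.
    """
    if input_type == "words":
        segments = [txt]
    elif input_type == "sentences":
        segments = txt.split(",")
    else:
        raise ValueError("input_type must be 'words' or 'sentences'")
    result = []
    for segment in segments:
        words = [w for w in segment.split() if len(w) > 2]
        if words:
            result.append(words)
    return result
```

With input_type='words' the whole text is treated as a single segment, which suits text that has already been preprocessed.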

transform(corpus: list, attributes: list, max_tokens: int = 2000, input_type: str = 'words')

Transform corpus of documents into scattertext format using attribute information.

Parameters

corpus (list) – List of documents.
attributes (list) – List of attributes, with the same length as corpus.
max_tokens (int) – Maximum number of tokens to keep.
input_type (str) –
- 'words': each document contains only words separated by whitespaces. Useful in case of preprocessed text.
- 'sentences': each document contains sentences separated by commas. Useful in case of raw text.

Return type

scattertext corpus.

Topic modelling

class compshs.text.TopicModeler(model_name: str = 'LDA', n_components: int = 10)

Topic modeler.

Parameters

model_name (str) –
Model name.
- 'LDA', Latent Dirichlet Allocation.
- 'NMF', Non-Negative Matrix Factorization.
n_components (int) – Number of topics.
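
To give an idea of what the 'NMF' option computes, the sketch below factorizes a small document-term matrix with the classic multiplicative update rules. It is a pure-Python illustration under stated assumptions, not the solver used by TopicModeler: all function names are hypothetical, matrices are plain lists of rows, and LDA is not covered.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius distance between v and the product of w and h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize v (n_documents x n_words) into w (n_documents x k) and h (k x n_words)."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + eps for _ in range(n_components)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_words)] for _ in range(n_components)]
    for _ in range(n_iter):
        # Update h elementwise: h * (w^T v) / (w^T w h).
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_words)]
             for a in range(n_components)]
        # Update w elementwise: w * (v h^T) / (w h h^T).
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(n_components)]
             for i in range(n_docs)]
    return w, h
```

In this picture w plays the role of the document topic matrix (n_documents, n_components) and h the role of the word contributions (n_components, n_words).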

fit(matrix: Union[csr_matrix, ndarray]) → TopicModeler

Fit algorithm to the document term matrix.

Parameters: matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).
Returns: self
Return type: TopicModeler

fit_transform(matrix: Union[csr_matrix, ndarray], *args, **kwargs) → Tuple

Fit and transform data.

Parameters: matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).
Returns: Tuple of topic names and document topic distribution matrix (n_documents, n_components).
Return type: tuple

get_word_contributions() → ndarray: Get the matrix of word contributions to each topic, of shape (n_components, n_words).
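
A common use of such a matrix is to list the most characteristic words of each topic. The sketch below does this for a matrix given as a list of rows. The function top_words and its n_top parameter are hypothetical, and the token names are assumed to be in the same column order as the matrix.

```python
def top_words(contributions: list, token_names: list, n_top: int = 3) -> list:
    """For each topic (row), list the n_top tokens with the largest contribution."""
    result = []
    for row in contributions:
        # Column indices sorted by decreasing contribution.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([token_names[j] for j in ranked[:n_top]])
    return result
```

The result has one list of tokens per topic, ordered from the strongest contribution to the weakest.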

transform(matrix: Union[csr_matrix, ndarray]) → Tuple

Transform data according to the fitted model.

Parameters: matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).
Returns: Tuple of topic names and document topic distribution matrix (n_documents, n_components).
Return type: tuple
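
Given a tuple of this form, one simple follow-up is to assign each document its highest-weighted topic. The sketch below assumes the matrix is available as a list of rows. The function dominant_topics is hypothetical, and the topic labels in the usage note are made up for illustration.

```python
def dominant_topics(topic_names: list, doc_topic: list) -> list:
    """Return, for each document (row), the name of its highest-weighted topic."""
    return [
        topic_names[max(range(len(row)), key=lambda k: row[k])]
        for row in doc_topic
    ]
```

For example, with topic names ["topic_0", "topic_1"] and rows [0.9, 0.1] and [0.2, 0.8], the first document is assigned "topic_0" and the second "topic_1".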