Text

Tools for textual analysis.

Preprocessing

class compshs.text.Preprocess(lang: str = 'en_core_web_sm', exclude_stop_words: bool = True, exclude_punctuation: bool = True, exclude_numbers: bool = True, lemmatize: bool = True, batch_size: int = 10, chunk_size: int = 500000)[source]

Preprocessing of a corpus of documents.

Parameters
  • lang (str) – spaCy language model name (default = 'en_core_web_sm').

  • exclude_stop_words (bool) – If True, exclude stopwords (default).

  • exclude_punctuation (bool) – If True, exclude punctuation (default).

  • exclude_numbers (bool) – If True, exclude numbers (default).

  • lemmatize (bool) – If True, lemmatize tokens (default).

  • batch_size (int) – Number of documents to process in each batch (default = 10).

  • chunk_size (int) – Maximum length of a piece of text. Beyond this length, the document is divided into chunks (default = 500000).

  • nlp – spaCy model built from the lang parameter.

fit()[source]

Fit algorithm to the data.

transform(corpus: list) → list[source]
Preprocess the corpus by applying the following steps (the first four are controlled by the constructor options):
  • remove stopwords

  • remove punctuation

  • remove numbers

  • extract lemmatized tokens

  • convert tokens to lowercase

Parameters

corpus (list) – List of documents.

Returns

List of preprocessed documents.

Return type

list
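The steps listed above can be sketched without spaCy. This is only an illustration under stated assumptions: the function `simple_preprocess` and the tiny `STOP_WORDS` set are hypothetical, lemmatization is left out because it needs a language model, and returning one whitespace-joined string per document is an assumption (the reference above only says "list of preprocessed documents").

```python
import re

# Hypothetical, deliberately tiny stop-word list; spaCy models ship far larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are", "to"}


def simple_preprocess(corpus, exclude_stop_words=True,
                      exclude_punctuation=True, exclude_numbers=True):
    """Return one whitespace-joined string of lowercase tokens per document."""
    processed = []
    for doc in corpus:
        # Lowercase, then split into word tokens and single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", doc.lower())
        kept = []
        for tok in tokens:
            if exclude_punctuation and not any(ch.isalnum() for ch in tok):
                continue
            if exclude_numbers and tok.isdigit():
                continue
            if exclude_stop_words and tok in STOP_WORDS:
                continue
            kept.append(tok)
        processed.append(" ".join(kept))
    return processed


corpus = ["The 3 cats are sleeping on the sofa.", "Dogs bark, and cats purr!"]
print(simple_preprocess(corpus))
# ['cats sleeping sofa', 'dogs bark cats purr']
```

With a real spaCy model, "sleeping" and "cats" would additionally be reduced to their lemmas when lemmatize is True.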

Frequency

class compshs.text.FrequencyCounter(vectorizer_name: str = 'tf', ngram_range: Tuple = (1, 1), analyzer: str = 'word', max_df: Union[float, int] = 1.0, min_df: Union[float, int] = 1)[source]

Counter of frequencies over corpus.

Parameters
  • vectorizer_name (str) –

    Vectorizer name.
    • 'token', Count of token occurrences over documents in corpus.

    • 'tfidf', Tf-idf count over documents in corpus.

  • ngram_range (Tuple) –

  • analyzer (str) –

  • max_df (float or int) –

  • min_df (float or int) –

fit(corpus: list) → FrequencyCounter[source]

Fit algorithm to the corpus.

Parameters

corpus (list) – List of (preprocessed) documents.

Returns

self

Return type

FrequencyCounter

fit_transform(corpus: list, *args, **kwargs)[source]

Fit and transform data.

Parameters

corpus (list) – List of (preprocessed) documents.

Returns

Tuple of token names and frequencies matrix.

Return type

tuple

get_token_names() → ndarray[source]

Get token names.

Returns

Array of token names.

Return type

np.ndarray

transform(corpus: list) → Tuple[source]

Compute frequencies over (preprocessed) corpus.

Parameters

corpus (list) – List of (preprocessed) documents.

Returns

Tuple of token names and frequencies matrix.

Return type

tuple
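To make the "(token names, frequencies matrix)" return value concrete, here is a small pure-Python sketch. It is not the class's implementation: `count_frequencies` is a hypothetical helper, documents are assumed to be whitespace-separated strings, and the tf-idf variant uses the textbook weight log(N / df), whereas libraries often apply smoothing or normalization.

```python
import math
from collections import Counter


def count_frequencies(corpus, use_tfidf=False):
    """Return (token_names, matrix) with one row per document."""
    tokenized = [doc.split() for doc in corpus]
    token_names = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    # Raw term counts: rows are documents, columns follow token_names.
    matrix = [[float(c[tok]) for tok in token_names] for c in counts]
    if use_tfidf:
        n_docs = len(corpus)
        # Document frequency: number of documents containing each token.
        df = [sum(1 for c in counts if tok in c) for tok in token_names]
        idf = [math.log(n_docs / d) for d in df]
        matrix = [[tf * w for tf, w in zip(row, idf)] for row in matrix]
    return token_names, matrix


corpus = ["cats sleeping sofa", "dogs bark cats purr"]
names, tf = count_frequencies(corpus)
_, tfidf = count_frequencies(corpus, use_tfidf=True)
print(names)  # ['bark', 'cats', 'dogs', 'purr', 'sleeping', 'sofa']
print(tf[0])  # [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
```

In this toy corpus "cats" occurs in every document, so its tf-idf weight is zero, while tokens unique to one document keep a positive weight.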

Feature selection

class compshs.text.FeatureSelection[source]

Feature selection.

get_df_from_corpus(corpus: list, attributes: list) → DataFrame[source]

Convert a list of documents with attribute information into a pandas DataFrame object.

Parameters
  • corpus (list) – List of documents.

  • attributes (list) – List of attributes (same length as corpus).

Returns

DataFrame with corpus information.

Return type

pd.DataFrame

spacy_doc_from_txt(txt: str, input_type: str = 'words') → Doc[source]

Create a spaCy Doc from text content.

Note: Words of length ≤ 2 are filtered out.

Parameters
  • txt (str) – Text content to convert.

  • input_type (str) –

    • 'words': txt contains only words separated by whitespace. Useful for preprocessed text.

    • 'sentences': txt contains sentences separated by commas. Useful for raw text.

Return type

spaCy Doc
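The length filter mentioned in the note can be illustrated for the 'words' input type. This is a sketch only: `keep_long_words` is a hypothetical helper, and the actual method builds a spaCy Doc rather than returning a plain list.

```python
def keep_long_words(txt, min_length=3):
    """Drop whitespace-separated words shorter than min_length characters."""
    return [word for word in txt.split() if len(word) >= min_length]


print(keep_long_words("an ox ate the big apple"))
# ['ate', 'the', 'big', 'apple']
```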

transform(corpus: list, attributes: list, max_tokens: int = 2000, input_type: str = 'words')[source]

Transform corpus of documents into scattertext format using attribute information.

Parameters
  • corpus (list) – List of documents.

  • attributes (list) – List of attributes (same length as corpus).

  • max_tokens (int) – Maximum number of tokens to keep.

  • input_type (str) –

    • 'words': each document contains only words separated by whitespace. Useful for preprocessed text.

    • 'sentences': each document contains sentences separated by commas. Useful for raw text.

Return type

scattertext corpus.

Topic modelling

class compshs.text.TopicModeler(model_name: str = 'LDA', n_components: int = 10)[source]

Topic modeler.

Parameters
  • model_name (str) –

    Model name.
    • 'LDA', Latent Dirichlet Allocation.

    • 'NMF', Non-Negative Matrix Factorization.

  • n_components (int) – Number of topics.

fit(matrix: Union[csr_matrix, ndarray]) → TopicModeler[source]

Fit algorithm to the document term matrix.

Parameters

matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).

Returns

self

Return type

TopicModeler

fit_transform(matrix: Union[csr_matrix, ndarray], *args, **kwargs) → Tuple[source]

Fit and transform data.

Parameters

matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).

Returns

Tuple of topic names and document topic distribution matrix (n_documents, n_components).

Return type

tuple

get_word_contributions() → ndarray[source]

Get the matrix of word contributions to each topic, of shape (n_components, n_words).
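A matrix of shape (n_components, n_words) is usually read row by row: the largest entries in a row point to the words that characterize that topic. The sketch below assumes the columns are aligned with the token names from the frequency step; `top_words` and the sample values are hypothetical.

```python
def top_words(contributions, token_names, k=3):
    """Return the k highest-weighted token names for each topic (row)."""
    result = []
    for row in contributions:
        # Column indices sorted by decreasing contribution.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([token_names[j] for j in ranked[:k]])
    return result


token_names = ["bark", "cats", "dogs", "purr", "sofa"]
contributions = [
    [0.1, 0.7, 0.0, 0.5, 0.9],  # topic 0
    [0.8, 0.2, 0.9, 0.1, 0.0],  # topic 1
]
print(top_words(contributions, token_names, k=2))
# [['sofa', 'cats'], ['dogs', 'bark']]
```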

transform(matrix: Union[csr_matrix, ndarray]) → Tuple[source]

Transform data according to the fitted model.

Parameters

matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).

Returns

Tuple of topic names and document topic distribution matrix (n_documents, n_components).

Return type

tuple
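To show how the shapes fit together, here is a minimal pure-Python sketch of non-negative matrix factorization using the classic multiplicative updates. It is not necessarily what TopicModeler does internally; the helper names, the random initialization, and the iteration count are assumptions made for illustration. W plays the role of the document-topic matrix (n_documents, n_components) and H that of the word contributions (n_components, n_words); the rows of W are not normalized to sum to one.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (n_documents x n_words) into non-negative W and H."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(n_components)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_components)]
             for i in range(n_docs)]
    return w, h


# Toy document-term matrix: two documents per vocabulary block.
v = [[2.0, 1.0, 0.0, 0.0],
     [3.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 2.0],
     [0.0, 0.0, 2.0, 3.0]]
w, h = nmf(v, n_components=2)
print(len(w), len(w[0]))  # document-topic shape: 4 2
print(len(h), len(h[0]))  # topic-word shape: 2 4
```

The multiplicative form keeps every entry non-negative as long as the initial values are, which is why no explicit clipping is needed.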