Text
Tools for textual analysis.
Preprocessing
- class compshs.text.Preprocess(lang: str = 'en_core_web_sm', exclude_stop_words: bool = True, exclude_punctuation: bool = True, exclude_numbers: bool = True, lemmatize: bool = True, batch_size: int = 10, chunk_size: int = 500000)[source]
Preprocessing of a corpus of documents.
- Parameters
lang (str) – spaCy language model name ('en_core_web_sm').
exclude_stop_words (bool) – If True, exclude stopwords (default).
exclude_punctuation (bool) – If True, exclude punctuation (default).
exclude_numbers (bool) – If True, exclude numbers (default).
lemmatize (bool) – If True, lemmatize tokens (default).
batch_size (int) – Number of documents to process in each batch (default = 10).
chunk_size (int) – Maximum length of a piece of text. Beyond this length, the document is divided into chunks (default = 500000).
nlp – spaCy model built upon the lang parameter.
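The chunk_size behaviour described above can be pictured with a small standalone sketch. It is not the library's implementation: split_into_chunks is a hypothetical helper, and it assumes that length is measured in characters and that cuts fall on raw character offsets, neither of which is stated in this reference.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 500000) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Hypothetical helper (assumption): it cuts on raw character offsets,
    so a word may be split in two. The library may choose its chunk
    boundaries differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 12-character document cut into pieces of at most 5 characters.
document = "a" * 12
chunks = split_into_chunks(document, chunk_size=5)
print([len(chunk) for chunk in chunks])  # [5, 5, 2]
```

Joining the chunks back together yields the original text, so no content is lost by the split itself.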
Frequency
- class compshs.text.FrequencyCounter(vectorizer_name: str = 'tf', ngram_range: Tuple = (1, 1), analyzer: str = 'word', max_df: Union[float, int] = 1.0, min_df: Union[float, int] = 1)[source]
Counter of frequencies over corpus.
- Parameters
vectorizer_name (str) – Vectorizer name.
'token', Count of token occurrences over documents in corpus.
'tfidf', Tf-idf count over documents in corpus.
ngram_range (Tuple) –
analyzer (str) –
max_df (float or int) –
min_df (float or int) –
- fit(corpus: list) FrequencyCounter[source]
Fit algorithm to the corpus.
- Parameters
corpus (list) – List of (preprocessed) documents.
- Returns
self
- Return type
FrequencyCounter
- fit_transform(corpus: list, *args, **kwargs)[source]
Fit and transform data.
- Parameters
corpus (list) – List of (preprocessed) documents.
- Returns
Tuple of token names and frequencies matrix.
- Return type
tuple
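To make the shape of the returned tuple concrete, here is a minimal pure-Python sketch of a token count over a corpus. It is an illustration only, not the FrequencyCounter code: count_tokens is a hypothetical function, tokens are split on whitespace, and an integer min_df is assumed to mean a minimum number of documents, which this reference does not spell out.

```python
from collections import Counter
from typing import List, Tuple


def count_tokens(corpus: List[str], min_df: int = 1) -> Tuple[List[str], List[List[int]]]:
    """Return (token names, count matrix) for whitespace-tokenized documents.

    Hypothetical stand-in (assumption), not the library's code. A token is
    kept only if it appears in at least min_df documents.
    """
    tokenized = [doc.split() for doc in corpus]
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    names = sorted(token for token, df in doc_freq.items() if df >= min_df)
    index = {token: j for j, token in enumerate(names)}
    # One row per document, one column per kept token.
    matrix = [[0] * len(names) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for token in tokens:
            if token in index:
                matrix[i][index[token]] += 1
    return names, matrix


corpus = ["cat sat mat", "cat cat dog", "dog sat"]
names, matrix = count_tokens(corpus, min_df=2)
print(names)   # ['cat', 'dog', 'sat']
print(matrix)  # [[1, 0, 1], [2, 1, 0], [0, 1, 1]]
```

The token 'mat' occurs in a single document, so it is dropped when min_df=2. The matrix has one row per document and one column per remaining token.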
Feature selection
- class compshs.text.FeatureSelection[source]
Feature selection.
- get_df_from_corpus(corpus: list, attributes: list) DataFrame[source]
Convert a list of documents with attribute information into a pandas DataFrame object.
- Parameters
corpus (list) – List of documents.
attributes (list) – List (with same length as corpus) of attributes.
- Returns
DataFrame with corpus information.
- Return type
pd.DataFrame()
- spacy_doc_from_txt(txt: str, input_type: str = 'words') Doc[source]
Create a spaCy Doc() from text content.
Note: Words with length ≤ 2 are filtered out.
- Parameters
txt (str) – Text content to convert.
input_type (str) –
'words': txt contains only words separated by whitespace. Useful in case of preprocessed text.
'sentences': txt contains sentences separated by commas. Useful in case of raw text.
- Return type
spaCy Doc().
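The length filter mentioned in the note for spacy_doc_from_txt can be sketched in plain Python. This is only a guess at the behaviour for the 'words' input type: keep_long_words is a hypothetical helper, and the actual method builds a spaCy Doc rather than a list.

```python
from typing import List


def keep_long_words(txt: str, min_length: int = 3) -> List[str]:
    """Split on whitespace and drop words shorter than min_length characters.

    Hypothetical helper (assumption): with min_length=3, words of
    length 2 or less are filtered out.
    """
    return [word for word in txt.split() if len(word) >= min_length]


words = keep_long_words("it is a small corpus of texts")
print(words)  # ['small', 'corpus', 'texts']
```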
- transform(corpus: list, attributes: list, max_tokens: int = 2000, input_type: str = 'words')[source]
Transform corpus of documents into scattertext format using attribute information.
- Parameters
corpus (list) – List of documents.
attributes (list) – List (with same length as corpus) of attributes.
max_tokens (int) – Maximum number of tokens to keep.
input_type (str) –
'words': txt contains only words separated by whitespace. Useful in case of preprocessed text.
'sentences': txt contains sentences separated by commas. Useful in case of raw text.
- Return type
scattertext corpus.
Topic modelling
- class compshs.text.TopicModeler(model_name: str = 'LDA', n_components: int = 10)[source]
Topic modeler.
- Parameters
model_name (str) – Model name.
'LDA', Latent Dirichlet Allocation.
'NMF', Non-Negative Matrix Factorization.
n_components (int) – Number of topics.
- fit(matrix: Union[csr_matrix, ndarray]) TopicModeler[source]
Fit algorithm to the document term matrix.
- Parameters
matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).
- Returns
self
- Return type
TopicModeler
- fit_transform(matrix: Union[csr_matrix, ndarray], *args, **kwargs) Tuple[source]
Fit and transform data.
- Parameters
matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).
- Returns
Tuple of topic names and document topic distribution matrix (n_documents, n_components).
- Return type
tuple
- get_word_contributions() ndarray[source]
Get the matrix (n_components, n_words) of word contributions to each topic.
- transform(matrix: Union[csr_matrix, ndarray]) Tuple[source]
Transform data according to the fitted model.
- Parameters
matrix (sparse.csr_matrix, np.ndarray) – Document term matrix (n_documents, n_words).
- Returns
Tuple of topic names and document topic distribution matrix (n_documents, n_components).
- Return type
tuple
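A word contribution matrix of shape (n_components, n_words), such as the one get_word_contributions() is documented to return, is commonly read by listing the highest-weighted words of each topic. The sketch below shows that reading on toy values. top_words is a hypothetical helper, and the token names and numbers are invented for illustration rather than produced by the library.

```python
from typing import List, Sequence


def top_words(
    contributions: Sequence[Sequence[float]],
    names: Sequence[str],
    k: int = 3,
) -> List[List[str]]:
    """For each topic (row), return the k words with the largest contribution."""
    result = []
    for row in contributions:
        # Column indices sorted by decreasing contribution to this topic.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([names[j] for j in ranked[:k]])
    return result


# Toy values (assumption): 2 topics over a 4-word vocabulary.
names = ["vote", "party", "match", "goal"]
contributions = [
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.4, 0.5],
]
topics = top_words(contributions, names, k=2)
print(topics)  # [['vote', 'party'], ['goal', 'match']]
```

The column order of the contribution matrix has to match the order of the token names used to build the document term matrix, otherwise the labels are meaningless.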