data
Tools for loading a corpus.
Parse
- compshs.data.from_directory(directory_path: str, dataset_name: Optional[str] = None) → Dataset
Load a corpus from a directory containing text files.
- Parameters
directory_path (str) – Path to the directory containing text files (.txt).
dataset_name (str, optional) – Name of the returned Dataset. The directory name is used if not specified.
- Returns
The loaded dataset.
- Return type
Dataset
Example
>>> from compshs.data import from_directory
>>> directory_path = 'path_to_txt_files'
>>> dataset = from_directory(directory_path)
>>> dataset.name
'path_to_txt_files'
- compshs.data.from_sql(database_path: str, dataset_name: str, table_name: str, document_column_name: str, document_name_field: str, document_text_field: str, filter_field: Optional[str] = None, excluded_value: Optional[str] = None, included_value: Optional[str] = None) → Dataset
Load a corpus from a SQLite database where documents are stored as JSON entries.
- Parameters
database_path (str) – Path to the .db file.
dataset_name (str) – Name of the returned Dataset.
table_name (str) – Name of the database table containing JSON documents.
document_column_name (str) – Name of the table column containing JSON documents.
document_name_field (str) – Name of the JSON field containing the document label.
document_text_field (str) – Name of the JSON field containing the document's textual content.
filter_field (str, optional) – JSON key to filter by (default=None).
excluded_value (str, optional) – Value of the filter_field key to exclude (default=None).
included_value (str, optional) – Value of the filter_field key to include (default=None).
- Return type
Dataset
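The database layout that from_sql expects can be illustrated with a minimal sketch using only the standard library. This is not the compshs implementation: the helper load_json_documents, the table articles, the column content, and the JSON fields title, body and lang are all hypothetical, and the filter semantics shown (skip entries whose filter_field equals excluded_value, keep only entries whose filter_field equals included_value) are an assumed reading of the parameters above.

```python
import json
import sqlite3
from typing import Dict, Optional


def load_json_documents(
    connection: sqlite3.Connection,
    table_name: str,
    document_column_name: str,
    document_name_field: str,
    document_text_field: str,
    filter_field: Optional[str] = None,
    excluded_value: Optional[str] = None,
    included_value: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: map document labels to document texts."""
    # Table and column names cannot be bound as SQL parameters,
    # so they are (naively) double-quoted into the statement.
    query = f'SELECT "{document_column_name}" FROM "{table_name}"'
    documents = {}
    for (raw_json,) in connection.execute(query):
        entry = json.loads(raw_json)
        if filter_field is not None:
            value = entry.get(filter_field)
            # Assumed semantics: drop entries matching excluded_value.
            if excluded_value is not None and value == excluded_value:
                continue
            # Assumed semantics: keep only entries matching included_value.
            if included_value is not None and value != included_value:
                continue
        documents[entry[document_name_field]] = entry[document_text_field]
    return documents


# Build a small in-memory database with one JSON document per row.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (content TEXT)")
rows = [
    {"title": "doc_a", "body": "First text.", "lang": "en"},
    {"title": "doc_b", "body": "Second text.", "lang": "fr"},
]
connection.executemany(
    "INSERT INTO articles (content) VALUES (?)",
    [(json.dumps(row),) for row in rows],
)

corpus = load_json_documents(
    connection, "articles", "content", "title", "body",
    filter_field="lang", excluded_value="fr",
)
print(corpus)  # {'doc_a': 'First text.'}
```

In this sketch each row holds one JSON object, document_name_field and document_text_field name keys inside that object, and the filter is applied per document.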
- compshs.data.get_query(table_name: str, document_column_name: str, document_name_field: str, document_text_field: str, filter_field: Optional[str] = None, excluded_value: Optional[str] = None, included_value: Optional[str] = None) → Tuple
Generate a parametrized SQL query.
- Parameters
table_name (str) – Name of the database table containing JSON documents.
document_column_name (str) – Name of the table column containing JSON documents.
document_name_field (str) – Name of the JSON field containing the document label.
document_text_field (str) – Name of the JSON field containing the document's textual content.
filter_field (str, optional) – JSON key to filter by (default=None).
excluded_value (str, optional) – Value of the filter_field key to exclude (default=None).
included_value (str, optional) – Value of the filter_field key to include (default=None).
- Returns
Tuple of query and parameters.
- Return type
tuple
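A rough sketch of what a parametrized query over JSON documents might look like is given below. It is an illustration only, not the query that get_query actually returns: the helper build_query is hypothetical, and the use of SQLite's json_extract function, the placeholder style, and the filter semantics are assumptions.

```python
from typing import List, Optional, Tuple


def build_query(
    table_name: str,
    document_column_name: str,
    document_name_field: str,
    document_text_field: str,
    filter_field: Optional[str] = None,
    excluded_value: Optional[str] = None,
    included_value: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Hypothetical helper: return (query, parameters) for a JSON column."""
    # Identifiers cannot be bound as parameters, so they are quoted inline.
    column = f'"{document_column_name}"'
    query = (
        f"SELECT json_extract({column}, ?), json_extract({column}, ?) "
        f'FROM "{table_name}"'
    )
    # JSON paths and filter values are passed as bound parameters.
    params = [f"$.{document_name_field}", f"$.{document_text_field}"]
    conditions = []
    if filter_field is not None:
        if excluded_value is not None:
            conditions.append(f"json_extract({column}, ?) != ?")
            params += [f"$.{filter_field}", excluded_value]
        if included_value is not None:
            conditions.append(f"json_extract({column}, ?) = ?")
            params += [f"$.{filter_field}", included_value]
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query, params


query, params = build_query(
    "articles", "content", "title", "body",
    filter_field="lang", included_value="en",
)
print(query)
print(params)
```

A tuple of this shape can be passed to sqlite3 as cursor.execute(query, params), which keeps filter values out of the SQL string itself. Note that in this sketch a != comparison also drops rows where the filtered key is missing, because comparing NULL yields NULL in SQL.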