data

Tools for loading corpora.

Parse

compshs.data.from_directory(directory_path: str, dataset_name: Optional[str] = None) → Dataset

Load a corpus from a directory containing text files.

Parameters
  • directory_path (str) – Path to the directory containing text files (.txt).

  • dataset_name (str, optional) – Name of the returned Dataset. Defaults to the directory name if not specified.

Returns

Dataset built from the text files in the directory.

Return type

Dataset

Example

>>> from compshs.data import from_directory
>>> directory_path = 'path_to_txt_files'
>>> dataset = from_directory(directory_path)
>>> dataset.name
'path_to_txt_files'
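
The behaviour described above (read the .txt files, fall back to the directory name) can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `read_text_directory` is a hypothetical helper, and it returns a plain name and dictionary rather than a Dataset.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple


def read_text_directory(
    directory_path: str, dataset_name: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: collect the .txt files of a directory."""
    directory = Path(directory_path)
    # Fall back to the directory name when no dataset name is given.
    name = dataset_name if dataset_name is not None else directory.name
    # Map each file stem to its content; files without a .txt suffix are skipped.
    documents = {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(directory.glob("*.txt"))
    }
    return name, documents


# Build a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "path_to_txt_files"
    corpus_dir.mkdir()
    (corpus_dir / "a.txt").write_text("first document", encoding="utf-8")
    (corpus_dir / "b.txt").write_text("second document", encoding="utf-8")
    (corpus_dir / "notes.md").write_text("ignored", encoding="utf-8")
    name, documents = read_text_directory(str(corpus_dir))
```
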

compshs.data.from_sql(database_path: str, dataset_name: str, table_name: str, document_column_name: str, document_name_field: str, document_text_field: str, filter_field: Optional[str] = None, excluded_value: Optional[str] = None, included_value: Optional[str] = None) → Dataset

Load a corpus from an SQLite database where documents are stored as JSON entries.

Parameters
  • database_path (str) – Path to the .db file.

  • dataset_name (str) – Name of the returned Dataset.

  • table_name (str) – Name of the database table containing JSON documents.

  • document_column_name (str) – Name of the table column containing JSON documents.

  • document_name_field (str) – Name of the JSON field containing the document label.

  • document_text_field (str) – Name of the JSON field containing the document text.

  • filter_field (str, optional) – JSON key to filter by (default=None).

  • excluded_value (str, optional) – Value of filter_field for which documents are excluded (default=None).

  • included_value (str, optional) – Value of filter_field for which documents are included (default=None).

Return type

Dataset
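
To make the expected storage layout concrete, here is a minimal sketch of a table whose single column holds one JSON document per row, read back with the standard `sqlite3` and `json` modules. It is an illustration under assumptions, not the library's code: the table name `documents`, the column `content`, the JSON keys `title`, `text` and `lang`, and the helper `read_documents` are all hypothetical, and the filtering is done in Python rather than in SQL.

```python
import json
import sqlite3

# In-memory database standing in for the .db file.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (content TEXT)")
entries = [
    {"title": "doc_a", "text": "First document.", "lang": "en"},
    {"title": "doc_b", "text": "Second document.", "lang": "fr"},
]
connection.executemany(
    "INSERT INTO documents (content) VALUES (?)",
    [(json.dumps(entry),) for entry in entries],
)


def read_documents(connection, table_name, document_column_name,
                   document_name_field, document_text_field,
                   filter_field=None, excluded_value=None, included_value=None):
    """Hypothetical helper: map document labels to texts, with optional filtering."""
    cursor = connection.execute(
        f'SELECT "{document_column_name}" FROM "{table_name}"'
    )
    documents = {}
    for (raw,) in cursor:
        entry = json.loads(raw)
        if filter_field is not None:
            value = entry.get(filter_field)
            # Skip entries whose filter value is explicitly excluded.
            if excluded_value is not None and value == excluded_value:
                continue
            # Skip entries that do not match the value to include.
            if included_value is not None and value != included_value:
                continue
        documents[entry[document_name_field]] = entry[document_text_field]
    return documents


corpus = read_documents(connection, "documents", "content", "title", "text",
                        filter_field="lang", excluded_value="fr")
```

With this layout, a call to from_sql would presumably pass "documents" as table_name, "content" as document_column_name, "title" as document_name_field and "text" as document_text_field.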

compshs.data.get_query(table_name: str, document_column_name: str, document_name_field: str, document_text_field: str, filter_field: Optional[str] = None, excluded_value: Optional[str] = None, included_value: Optional[str] = None) → Tuple

Generate a parametrized SQL query.

Parameters
  • table_name (str) – Name of the database table containing JSON documents.

  • document_column_name (str) – Name of the table column containing JSON documents.

  • document_name_field (str) – Name of the JSON field containing the document label.

  • document_text_field (str) – Name of the JSON field containing the document text.

  • filter_field (str, optional) – JSON key to filter by (default=None).

  • excluded_value (str, optional) – Value of filter_field for which documents are excluded (default=None).

  • included_value (str, optional) – Value of filter_field for which documents are included (default=None).

Returns

Tuple of query and parameters.

Return type

tuple
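
As an illustration of what a parametrized query over JSON documents can look like, the sketch below builds a query string with `?` placeholders and a matching tuple of values. It is one possible shape, not the query that get_query actually returns: `build_query_sketch` is a hypothetical name, and the use of SQLite's `json_extract` function is an assumption. Table and column names cannot be bound as parameters, so they are interpolated as quoted identifiers, while JSON paths and filter values are bound.

```python
from typing import Optional, Tuple


def build_query_sketch(
    table_name: str,
    document_column_name: str,
    document_name_field: str,
    document_text_field: str,
    filter_field: Optional[str] = None,
    excluded_value: Optional[str] = None,
    included_value: Optional[str] = None,
) -> Tuple[str, tuple]:
    """Hypothetical builder returning (query, parameters)."""
    column = f'"{document_column_name}"'
    # Select the label and the text out of each JSON document.
    query = (
        f"SELECT json_extract({column}, ?), json_extract({column}, ?) "
        f'FROM "{table_name}"'
    )
    params = [f"$.{document_name_field}", f"$.{document_text_field}"]
    conditions = []
    if filter_field is not None:
        path = f"$.{filter_field}"
        if included_value is not None:
            conditions.append(f"json_extract({column}, ?) = ?")
            params += [path, included_value]
        if excluded_value is not None:
            conditions.append(f"json_extract({column}, ?) != ?")
            params += [path, excluded_value]
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query, tuple(params)


query, params = build_query_sketch(
    "documents", "content", "title", "text",
    filter_field="lang", excluded_value="fr",
)
```

The resulting pair can be passed to `sqlite3.Connection.execute(query, params)` on a build of SQLite that includes the JSON functions.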