Module: ncbi.datasets.package.dataset

Representations of a downloaded NCBI Datasets Package.

NCBI Datasets provides data as zip archives for Genome, Gene, Pathogen and Virus resources. Each class in this module carries a dataset catalog that helps determine a package's file contents programmatically.

Examples

To get started, download a package and then create a generic Dataset wrapper:

>>> from ncbi.datasets.package.dataset import get_dataset_from_file
>>> package = get_dataset_from_file(path_to_file, dataset_type)
>>> for report in package.get_data_reports():
...     # do something with the protobuf report object
...     pass

ncbi.datasets.package.dataset.get_dataset_from_file(zip_file_or_directory: str, dataset_type: str) → ncbi.datasets.package.dataset.Dataset

Create a Dataset-derived object of type ‘dataset_type’ and return it.

Returns

A subclass of the class ‘Dataset’ as specified by the caller.

class ncbi.datasets.package.dataset.Dataset(zipfile_or_directory: str)

Bases: object

Base class for extracting files from a datasets package

Provides functions to extract files from a datasets package based on the file names and types in the package's catalog file

is_zipped() → bool

Return True if the dataset is stored in a zip file

get_file_root_dir() → str

Return the data directory within the dataset (e.g. ncbi_dataset/data)

get_catalog() → Dict[str, Any]

Return the datasets file catalog as a dictionary
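To inspect what a package contains, the catalog dictionary can be printed in a readable form. The following is a minimal sketch that assumes only that get_catalog() returns data that can be rendered as JSON; format_catalog is a hypothetical helper name and is not part of the library.

```python
import json
from typing import Any, Dict


def format_catalog(catalog: Dict[str, Any]) -> str:
    """Render a catalog dictionary as indented, key-sorted JSON text."""
    # default=str keeps the helper from failing on values json cannot encode.
    return json.dumps(catalog, indent=2, sort_keys=True, default=str)


# Usage with a dataset object (not executed here):
#   print(format_catalog(package.get_catalog()))
```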

get_file_names_by_type(file_type: str) → List[str]

Return names of all files of type ‘file_type’, e.g. ‘PROTEIN_FASTA’

get_files_by_type(file_type: str) → Iterator[Tuple[str, str]]

Return contents of all files of type ‘file_type’ along with their names

get_file_handles_by_type(file_type: str) → Iterator[Tuple[TextIO, str]]

Return file handles for all files of type ‘file_type’ along with their names
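File handles are useful when files are too large to read in full. As an illustrative sketch, the helper below consumes (handle, name) pairs in the order given by the signature above and counts FASTA header lines per file. The name count_fasta_records is hypothetical, and the sketch assumes each handle yields text lines.

```python
from typing import Dict, Iterable, TextIO, Tuple


def count_fasta_records(handles: Iterable[Tuple[TextIO, str]]) -> Dict[str, int]:
    """Count FASTA header lines (those starting with '>') per file."""
    counts: Dict[str, int] = {}
    for handle, name in handles:
        # Iterate line by line so large sequence files are never fully loaded.
        counts[name] = sum(1 for line in handle if line.startswith(">"))
    return counts


# Usage with a dataset object (not executed here):
#   count_fasta_records(package.get_file_handles_by_type("PROTEIN_FASTA"))
```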

get_file_types() → List[str]

Return all file types available in the current dataset
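Combining get_file_types() with get_file_names_by_type() gives a quick inventory of a package. The sketch below relies only on those two documented methods; list_files_by_type is a hypothetical helper name, and the dataset argument is assumed to be any Dataset-derived object.

```python
from typing import Any, Dict, List


def list_files_by_type(dataset: Any) -> Dict[str, List[str]]:
    """Map each file type in the dataset to the names of its files."""
    return {
        file_type: list(dataset.get_file_names_by_type(file_type))
        for file_type in dataset.get_file_types()
    }


# Usage with a dataset object (not executed here):
#   for file_type, names in list_files_by_type(package).items():
#       print(file_type, len(names))
```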

get_file_content(file_name: str) → str

Return full text of file ‘file_name’

get_file_handle(file_name: str) → TextIO

Return a handle to a file, given its name within the dataset's data directory

Parameters

file_name – Name of file within the data directory, e.g. if the full datasets path is ncbi_dataset/data/GCF_000001405.39/chrX.fna, file_name should be GCF_000001405.39/chrX.fna

Returns

Handle to the specified file
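Because file_name is relative to the data directory, a full package path has to be trimmed before it is passed in. The sketch below shows one way to do that, assuming forward-slash paths as in the example above and a root directory such as the one returned by get_file_root_dir(). The helper name to_data_relative_name is hypothetical.

```python
def to_data_relative_name(full_path: str, root_dir: str) -> str:
    """Strip the data directory prefix from a path inside the package."""
    prefix = root_dir.rstrip("/") + "/"
    if not full_path.startswith(prefix):
        raise ValueError(f"{full_path!r} is not under {root_dir!r}")
    return full_path[len(prefix):]


# Usage with a dataset object (not executed here):
#   name = to_data_relative_name(full_path, package.get_file_root_dir())
#   handle = package.get_file_handle(name)
```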

stream_reports(file_type: str, protobuf_report_type: Any) → Any

Retrieve report records defined via protobuf schema from jsonl files.

Parameters
  • file_type – The type of file from the dataset catalog, e.g. ‘DATA_REPORT’ or ‘SEQUENCE_REPORT’.

  • protobuf_report_type – Schema, defined using GRPC protobuf, for the current dataset and file type.

Returns

Yields a set of protobuf objects for the dataset and file type.
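For readers unfamiliar with the jsonl layout, each line of such a file holds one complete JSON record. The sketch below illustrates that layout using only the standard library. It is not the library's implementation: stream_reports() yields protobuf objects, whereas this hypothetical iter_jsonl_records helper yields plain dictionaries.

```python
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl_records(handle: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-blank line of a JSON Lines file."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)
```

Reading one record at a time keeps memory use flat regardless of how many reports a package holds.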

class ncbi.datasets.package.dataset.AssemblyDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve Assembly reports

Methods to read Assembly and Assembly Sequence reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.assembly_pb2.AssemblyDataReport]

Retrieve assembly reports

Returns

Yields a set of AssemblyDataReport protobuf objects

get_sequence_reports() → Iterator[ncbi.datasets.v1.reports.assembly_sequence_info_pb2.SequenceInfo]

Retrieve assembly sequence reports

Returns

Yields a set of Assembly SequenceInfo protobuf objects
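Since both methods return iterators, reports can be tallied without holding them all in memory. A minimal sketch, assuming an AssemblyDataset-like object that exposes the two methods above; count_assembly_reports is a hypothetical helper name.

```python
from typing import Any, Dict


def count_assembly_reports(dataset: Any) -> Dict[str, int]:
    """Count data reports and sequence reports without storing them."""
    return {
        "data_reports": sum(1 for _ in dataset.get_data_reports()),
        "sequence_reports": sum(1 for _ in dataset.get_sequence_reports()),
    }
```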

class ncbi.datasets.package.dataset.GeneDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve Gene reports

Methods to read Gene reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.gene_pb2.GeneDescriptor]

Retrieve gene report objects

Returns

Yields a set of GeneDescriptor protobuf objects
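When exploring a large package, it can help to look at only the first few reports. The sketch below should apply to any class in this module that provides get_data_reports(); the helper name preview_reports and its limit parameter are hypothetical.

```python
from itertools import islice
from typing import Any, List


def preview_reports(dataset: Any, limit: int = 5) -> List[Any]:
    """Return at most `limit` reports from the dataset's report iterator."""
    # islice stops pulling from the iterator once `limit` items are taken.
    return list(islice(dataset.get_data_reports(), limit))
```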

class ncbi.datasets.package.dataset.VirusDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve Virus reports

Methods to read Virus reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.virus_pb2.VirusAssembly]

Retrieve virus assembly objects

Returns

Yields a set of virus assembly report protobuf objects

class ncbi.datasets.package.dataset.MicrobiggeDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve MicroBigge pathogen reports

Methods to read MicroBigge reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.microbigge_pb2.MicroBiggeReport]

Retrieve MicroBigge data report objects

Returns

Yields a set of MicroBigge report protobuf objects

Generated October 22, 2021