OpenCGA Storage Overview
OpenCGA Storage is the component in charge of ingesting biological big data and providing a query language over it. Its use cases include serving as a data source for analyses such as [....] and for visualization in compatible viewers such as GenomeMaps. The storage engine handles the most common file formats for NGS.
OpenCGA supports a growing number of biological file formats used in a typical NGS pipeline. Among these formats, the focus is on genomic variants because of their complexity and the analyses they enable.
Handling genomic variants is a main goal of OpenCGA because of their importance in NGS analysis. Fast reading and filtering of variants speeds up analysis, giving faster and more accurate results. The variant storage in OpenCGA has these properties:
- Study or Dataset oriented
Multiple datasets for the same species, not necessarily related to each other, can live in the same database. The data is kept in the same storage but is not merged.
- Cohort definition
- Variant annotation
More information at Variant Storage Engine
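The study-oriented layout described above can be pictured with a small in-memory sketch. Every name here (`Variant`, `VariantStore`, `load`, `query`, `genotypes`) is hypothetical and is not the OpenCGA API; the only point is that several studies share one store while their data stays unmerged.

```python
from collections import defaultdict
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    reference: str
    alternate: str


class VariantStore:
    """Hypothetical single storage holding several unmerged studies."""

    def __init__(self):
        # study name -> {variant -> {sample name -> genotype}}
        self._studies = defaultdict(dict)

    def load(self, study, variant, genotypes):
        # Data is only ever written under the key of its own study.
        self._studies[study].setdefault(variant, {}).update(genotypes)

    def query(self, study, chromosome, start, end):
        # Variants of ONE study inside a region, sorted by position.
        hits = [
            v for v in self._studies.get(study, {})
            if v.chromosome == chromosome and start <= v.position <= end
        ]
        return sorted(hits, key=lambda v: v.position)

    def genotypes(self, study, variant):
        # Copy of the sample genotypes recorded for a variant in one study.
        return dict(self._studies.get(study, {}).get(variant, {}))


store = VariantStore()
snv = Variant("1", 1000, "A", "G")
store.load("study_a", snv, {"sample_1": "0/1"})
store.load("study_b", snv, {"sample_9": "1/1"})
print(store.genotypes("study_a", snv))
```

The same variant is present in both studies, yet asking for `study_a` never returns samples that were loaded into `study_b`.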
- Coverage calculation
- Alignment stats
More information at Alignment Storage Engine
- Sequence (fastA)
- Feature formats (GFF, BED, BigWIG...)
Although the internal implementation does not use the GA4GH models directly (the models are still unstable and lack some features that are mandatory for OpenCGA), OpenCGA-Storage is compatible with GA4GH: it implements the APIs and can produce and consume the GA4GH models.
Storage Engines share a common definition, and there are multiple implementations built on different technologies to suit the requirements and resources of each study. This document specifies the core functionality that all implementations must share. Technical details and customization parameters are explained in each plugin-specific section.
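One way to picture a common definition with several pluggable implementations is an abstract base class plus a registry. This is a minimal sketch, not the OpenCGA plugin mechanism: `StorageEngine`, `register`, `ENGINES`, `InMemoryEngine` and `get_engine` are all names invented for illustration.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical core contract that every backend must fulfil."""

    @abstractmethod
    def index(self, records):
        """Ingest an iterable of records; return how many were stored."""

    @abstractmethod
    def query(self, **filters):
        """Return stored records whose fields equal every given filter."""


# Registry mapping a backend name to its implementing class.
ENGINES = {}


def register(name):
    def decorator(cls):
        ENGINES[name] = cls
        return cls
    return decorator


@register("memory")
class InMemoryEngine(StorageEngine):
    """Toy backend that keeps records in a Python list."""

    def __init__(self):
        self._records = []

    def index(self, records):
        before = len(self._records)
        self._records.extend(dict(r) for r in records)
        return len(self._records) - before

    def query(self, **filters):
        return [
            r for r in self._records
            if all(r.get(key) == value for key, value in filters.items())
        ]


def get_engine(name):
    # Callers pick a backend by name and only rely on the shared contract.
    return ENGINES[name]()
```

Code written against `StorageEngine` does not change when another backend is registered under a different name, which is the property the shared definition is meant to give.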
The index pipeline is the process of ingesting data into an OpenCGA-Storage backend. To simplify data management, all index pipelines follow the same schema.
Read more information about index pipelines at Index Pipelines.
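This page does not list the steps of that schema. The sketch below assumes an extract, transform, load split purely for illustration, applied to VCF-style text lines; the function names and the list used as a backend are hypothetical and not part of OpenCGA.

```python
def extract(lines):
    # Keep data lines only: drop blank lines and "#" header lines.
    return [line for line in lines if line.strip() and not line.startswith("#")]


def transform(lines):
    # Parse the first five tab-separated VCF columns into a simple record.
    records = []
    for line in lines:
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        records.append({
            "chromosome": chrom,
            "position": int(pos),
            "reference": ref,
            "alternate": alt,
        })
    return records


def load(records, backend):
    # The "backend" here is just a list standing in for a real database.
    backend.extend(records)
    return len(records)


def index(lines, backend):
    # Every input goes through the same three stages in the same order.
    return load(transform(extract(lines)), backend)


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t.\tPASS\t.",
]
backend = []
print(index(vcf_lines, backend), backend)
```

Keeping the stage order fixed means a new file type only needs its own `transform`, while the surrounding flow stays the same.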
All the configuration needed to work with OpenCGA Storage is centralized in the storage-configuration.yml file, usually located in the configuration folder.
This file contains the configuration for all the storage engines, the database connections, and other CellBase and server information.
The Storage Configuration page gives an extended explanation of the file structure and all the parameters.
OpenCGA Storage has its own command line for interacting with it directly. This command line is not intended for use in a full-featured installation because it is not connected to OpenCGA Catalog.
Use it only for development or testing purposes.
See more information at Storage Command Line
OpenCGA is an open source project and it is freely available.