Hello everyone!
I've been looking, without any luck, for information about the different sequence data formats: their purpose, intended use, advantages, disadvantages, and why they exist. Also, which one is the most complete, practical, usable or important for correlating diseases with DNA information?
Formats:
FASTQ format
FASTA format
EMBL format
GCG format
GenBank format
IG format
Of those formats, the only two that are supported by most software are fastq and fasta. Of those, fastq is usually better for sequencer reads since it contains base quality. Also, it compresses better than fasta (if there is not quality information), though the raw files are larger, and the space taken in memory may be larger.
I don't know of any use for the listed formats other than fasta and fastq. They look like they were designed for human readability, with little thought given to machine processing of data. As a result they're no longer relevant unless you are using a specific legacy program requiring that format, or downloading records from a public database stored in that format. This is still very common; there are many useful records stored in public databases in obsolete formats. They are basically formats you may need to translate out of to do processing, but not formats you would translate into.
Before GUI-based desktop programs (and sufficiently powerful desktop hardware), we used the GCG Wisconsin Package (and EMBOSS) for serious bioinformatics analyses on Unix. GCG was purchased by Accelrys at one point. I don't think it is available or supported anymore, and it may have been subsumed by Pipeline Pilot from Accelrys.
Sorry, I didn't explain it well. What I meant to ask is: if I want to create a program that compares a sequence with the sequence associated with a disease, which format will give me the information that I need?
What I want to do is something like Promethease, where I provide my DNA and can find out which diseases I could have. So, with a database of diseases, I could check each DNA sequence against it.
If it were that simple (even for Mendelian disorders), it would already be standard, codified practice. The fact that it is not should give you an idea of the complexity of the task.
I am not sure what part of the world you are from (local regulations for diagnostic sequencing may or may not exist) but including DNA sequence in EMR requires very specific procedures in places where such regulations exist.
Note: I should ask if the diagnostic information is going to be provided to you (by a licensed authority) and you are only looking to include that information in your record. That would be a different situation.
The diagnostic information will be included in the patient record. I do not have all the information about exactly how it will be managed. I've been working on this project for just a few weeks, and my task is to find out in which formats the sequence will be received.
Here is some general information (not specifically from diagnostic sequencing point of view).
Standard output from high-throughput sequencers is fastq format files. You can read about fastq format here.
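To make the fastq layout concrete, here is a minimal sketch of a reader in Python. It assumes the common four-lines-per-record layout (header starting with `@`, sequence, a `+` separator line, quality string) and Phred+33 quality encoding; both are assumptions about your data, and the function name `read_fastq` and the example read are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, qualities) from a FASTQ stream.

    Assumes four lines per record and Phred+33 quality encoding.
    Reading stops at the first empty header line (end of file).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        # Convert each quality character to its Phred score (ASCII code - 33).
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Made-up example record, only to show the shape of the data.
example = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(read_fastq(StringIO(example)))
```

For real work an established parser (for example the one in Biopython) is probably a better choice than a hand-written reader.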
If you are getting data from Sanger sequencing (capillary sequencing), then the data may either be in the original/raw .ab1 format or could be converted to a fasta file. There is no formally defined standard for fasta, but in its simplest form it would be something like this:
>Identifier
DNA sequence - can be on multiple lines
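As a rough illustration of that layout, here is a small Python sketch that reads fasta records and joins multi-line sequences. The function name `read_fasta` and the example sequences are made up, and it only handles the simple form shown above (a `>` identifier line followed by one or more sequence lines).

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA stream.

    Sequence lines belonging to one record are concatenated.
    Blank lines and any text before the first '>' line are skipped.
    """
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:], []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up example input with one multi-line and one single-line record.
fasta_text = ">seq1 example\nGATT\nACA\n>seq2\nCCGG\n"
fasta_records = list(read_fasta(StringIO(fasta_text)))
```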
If you receive data in an aligned format: if fastq files were used as input, then it will be in SAM/BAM format. If fasta files were used for the alignments, then it may be in different formats depending on the program used to produce those alignments.
It indeed isn't primary, but it's the only format that I would think has clinical implications. I can't imagine a clinician downloading the fastq files, quickly aligning them, doing some variant calling, and then talking about the mutation you are carrying in relevant geneA.
For that you would need properly annotated vcf files.
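To give an idea of what a program would read from such a file, here is a minimal Python sketch that splits one vcf data line into its fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name `parse_vcf_line` and the example line are made up; header lines (starting with `#`), per-sample columns, and the meaning of individual annotation keys are not handled here.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue  # '.' means the INFO column is empty
        key, sep, value = item.partition("=")
        # Keys without '=' are flags; store them as True.
        info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": qual,
        "filter": flt,
        "info": info_dict,
    }


# Made-up data line, not taken from any real record.
variant = parse_vcf_line("1\t12345\t.\tA\tG\t50\tPASS\tDP=20;DB\n")
```

Which annotations end up in the INFO column depends entirely on the annotation tool that produced the file, so the keys above are placeholders.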
Sure. But so far the discussion has been for primary sequence data.
I think we have given @zuleyka enough information to think about. It is unclear what format data @zuleyka is going to receive or if any processing/analysis is expected (i.e. the data would be raw/derived results).
@zuleyka: Once you find out some of this information come back and post any follow-up questions.
It looks like there is a description of the formats here: https://www.genomatix.de/online_help/help/sequence_formats.html