Question

DNA sequence file format

0

7.1 years ago

213 • 0

Hello everyone! I've been looking, without any luck, for information about different sequence data formats: their purpose, intended use, advantages, disadvantages, and why they exist. Also, which one is the most complete, practical, usable, or important for correlating diseases with DNA information? Formats: FASTQ, FASTA, EMBL, GCG, GenBank, IG.

sequence next-gen sequencing • 2.6k views

ADD COMMENT • link 7.1 years ago by 213 • 0

0

Of those formats, the only two supported by most software are fastq and fasta. Of the two, fastq is usually better for sequencer reads because it contains base quality. It also compresses better than fasta (if there is no quality information), though the raw files are larger and the space taken in memory may be larger.
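To make "contains base quality" concrete, here is a minimal sketch of reading fastq records and decoding their quality strings. It assumes the common four-lines-per-record layout and a Phred+33 quality encoding; the function name `read_fastq` and the example read are made up for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-lines-per-record fastq stream.

    Assumes Phred+33 encoding unless another offset is given.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed fastq record near {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A made-up single read: 'I' decodes to 40, '5' to 20, '!' to 0.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].qualities)  # [40, 40, 20, 0]
```

A fasta file carries only the name and the bases, so the per-base scores shown here are exactly what is lost when fastq is converted to fasta.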

It looks like there is a description of the formats here: https://www.genomatix.de/online_help/help/sequence_formats.html

I don't know of any use for the listed formats other than fasta and fastq. They look like they were designed for human readability, with little thought given to machine processing of data. As a result they're no longer relevant unless you are using a specific legacy program requiring that format, or downloading records from a public database stored in that format. This is still very common; there are many useful records stored in public databases in obsolete formats. They are basically formats you may need to translate out of to do processing, but not formats you would translate into.
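As an illustration of "translating out of" a database format, the sketch below pulls the sequence from a single GenBank flat-file record and emits fasta. It is a simplified assumption-laden example: it reads only the LOCUS name and the ORIGIN block, handles one record, and ignores all annotation. The function name `genbank_to_fasta` and the tiny record are hypothetical; for real work an established parser is the safer choice.

```python
import io
from typing import List, TextIO


def genbank_to_fasta(handle: TextIO) -> str:
    """Convert the first GenBank record in `handle` to a fasta string."""
    name = "unknown"
    in_origin = False
    chunks: List[str] = []
    for line in handle:
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]  # second token on the LOCUS line
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence lines follow
        elif line.startswith("//"):
            break  # end of record
        elif in_origin:
            # Drop the leading position number and the spaces
            # between the 10-base blocks.
            chunks.append("".join(line.split()[1:]))
    sequence = "".join(chunks).upper()
    return f">{name}\n{sequence}\n"


# A made-up, heavily trimmed record for demonstration only.
record = (
    "LOCUS       TEST0001     20 bp    DNA     linear   UNK\n"
    "ORIGIN\n"
    "        1 gatcctccat atacaacggt\n"
    "//\n"
)
fasta_text = genbank_to_fasta(io.StringIO(record))
print(fasta_text)
```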

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2017-02-27

1

7.1 years ago

GenoMax 141k

Before GUI-based desktop programs (and sufficiently powerful desktop hardware), we used the GCG Wisconsin Package (and EMBOSS) for serious bioinformatics analyses on Unix. GCG was purchased by Accelrys at one point. I don't think it is available or supported anymore, and it may have been subsumed by Pipeline Pilot from Accelrys.

EMBOSS has a comparison table of various sequence data formats. It should cover the ones you are looking for.

"Also, which one is the most complete, practical, usable or important to correlate diseases with the DNA information."

Is this an assignment question? The question does not make a lot of sense in the context of sequence data formats you are referring to.

ADD COMMENT • link 7.1 years ago by GenoMax 141k

0

Thanks, Genomax; I've learned a lot from the EMBOSS link.

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

0

Sorry, I didn't explain it well. What I mean is: if I want to create a program that compares a sequence with the sequence of a disease, which format will give me the information I need?

ADD REPLY • link 7.1 years ago by 213 • 0

0

That would be fasta.

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

0

"the sequence of a disease"

What does that mean?

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

What I want to do is something like Promethease, where I have my DNA and can find out which diseases I could have. With a database of diseases, I could check each DNA sequence against it.

ADD REPLY • link 7.1 years ago by 213 • 0

0

The main purpose of the program is an Electronic Medical Record but we want to incorporate the DNA sequence.

ADD REPLY • link 7.1 years ago by 213 • 0

0

If it were that simple (even for Mendelian disorders), it would already be standard, codified practice. That it is not should give you an idea of the complexity of the task.

I am not sure what part of the world you are from (local regulations for diagnostic sequencing may or may not exist) but including DNA sequence in EMR requires very specific procedures in places where such regulations exist.

Note: I should ask if the diagnostic information is going to be provided to you (by a licensed authority) and you are only looking to include that information in your record. That would be a different situation.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

The diagnostic information will be included in the patient record; I do not have all the details of exactly how it will be managed. I've been working on this project for just a few weeks, and my task is to find out in which formats the sequence will be received.

ADD REPLY • link 7.1 years ago by 213 • 0

0

Here is some general information (not specifically from diagnostic sequencing point of view).

Standard output from high-throughput sequencers is fastq format files. You can read about fastq format here.
If you are getting data from Sanger sequencing (capillary sequencing), the data may either be in the original/raw .ab1 format or be converted to a fasta file. There is no defined standard for fasta, but in its simplest form it looks something like this:

>Identifier
DNA sequence - can be on
multiple lines
If you receive data in an aligned format: if fastq files were used as input, it will be in SAM/BAM format. If fasta files were used for the alignments, it may be in different formats depending on the program used to produce those alignments.
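The simple fasta layout described above (a ">" header followed by sequence on one or more lines) can be read with a few lines of code. This is a minimal sketch, assuming the identifier is the first whitespace-delimited token after ">"; `read_fasta` and the example records are hypothetical names and data.

```python
import io
from typing import Dict, List, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return {identifier: sequence}, joining multi-line sequences."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # text after the ID is a description
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Made-up example: the first record is split across two lines.
example = ">seq1 example description\nACGT\nTTGA\n>seq2\nGGCC\n"
sequences = read_fasta(io.StringIO(example))
print(sequences)  # {'seq1': 'ACGTTTGA', 'seq2': 'GGCC'}
```

Because fasta has no formal standard, real files may differ in header conventions, so the identifier handling here is one reasonable choice rather than a rule.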

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

I would have thought vcf is the most relevant format for an EMR?

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Yes, I thought of VCF, but I didn't find that format in many sequencing machines' specifications.

ADD REPLY • link 7.1 years ago by 213 • 0

0

VCF is derived from analysis of sequence files (should it be considered a sequence format?). It is not primary output from a sequencing run.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

It indeed isn't primary, but it's the only format that I think has clinical implications. I can't imagine a clinician downloading the fastq files, quickly aligning them, doing some variant calling, and then talking about the mutation you are carrying in relevant geneA.

For that you would need properly annotated vcf files.
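For a sense of what a VCF record holds, here is a minimal sketch that splits one tab-separated data line into its first eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The `parse_vcf_line` helper, the `Variant` type, and the example line (including its ID and INFO values) are invented for illustration; sample columns and header parsing are left out.

```python
from typing import Dict, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alts: List[str]
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse the fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 8:
        raise ValueError("not a VCF data line")
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = fields[:8]
    info_map: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            # Flag-style keys have no '=' and get an empty value.
            key, _, value = entry.partition("=")
            info_map[key] = value
    return Variant(chrom, int(pos), variant_id, ref, alt.split(","), info_map)


# A made-up line with two alternate alleles.
variant = parse_vcf_line("chr1\t12345\texample_id\tA\tG,T\t50\tPASS\tDP=100;DB\n")
print(variant.pos, variant.alts, variant.info)
```

Annotation tools typically add their results to the INFO column, which is why an annotated VCF is the form that carries clinically interpretable content.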

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

1

Sure. But so far the discussion has been for primary sequence data.

I think we have given @zuleyka enough information to think about. It is unclear what format data @zuleyka is going to receive or if any processing/analysis is expected (i.e. the data would be raw/derived results).

@zuleyka: Once you find out some of this information come back and post any follow-up questions.