Question

How to interpret a FASTA file

2.9 years ago

Student ▴ 30

Hello.

I would like to learn how to analyze data from DNA sequencing experiments. Once such an experiment is done, we obtain a file in a format called FASTA. It is a text file that contains the reads and their sequences.

If I understood correctly, a read is a piece of DNA from the genome (I am talking about sequencing the entire genome).

My question is: is a read a piece of DNA located at a random position in the genome, or does it represent the sequence of a specific gene of the genome?

If I download a FASTA file from NCBI, there is a description line like this:

>lcl|CP003685.1_cds_AFN02977.1_1 [locus_tag=PFC_00005] [protein=hypothetical protein] [protein_id=AFN02977.1] [location=43..2532] [gbkey=CDS]

And then the sequence of nucleotides like this:

ATGAGGAAAAAACTTGTTGGAATATTGACAATATTGGTTGCTTTGGGCATGTTAGTAAGCCCACTTCTAA
AGCCAGTAGCAGCAGAGGATCAGAAGGTTCTTAAGATAGCAATGTACTCAGCAACTGGTTCTCTATTTAT
GGGTGCATGGAACCCAAGTTCAGCAGGTTTCAGAGATGTGTATTCAACTAGAGCTGCAGGGTTGGCCCAG
GATGAGGGAGCATACGTTTGGGGTATTGAGGGTGACTACCACCCATACAGATGTACCTTAGTTGAAGGTA
AGGAAAATGTAAAGGTACCAGAAACTGCTTTAGTCTTCAATACAACCACCAAGAAGTGGCAACCTGATCA
TGCTGGAGAAGTTGCTCCAACCGCGGCTACCTTCAAGTGCCAAAAGATCTACTTCCACGATGGCCACAAG
CTCACAGTTGCTGATGTAATGTACGGCTACTACTGGTCATGGGAGTGGTCAAGCCAAGATGGAGACCAAG
ATCCATACTTCGATGCAAACGAGGCTGACTGGAGCGCAGAAGCAATGCAAAAGCTCCTCGGTATTGAGGT
TAAGGAAGAAGACGATAATTACTTTGTAGTAACCATCTACCACACCTACACATTCCCACCCTACAAGAAG
TATCAATACTGGTACTTCACGCCCTACGCAAGCTATCCATGGCAACTCATTTATGCCATGAGCGAACTTG
TTGCCGAGAGCAACAGGGCTAGGTTTGCCAACCAGACTGAAGGTGTAGAATTGTTCTCATTCAGTGAATC
TACTGAAGACATTCAACAGATTGATATGCTAACACCTTCTCACGCTAAGAAGGTTGCTGAAATGCTTGAG
AAGTTGAAGAATGAGAAGCCAATACCTGACGTTATTAAGGACTTCATCTATGACGAGCAGGACGAGATTA
AGGAATATGACTCCATTATCAACTTTATAAACACTCACAACCACATGTTCATTTCAACTGGGCCATATCT
AATTGATGTCTACAAGCCTGAGAACCTCTATCTAAGGTATGTTAAGTTTGACAAGTGGGTCAAGCCAGAG
TTTGCTGAGGACATGTACAACTTTGAGCCATACTTCGATGTTGTAGAGCTTTATGGTATCCAGAACGAGA
ACACGATAATTCTTGGTGTAGCAAGTGGAGAGTACGATGTTTCATGGTACTCATTCCCATCATTCACGTT
CTCTGGACTTAGTGATGAGCAGAAGAGCAACATTGACATGTACGTTAACATTGGTGGATTCTGGGACATG
GTCTGGAACCCAGTGCACGACAAGGATAATCCATATGTGATTACAGTTGGTGACAAGAAGTACTTCAACC
CATTCGCAATTAGAGAGATAAGATTTGCAATGGAATACCTCATCAACAGAAACTACATCATCCAGAACAT
CCTCCAGGGTTCAGGTGGACCAATGTACACTCCATGGACAAGTGGTGATACGGTTGCAATCGAGAAGCTA
CAGCCAGTTGTCGATGCCTTTGGTATCGATGCACAGGGTGACGAAGAGTATGCTCTCCAGCTAATTGAAC
AGGCGATGCAAAAGGCCGCTAGAGAGTTAGCTAACATGGGATATGAGCTCAAGAAGGTTAACGGAAAGTG
GTACTTCAACGGAGAGCCAGTTAAGATCGTTGGAATTGGAAGACAAGAAGATGAGAGAAAGGATGAGGCT
TACTACATTGCAGAAATCCTTAGAAAGGCTGGATTTGAGGTTGAAGTTAAGATAGTTGACAGAAGAACTG
CCAACCAGATAGTATACCTCTCAGACCCAGCTAACTATGAATGGGGTTATTACACTGAAGGATGGGTAGC
AAGTGGAAGCGTTCTCTTCTCAATTAGCAGAATCCTACAGTACTACACCACAGCATGGTTTGGTCCAGGA
TTCGTAGGTTGGAAGTTCACACCAGAGAACACATACAGAGCAACAGTAGAAGAAGTCCTCAAGTATCTTG
GAAATGGTGACATTCAGGCAGCTATTGACATGCTTGAACTTGAGTACTACACCACTCCAGACAAGCTTGA
ACCAATACTTGACTGGACAGCAGATGATATCGGATGGCTTATCTACACAAGCAATTACAAGAACCAGACA
CTAGACTCTGAAGCTAAGTACTGGGACCTAACTAAGATTGGTGCTGCTATTGGTATCTACGAGAGCTTCA
GAGTCTTTACAGCAGAAACCTGGGAGTTCTTCCCAGTCAACAAGAGAATTAAATTCAGAGTTATGGATCC
AGCAGTTGGTCTAGGAAACAGCATCGTTATGAAGAGCGCCTACCTTGCTGAGGCTCCAGAGACACCAACC
CAGACTGAGACTACTACCACCCAGACCACTACAACTCAAACAACCACCACAACCCCATCACCAACCCAGA
CTCAACCGACTACTACACAATCTCCAACTGAGACTGGAGGAATCTGTGGACCAGCGATACTTGTTGGTCT
CGCAGTAGTTCCACTCCTCCTGAGAAGGTTTAAGAAGTAG

Intuitively, I interpret this as the sequence of the gene that encodes the hypothetical protein.

Is this interpretation completely wrong?

Maybe my uncertainties could be partly resolved by studying in more detail how the experiment is performed, but I am really curious to know whether this interpretation is correct before spending hours and hours studying DNA sequencing experiments.

Thank you in advance.

Sequencing DNA • 2.5k views

ADD COMMENT • link updated 2.9 years ago by lieven.sterck 15k • written 2.9 years ago by Student ▴ 30

lieven.sterck · Accepted Answer · 2021-05-21

2.9 years ago

lieven.sterck 15k

FASTA is a general format for holding sequence data. Anything that is a sequence (nucleotide or protein) can be stored in this format: reads from sequencing, assembled reads, protein sequences, peptides, and so on.
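To make the format concrete, here is a minimal sketch of a FASTA reader in Python. The names `read_fasta` and `parse_bracket_fields` are made up for this illustration, the sequence is only the first 30 bases of the record from the question, and the second record is invented. For real work an established library such as Biopython is the usual choice.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped; join them back together later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_bracket_fields(header: str) -> Dict[str, str]:
    """Collect the [key=value] annotations found in NCBI CDS headers."""
    return dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", header))


example = (
    ">lcl|CP003685.1_cds_AFN02977.1_1 [locus_tag=PFC_00005] "
    "[protein=hypothetical protein] [protein_id=AFN02977.1] "
    "[location=43..2532] [gbkey=CDS]\n"
    "ATGAGGAAAAAACTT\n"
    "GTTGGAATATTGACA\n"
    ">toy_protein made-up second record\n"
    "MRKKLVGILT\n"
)

records = list(read_fasta(example.splitlines()))
header, sequence = records[0]
fields = parse_bracket_fields(header)
print(fields["protein"], fields["location"], len(sequence))
```

The reader does not care whether a record holds nucleotides or amino acids, which is exactly why the same format can carry reads, assemblies, CDS, and proteins.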

If you are talking about reads from a sequencing run, then yes, those are random pieces of the genome. If, as in the example you gave, it is a CDS, then it is not random.

What you downloaded from NCBI seems to be a CDS (coding sequence; biologically, the part of an mRNA molecule that will be translated into a protein). So yes, your interpretation is correct here.
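As a rough illustration of what "translated into a protein" means for a CDS, the sketch below converts codons to amino acids using the standard genetic code. The function name `translate` is hypothetical, and only the first ten and last three codons of the sequence in the question are used. Treat it as a toy: it assumes the sequence is already in the correct reading frame.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS in frame, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First ten codons of the CDS from the question, spaced for readability.
cds_start = "ATG AGG AAA AAA CTT GTT GGA ATA TTG ACA".replace(" ", "")
# Last three codons of the same CDS; TAG is a stop codon.
cds_end = "AAG AAG TAG".replace(" ", "")

print(translate(cds_start))
print(translate(cds_end))
```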

ADD COMMENT • link 2.9 years ago by lieven.sterck 15k

Thank you very much, lieven.sterck! A small question about the biological meaning: in a few words, is the data that I reported the sequence of the region of DNA that determines the sequence of amino acids in the protein? Is the information in this DNA sequence transferred to the messenger RNA by a process called transcription, and then (after some further steps) we have the protein? My biological background is a bit poor...

ADD REPLY • link 2.9 years ago by Student ▴ 30

That is correct. When one sequences the entire genome, though, there is no guarantee that a single piece of randomly selected DNA represents a coding sequence. Coding sequences are those that end up being translated into proteins. Most genomes from organisms more complex than microbes and viruses tend to have large stretches of DNA that are never going to be translated; those are called non-coding sequences. The human genome, for example, is made up primarily of DNA that is not going to be used for proteins (only about 1-2% of the genome is protein-coding sequence).

ADD REPLY • link 2.9 years ago by Friederike 8.9k

data that I reported is the sequence of the region of DNA that determines the sequence of amino acids in the protein

Potentially, yes. The reason it has been tagged as a hypothetical protein is that, while there is logical support for it potentially coding for a protein (e.g. it starts with ATG, has no stop codons until the end of the sequence, etc.), there is likely no experimental evidence (yet) that such a protein exists. This sequence also may not match any other sequences in the databases that are known to code for a protein.
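The "logical support" mentioned above can be sketched as a simple check: starts with ATG, has a length that is a multiple of three, ends with a stop codon, and has no stop codon in frame before the end. The function name `looks_like_complete_cds` and the short test sequences are invented for illustration. Real gene predictors use much more evidence, and some prokaryotic genes use alternative start codons such as GTG or TTG, which this sketch deliberately ignores.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_complete_cds(seq: str) -> bool:
    """Check the basic open-reading-frame properties of a candidate CDS."""
    seq = seq.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    starts_with_atg = codons[0] == "ATG"
    ends_with_stop = codons[-1] in STOP_CODONS
    has_internal_stop = any(c in STOP_CODONS for c in codons[1:-1])
    return starts_with_atg and ends_with_stop and not has_internal_stop


print(looks_like_complete_cds("ATGAGGAAATAG"))  # start, two codons, stop
print(looks_like_complete_cds("ATGTAGAAATAG"))  # stop codon in the middle
print(looks_like_complete_cds("AGGAAATAG"))     # no ATG at the start
```

Passing this check only says a sequence could encode a protein, which is why the annotation stays "hypothetical" until other evidence is available.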

ADD REPLY • link 2.9 years ago by GenoMax 141k

Thank you so much, all! But a new question now arises: how is a CDS obtained? I mean: we do the DNA sequencing experiment, which gives a file with reads that are random pieces of the genome (as lieven.sterck said). But how do we get from this output file to the CDS, where we know that each sequence represents the sequence of a gene? It is as if it had been ordered... Is it done by aligning the reads to the reference genome?

ADD REPLY • link 2.9 years ago by Student ▴ 30

Well, there are a number of processes/analyses you have to go through. The first is assembly, where the goal is to paste all these random pieces of DNA (reads) back together to form the biological molecule they were derived from. Ideally you will get back the chromosomes of the organism (but in most cases you will not, however). Once you have the assembled sequence you need to do annotation, where the goal is to determine the location and structure of the genes. From those genes you can then extract the CDS sequences.

This is the explanation in a very small nutshell. The two processes mentioned here usually take months (to years) of computational work, and they inherently pose many problems (or things to sort out) along the way.
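To give a feel for the assembly step, here is a toy sketch that repeatedly merges the two reads with the longest suffix-prefix overlap. The 24-base "genome", the reads cut from it, and the function names `overlap` and `greedy_assemble` are all invented for illustration. Real assemblers have to cope with sequencing errors, repeats, both strands, and millions of reads, so this is not how production tools work.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


# A made-up toy genome and three overlapping "reads" in scrambled order.
toy_genome = "ATGGCGTACCTGAAGTTCGATTAG"
toy_reads = [toy_genome[12:], toy_genome[:10], toy_genome[6:16]]

print(greedy_assemble(toy_reads))
```

Annotation would then run on the assembled contigs, for example by looking for open reading frames and comparing them with known genes.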

ADD REPLY • link 2.9 years ago by lieven.sterck 15k

An important part of the information is also obtained by sequencing the RNA (as cDNA).

ADD REPLY • link updated 2.9 years ago by lieven.sterck 15k • written 2.9 years ago by WouterDeCoster 47k

OK, thank you so much, lieven.sterck. One last question: is alignment, instead, done between the DNA sequences of genomes (to study something about the genomes under consideration), or does it refer to aligning reads to the reference genome in order to do the annotation? Or is it a term that is used for both things, even though they are different procedures?

ADD REPLY • link 2.9 years ago by Student ▴ 30

Alignment is a very general term and comes in very different forms and approaches. In its most general interpretation, it denotes pairing two sequences. As such, you can refer to alignment between reads and a genome (though this is more often referred to as 'mapping' nowadays), but also between genomes, between proteins and genomes, ...
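As a minimal sketch of "pairing two sequences", the code below slides a read along a reference and reports the position with the fewest mismatches. The function name `best_ungapped_hit`, the toy reference, and the reads are invented for illustration. Real aligners and mappers also allow gaps (insertions and deletions) and use indexes rather than scanning every position.

```python
from typing import Tuple


def best_ungapped_hit(read: str, reference: str) -> Tuple[int, int]:
    """Return (position, mismatches) of the best ungapped placement."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


toy_reference = "ATGGCGTACCTGAAGTTCGATTAG"
exact_read = toy_reference[8:16]                       # copied from the reference
noisy_read = exact_read[:4] + "T" + exact_read[5:]     # one simulated error

print(best_ungapped_hit(exact_read, toy_reference))
print(best_ungapped_hit(noisy_read, toy_reference))
```

The same idea of scoring how well two sequences pair up applies whether the inputs are a read and a genome, two genomes, or (with a different scoring scheme) a protein and a genome.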

ADD REPLY • link 2.9 years ago by lieven.sterck 15k

I applaud your enthusiasm, but I think you would benefit a lot from taking a step back, learning some basics (e.g. from a textbook), and then starting to explore sequencing data.