Question: What is the format of *.contigs.fasta files?
3.5 years ago by
Na Sed290
United States
Na Sed290 wrote:

Hi everyone,

I was given a file named AA.contigs.fasta. Its first lines look like this:

>tig00000000 len=1940327 reads=4609 covStat=3434.17 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
ATCTGCTTCATCCGCATCGAATCACGGGCACTCAGATGATCTCTAGGGCACGACCTAAACCCACCTGACGCGCCATACGAGATGCACCTCCGCCACAAGG
GAAGATGCCCATACCCACTTCCATCTGCATGAATTTGTATTTACCGCGAGCGGCAAAACGCATGTCACAGGCCAAGCAAATTCATGACCGCCACCACGGG
ACAAGCCTTCTAGCTTCGCAATTGTAGCTTGTGGTAGCTTGCTGATTCTTTCGAGAACCGCTTGAAGATCGAGTAACTTCGCTTCCTCGCGAGAAACGGC
CTCTGTCGACATCTCTTTAAGCAATTCGGTATCGTAATGACAAACCCAAATCTCCGGGTTGGCTGATTGGAATACAACCACTTTGACACTACGATCACGT
TCTAATCGCAGTGCTAACCCATTCAAATCCGCCAACATTTCCTGCCCTTGCACGTTTACCGTACCAAAATCAAACGTGACATAAAGAATTGCGTCTTCTT
GCTTTGCAGTAAACGTTTTGTAGCCTTCGTAAGCCATATCCATTTCCTTTTTCCAATAAAATCACTAGGTTGCTATTTTTCAAAGCAACGCAATTAACGT
TACGCCTCTAAAAAACATCAAACAATGACGCATAAAAAGAAACAGTATCTACGAAAACTAAAAGGTGATTTCCTCAATAACGGCTAGCAACAAATCACGT

1- Could you please tell me about the format of the file? For example, what does each row represent? How is such a file obtained? What does the information in the first line mean?

2- Given this file, how can I calculate the total genomic length of the assembly?

3- Do you know of any reference on this material? I am completely unfamiliar with this topic and want to learn.

Thank you.

contig next-gen fasta • 1.8k views
modified 3.5 years ago by genomax80k • written 3.5 years ago by Na Sed290

Wikipedia is often a good place to start:

https://en.wikipedia.org/wiki/FASTA_format

written 3.5 years ago by Brian Bushnell17k

What is the role of 'contigs' in the file name? Also, I have only one file per genome, and each file has about 60,000 lines. All lines except the first contain only A, C, G, and T.

written 3.5 years ago by Na Sed290

Did you check the FASTA format Wikipedia link provided by @Brian above?

  1. How this file was obtained is hard to say, but if "contigs" in the file name means what it should, then the file was likely produced by a sequence assembly program.
  2. 1940327 is the length of the piece you posted above.
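
The `len=1940327` field can also be read out of the header line programmatically. Here is a minimal Python sketch, assuming the header is a sequence ID followed by whitespace-separated key=value pairs, as in the example posted in the question. `parse_header` is a hypothetical helper name, not part of any library, and the meaning of the other fields is not something this sketch tries to interpret.

```python
def parse_header(line):
    """Split a FASTA header into (sequence ID, dict of key=value fields)."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    parts = line[1:].split()
    # The first token is taken as the ID; this layout is an assumption
    # based on the single header shown in the question.
    seq_id = parts[0] if parts else ""
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return seq_id, fields


header = (">tig00000000 len=1940327 reads=4609 covStat=3434.17 "
          "gappedBases=no class=contig suggestRepeat=no suggestCircular=no")
seq_id, fields = parse_header(header)
print(seq_id, int(fields["len"]))
```

All values come back as strings, so numeric fields such as `len` need an explicit `int(...)` conversion.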

The number of lines (rows) has no special meaning. The DNA sequence of each contig is one continuous string that has most likely been split across multiple lines (the "rows" you are referring to) for ease of display.
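
Because each sequence is just one string wrapped over several lines, the total assembly length is the sum of the lengths of all non-header lines. A rough Python sketch of that idea follows; the function name `contig_lengths` and the tiny in-memory example with IDs `tigA` and `tigB` are made up for illustration, and it assumes every contig ID in the file is unique.

```python
import io


def contig_lengths(lines):
    """Return {contig ID: sequence length} for an iterable of FASTA lines."""
    lengths = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # New record: the ID is the first token after '>'.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            lengths[current] = 0  # assumes IDs are unique
        elif current is not None:
            # Sequence line: add its length to the current contig.
            lengths[current] += len(line)
    return lengths


# Hypothetical two-contig file, wrapped at 5 bases per line.
demo = io.StringIO(">tigA len=10\nATCTG\nCTTCA\n>tigB len=4\nGGCA\n")
lengths = contig_lengths(demo)
total = sum(lengths.values())
print(lengths, total)
```

For a real file, the same function can be given an open file handle, for example `with open("AA.contigs.fasta") as fh: lengths = contig_lengths(fh)`. The per-contig numbers should match the `len=` values in the headers if those values are what they appear to be.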

modified 3.5 years ago • written 3.5 years ago by genomax80k

The one-line description of the file says that it is a de novo assembled genome. In that case, what is the number of contigs? Is it equal to the number of rows?

ADD REPLYlink written 3.5 years ago by Na Sed290

The number of contigs is the number of headers. Each starts with a '>' symbol.
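
Counting the headers takes only a few lines of Python. This is a sketch only; `count_contigs` is a hypothetical name and the example lines are invented.

```python
def count_contigs(lines):
    """Count FASTA records: one per line that starts with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Hypothetical example with two records.
example = [">tigA len=10\n", "ATCTG\n", "CTTCA\n", ">tigB len=4\n", "GGCA\n"]
n_contigs = count_contigs(example)
print(n_contigs)
```

With a real file, pass the open file handle instead of the list, for example `count_contigs(open("AA.contigs.fasta"))`.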

written 3.5 years ago by Brian Bushnell17k

@Brian There is no '>' character in the file. Please see one of the files via this Dropbox link: https://www.dropbox.com/s/2f1scrsgk0n2p4c/lbc26_ABCD.contigs.fasta?dl=0

written 3.5 years ago by Na Sed290

If there is no '>' character, it is not a FASTA file. Your example in the first post starts with '>'.
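
A quick way to check this without opening the file in an editor is to look at the first non-blank line. The Python sketch below does only that; `looks_like_fasta` is a hypothetical helper, and passing this test does not prove the rest of the file is well formed.

```python
def looks_like_fasta(lines):
    """Return True if the first non-blank line starts with '>'."""
    for line in lines:
        if line.strip():
            return line.startswith(">")
    return False  # empty input: nothing to call FASTA


with_header = looks_like_fasta([">tig00000000 len=1940327\n", "ATCTGCTTCA\n"])
without_header = looks_like_fasta(["ATCTGCTTCA\n", "GAAGATGCCC\n"])
print(with_header, without_header)
```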

modified 3.5 years ago • written 3.5 years ago by Brian Bushnell17k