Question: Can one get the entire genome by putting all contigs together in this particular case?
John Smith (United States) wrote 6.0 years ago:

I am still new to bioinformatics and have not yet fully understood the definition of a contig. From the explanations I have read, my understanding is that contigs are fragments of the genome for which we are confident that the order of the bases is correct. We then build scaffolds out of the contigs, and the goal is to end up with one scaffold that represents the entire genome.

Right now, I am trying to obtain the full reference genome of Streptococcus pneumoniae BR1064 in FASTA format. I found it at ENA, where the box in the top right, under "Send Feedback", says "Genome Representation: full". From there, one can follow the link to the assembly's contigs, of which there are 245. Can I just put all these contigs together to obtain the full genome of the organism? If so, is there a particular way to do it? Should they simply be joined in increasing numerical order?

Philipp Bayer wrote 6.0 years ago:

You're right that contigs are just fragments of the genome, and that scaffolding is the next step in assembly. Often, this is done using genetic maps and SNPs (or other markers), so that the contigs can be anchored along the genetic map. Here's one recent example, for the Brassica oleracea genome: "The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes".

Looking at the publication for your genome, it doesn't look like they performed scaffolding, maybe because there is nothing to use as a reference to scaffold against. Therefore, it is likely that the 245 contigs are just numbered in the order they fell out of the assembly program, an order which doesn't reflect the 'real' genome. In that case, I wouldn't concatenate them, since joining them end to end would create junctions that don't exist in the actual organism.

What do you want to do with the contigs? If it's just SNP calling or something like that, I would leave the sequences as contigs.


I am trying to align raw sequencer data to the genome of BR1064. However, there appears to be no reference genome, so I thought it might be possible to obtain one by putting all the contigs together.

Reply by John Smith, 6.0 years ago

What aligner are you using? Most of them (BWA, bowtie2) can handle several reference sequences, so if you just leave the 245 sequences in one big fasta file and align your reads against that file, you should be alright.
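As a rough illustration of what "one big fasta file" means here, the following Python sketch parses a multi-record FASTA and reports how many sequences it contains and their combined length. The contig names and sequences are made up, the function names are hypothetical, and treating the first whitespace-delimited token of each header as the sequence name is an assumption of this sketch.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_name: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a name")
            # Assumption: the name is the first token of the header line.
            name = tokens[0]
            if name in chunks:
                raise ValueError(f"duplicate sequence name: {name}")
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in chunks.items()}


# Made-up two-contig example; a real file would hold all 245 records.
example = """>contig_1 hypothetical description
ACGTAC
GTA
>contig_2
TTGACA
"""

contigs = read_fasta(example.splitlines())
total_bp = sum(len(seq) for seq in contigs.values())
print(f"{len(contigs)} sequences, {total_bp} bp in total")
```

Each record stays a separate reference sequence, so nothing needs to be concatenated or ordered before indexing.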

Of course, you might see some odd things, such as the first read of a pair aligning to contig_1 and the second read of the pair aligning to contig_2, but that tells you these two contigs are probably close to each other in the genome. That's why software like GapFiller exists!
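If you want to look for such pairs yourself, one option is to scan the SAM output for records whose mate lies on a different reference sequence. Here is a minimal sketch, assuming the standard SAM columns (RNAME in column 3, RNEXT in column 7) and the standard flag bits; the function name is hypothetical and the records are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def cross_contig_links(sam_lines: Iterable[str]) -> Counter:
    """Count read pairs whose two mates align to different contigs."""
    links: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        if not flag & 0x1:
            continue  # read is not part of a pair
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue  # read or mate unmapped, or secondary/supplementary record
        if rnext in ("=", "*") or rnext == rname:
            continue  # mate is on the same contig (or unknown)
        if flag & 0x40:
            # Count each pair once, using only the first mate's record.
            links[tuple(sorted((rname, rnext)))] += 1
    return links


def sam_record(qname: str, flag: int, rname: str, pos: int, rnext: str, pnext: int) -> str:
    """Build a made-up SAM line with placeholder MAPQ, CIGAR, SEQ and QUAL."""
    cols = [qname, str(flag), rname, str(pos), "42", "4M", rnext, str(pnext), "0", "ACGT", "IIII"]
    return "\t".join(cols)


example_sam = [
    "@SQ\tSN:contig_1\tLN:1000",
    sam_record("r1", 65, "contig_1", 990, "contig_2", 5),   # mates on different contigs
    sam_record("r1", 129, "contig_2", 5, "contig_1", 990),
    sam_record("r2", 99, "contig_1", 100, "=", 200),        # both mates on contig_1
    sam_record("r2", 147, "contig_1", 200, "=", 100),
]

links = cross_contig_links(example_sam)
for (a, b), n in links.items():
    print(f"{a} <-> {b}: {n} pair(s)")
```

A handful of such pairs can also come from repeats or misalignments, so a count like this is only a hint that two contigs are neighbours.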

Reply by Philipp Bayer, 6.0 years ago

I am aligning the reads with Bowtie 2. Would I need to specify a certain parameter for Bowtie 2 since the FASTA file will contain several sequences (contigs in this case)? Or will it automatically recognize that there are several sequences in the FASTA file and attempt to align the reads?

Reply by John Smith, 6.0 years ago

That's good! bowtie2-build doesn't take any special arguments for several sequences (as far as I remember), so just do:

bowtie2-build your_sequences.fasta the_name_you_want

and then use the_name_you_want as the index (the -x argument) in subsequent bowtie2 runs.
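To show how the two steps fit together, here is a small Python sketch that assembles the bowtie2-build command from above and a matching bowtie2 call for unpaired reads. The file names reads.fastq and aligned.sam are placeholders, the helper function is hypothetical, and the actual run is left commented out so that the sketch only prints the commands.

```python
import shlex
from typing import List, Tuple


def bowtie2_commands(fasta: str, index_name: str, reads: str, sam_out: str) -> Tuple[List[str], List[str]]:
    """Return the index-building and alignment commands as argument lists."""
    # Indexing needs no extra options for a FASTA with several sequences.
    build = ["bowtie2-build", fasta, index_name]
    # -x: index basename, -U: unpaired reads, -S: SAM output file.
    align = ["bowtie2", "-x", index_name, "-U", reads, "-S", sam_out]
    return build, align


build_cmd, align_cmd = bowtie2_commands(
    "your_sequences.fasta",   # multi-sequence FASTA with all contigs
    "the_name_you_want",      # index basename
    "reads.fastq",            # placeholder read file
    "aligned.sam",            # placeholder output file
)

for cmd in (build_cmd, align_cmd):
    print(shlex.join(cmd))
    # To actually run it (requires Bowtie 2 on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

For paired-end reads, the -U option would be replaced by -1 and -2 with the two mate files.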

Reply by Philipp Bayer, 6.0 years ago