Question

Can one get the entire genome by putting all contigs together in this particular case?

1

11.0 years ago

John Smith ▴ 320

I am still new to bioinformatics and have not yet fully understood the definition of a contig. I have read a few explanations, and what I understand is that contigs are fragments of the genome for which we are certain that the order of the bases is correct. We then build scaffolds out of the contigs, and the goal is to get one scaffold that represents the entire genome.

Right now, I am trying to obtain the full reference genome of Streptococcus pneumoniae BR1064 in FASTA format. I found it at ENA, where the field in the top right, under "Send Feedback", reads "Genome Representation: full". From there, one can navigate to the assembly's contigs, of which there are 245. Can I just put all these contigs together and obtain the full genome of the organism? If so, is there a particular way to do it? Should they simply go in increasing numerical order?

sequence genome contig • 5.8k views

ADD COMMENT • link updated 11.0 years ago by Philipp Bayer 8.8k • written 11.0 years ago by John Smith ▴ 320

Ram · Accepted Answer · 2014-07-09

4

11.0 years ago

Philipp Bayer 8.8k

You're right that contigs are just fragments of the genome, and that scaffolding is the next step in assembly. Usually, this is done using genetic maps and SNPs (or other markers) so that the contigs can be anchored along that genetic map. Here's one recent example with the Brassica oleracea genome: The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes

Looking at the publication for your genome, it doesn't look like they performed scaffolding, maybe because there is nothing to use as a reference to scaffold against. Therefore, it is likely that the 245 contigs are just numbered by the order they fell out of the assembly program, an order which doesn't reflect the 'real' genome. In that case, I wouldn't concatenate them.

What do you want to do with the contigs? If it's just SNP calling or something like that, I would leave the sequences as contigs.

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 11.0 years ago by Philipp Bayer 8.8k

0

I am trying to align raw sequencer data to the genome of BR1064. It appears that there is no reference genome, so I thought it might be possible to obtain one by putting all the contigs together.

ADD REPLY • link 11.0 years ago by John Smith ▴ 320

1

What aligner are you using? Most of them (BWA, bowtie2) can handle several reference sequences, so if you just leave the 245 sequences in one big fasta file and align your reads against that file, you should be alright.
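Before building an index, it can be worth checking that the combined file really holds all 245 contigs. A minimal sketch, assuming the contigs are in a file called all_contigs.fasta (a hypothetical name) and that every record starts with a ">" header line:

```shell
# Count the records in a (multi-)FASTA file by counting header lines.
# Each sequence starts with one line beginning with ">".
count_fasta_records() {
  grep -c '^>' "$1"
}
```

Running `count_fasta_records all_contigs.fasta` should print 245 for this assembly; a smaller number would suggest that some contigs were lost when the file was put together.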

Of course, you might see some odd things, such as the first read of a pair aligning to contig_1 and the second read aligning to contig_2, but that tells you these two contigs are close by. That's why software like GapFiller exists!
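To see which contigs are linked by such split pairs, one option is to tally them straight from the SAM output. This is only a sketch: it relies on the standard SAM columns (2 = FLAG, 3 = RNAME, 7 = RNEXT), counts each pair once via the first-in-pair flag bit (0x40), and does not filter out secondary or low-quality alignments. The function name is made up for illustration.

```shell
# Tally contig pairs linked by read pairs whose mates map to different contigs.
# Reads SAM text on stdin and prints: contigA <TAB> contigB <TAB> pair count.
cross_contig_links() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines
    # keep first-in-pair records (flag bit 0x40) whose mate is on another contig
    int($2 / 64) % 2 == 1 && $3 != "*" && $7 != "=" && $7 != "*" && $7 != $3 {
      # order the two names as strings so A-B and B-A share one key
      key = (($3 "") < ($7 "")) ? $3 "\t" $7 : $7 "\t" $3
      links[key]++
    }
    END { for (k in links) print k "\t" links[k] }
  ' | sort
}
```

For example, `cross_contig_links < aligned.sam` (aligned.sam being a hypothetical file name) lists each linked contig pair with the number of read pairs supporting it. A pair of contigs supported by many read pairs is a better candidate for being adjacent than one supported by a single pair.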

ADD REPLY • link 11.0 years ago by Philipp Bayer 8.8k

0

I am aligning the reads with Bowtie 2. Would I need to specify a certain parameter for Bowtie 2 since the FASTA file will contain several sequences (contigs in this case)? Or will it automatically recognize that there are several sequences in the FASTA file and attempt to align the reads?

ADD REPLY • link 11.0 years ago by John Smith ▴ 320

2

That's good! bowtie2-build doesn't take any special arguments for several sequences (as far as I remember), so just do:

bowtie2-build your_sequences.fasta the_name_you_want

and then pass the_name_you_want as the index basename (the -x argument) in subsequent bowtie2 runs.
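Put together, the two steps might look like the sketch below. The function name and its four arguments are placeholders for illustration; -x (index basename), -U (unpaired reads) and -S (SAM output) are standard bowtie2 options. For paired-end data, the -U argument would be replaced by -1 and -2 with the two read files.

```shell
# Build a Bowtie 2 index from a multi-FASTA of contigs, then align reads to it.
# Arguments: contigs FASTA, index basename, unpaired FASTQ, output SAM.
align_to_contigs() {
  fasta=$1
  index=$2
  reads=$3
  out=$4
  # Index all sequences in the FASTA; no extra option is needed for many contigs.
  bowtie2-build "$fasta" "$index" &&
    # Align the reads against that index and write the alignments as SAM.
    bowtie2 -x "$index" -U "$reads" -S "$out"
}
```

A call such as `align_to_contigs your_sequences.fasta the_name_you_want reads.fastq aligned.sam` would then produce a SAM file in which each alignment names the contig it landed on.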

ADD REPLY • link updated 6.1 years ago by Ram 45k • written 11.0 years ago by Philipp Bayer 8.8k