I am still new to bioinformatics and I have not yet fully understood the definition of a contig. I have read a few explanations, and what I understand is that contigs are fragments of the genome for which we are certain that the order of the bases is correct. Then, we make scaffolds out of the contigs, and the goal is to get one scaffold to represent the entire genome.
Right now, I am trying to obtain the full reference genome of Streptococcus pneumoniae BR1064 in FASTA format. I found it at ENA, and in the top right category under "Send Feedback" it says "Genome Representation: full". From there, one can get to the assembly contigs, of which there are 245. Can I just put all these contigs together and obtain the full genome of the organism? If so, is there a particular way to do it? Should they just be in increasing numerical order?
I am trying to align raw sequencer data to the genome of BR1064. But it appears that there is no reference genome, so I thought that maybe it was possible to obtain one by putting all the contigs together.
What aligner are you using? Most of them (BWA, bowtie2) can handle several reference sequences, so if you just leave the 245 sequences as separate records in one big FASTA file and align your reads against that file, you should be alright.
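If the contigs happen to arrive as separate files rather than as one download, a minimal Python sketch along these lines could combine them into a single multi-FASTA file while keeping every contig as its own record (nothing is joined end to end). The function names and file names here are made up for illustration, and the returned record count is only a sanity check against the 245 contigs you expect.

```python
from pathlib import Path


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(paths, out_path):
    """Write every record from `paths` into one multi-FASTA file.

    Each contig stays a separate record. Returns the number of records written.
    """
    count = 0
    with open(out_path, "w") as out:
        for path in paths:
            for header, seq in read_fasta(Path(path).read_text()):
                out.write(f">{header}\n{seq}\n")
                count += 1
    return count
```

For example, merge_fasta(["contig_part1.fasta", "contig_part2.fasta"], "BR1064_contigs.fasta") (hypothetical file names) should return 245 if all contigs made it into the combined file. The order of the records does not matter to the aligner.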
Of course, you might see some weird things, like the first read of a pair aligning to contig_1 and the second read aligning to contig_2, but then you know that these two contigs are close to each other in the genome. That's why software like GapFiller exists!
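To get a feel for how often that happens, here is a rough Python sketch that counts read pairs whose two mates land on different contigs. It assumes a plain-text SAM file with the standard tab-separated columns (FLAG in column 2, RNAME in column 3, RNEXT in column 7); the function name is made up, and this is only an illustration of the idea, not what GapFiller itself does.

```python
from collections import Counter


def cross_contig_pairs(sam_lines):
    """Count read pairs whose mates map to two different reference sequences.

    Returns a Counter keyed by a sorted (contig_a, contig_b) tuple.
    """
    links = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        # Keep only paired reads where both mates are mapped, and skip
        # secondary (0x100) and supplementary (0x800) alignments.
        if not flag & 0x1 or flag & 0x4 or flag & 0x8 or flag & 0x900:
            continue
        # "=" means the mate is on the same reference; "*" means unknown.
        if rnext in ("=", "*") or rnext == rname:
            continue
        # Count each pair once, using only the first read of the pair (0x40).
        if flag & 0x40:
            links[tuple(sorted((rname, rnext)))] += 1
    return links
```

Calling it as cross_contig_pairs(open("aligned.sam")) (hypothetical file name) gives a tally per contig pair. A pair of contigs linked by many read pairs is a better hint that they are neighbours than a pair linked by one or two.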
I am aligning the reads with Bowtie 2. Would I need to specify a certain parameter for Bowtie 2 since the FASTA file will contain several sequences (contigs in this case)? Or will it automatically recognize that there are several sequences in the FASTA file and attempt to align the reads?
That's good!
bowtie2-build doesn't take any special arguments for several sequences (as far as I remember), so just run it on the multi-sequence FASTA file with an index name of your choice, and then use that name as the reference in subsequent bowtie2 runs.
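As a sketch of what those two steps could look like when driven from Python: the first command builds the index from the contig file, the second aligns against it. The file names and index name in the usage note are placeholders, and paired-end reads are assumed (hence -1 and -2; single-end reads would use -U instead).

```python
import subprocess


def bowtie2_commands(contigs_fasta, index_name, reads_1, reads_2, out_sam):
    """Return the index-building and alignment commands as argument lists."""
    # bowtie2-build <reference FASTA> <index base name>
    build = ["bowtie2-build", contigs_fasta, index_name]
    # bowtie2 -x <index base name> -1 <mates 1> -2 <mates 2> -S <output SAM>
    align = ["bowtie2", "-x", index_name,
             "-1", reads_1, "-2", reads_2, "-S", out_sam]
    return build, align


def run(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

With bowtie2 installed, run(bowtie2_commands("BR1064_contigs.fasta", "BR1064", "reads_1.fastq", "reads_2.fastq", "aligned.sam")) would build the index and then align the reads. All of those names are hypothetical.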