Mitochondrial DNA assembly comparison
9.4 years ago
vasilislenis ▴ 150

Hello everybody,

I'm interested in assembling the mitochondrial genome from reads of several sheep samples.

Following the post "Tutorial: Reference Assembly - Mapping Reads To A Reference Genome", I mapped my reads to the sheep reference using Bowtie2 and Stampy.

I generated the BAM files, extracted the reads that mapped to the mitochondrial genome using samtools (sketched below, after the pipelines), and then, again with samtools, generated the consensus (FASTQ) with the following command:

# -E: use extended BAQ computation (fewer false SNPs around indels)
# -u: output uncompressed BCF for piping; -f: faidx-indexed reference FASTA
samtools mpileup \
  -E \
  -uf oviAri3.fa \
  B2_ACAGTG_L001_PE_stampy.sorted.chrM.bam | \
  bcftools view -cg - | \
  vcfutils.pl vcf2fq > B2_ACAGTG_L001_PE_stampy_chrM_cons.fq
# bcftools view -c calls SNPs, -g calls genotypes (samtools/bcftools 0.1.x syntax);
# vcfutils.pl vcf2fq turns the calls into a consensus FASTQ

or

samtools mpileup \
  -E \
  -uf oviAri3.fa \
  B2_ACAGTG_L001_PE_bwt.sorted.chrM.bam | \
  bcftools view -cg - | \
  vcfutils.pl vcf2fq > B2_ACAGTG_L001_PE_bwt_chrM_cons.fq

for the Bowtie2 output.
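For completeness, the chrM extraction step can be sketched like this (a minimal sketch; it assumes the mitochondrial contig is named chrM in oviAri3 and that the unfiltered, coordinate-sorted BAM has the same name without the .chrM suffix):

# the BAM must be coordinate-sorted and indexed before a region can be extracted
samtools index B2_ACAGTG_L001_PE_stampy.sorted.bam
samtools view -b B2_ACAGTG_L001_PE_stampy.sorted.bam chrM \
  > B2_ACAGTG_L001_PE_stampy.sorted.chrM.bam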

After that, I converted the FASTQ to FASTA using seqret from the EMBOSS package.
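For reference, that conversion can look like this (a sketch; -sformat fastq-sanger assumes Sanger-scaled quality values in the consensus FASTQ):

seqret -sequence B2_ACAGTG_L001_PE_stampy_chrM_cons.fq -sformat fastq-sanger \
       -outseq B2_ACAGTG_L001_PE_stampy_chrM_cons.fa -osformat fasta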

Comparing the two FASTA files with the reference, I see some differences. The Bowtie2 consensus has several regions of Ns, while the Stampy consensus has only 10 Ns at the beginning but slightly more differences in some bases. Of course, most of the differences from the reference are shared between the two.
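One quick way to put numbers on this is to count the ambiguous bases in each consensus (a minimal shell sketch; vcf2fq also lowercases low-confidence bases, so both cases of n are counted):

# count N/n characters, skipping FASTA header lines
grep -v '^>' B2_ACAGTG_L001_PE_stampy_chrM_cons.fa | tr -cd 'Nn' | wc -c
grep -v '^>' B2_ACAGTG_L001_PE_bwt_chrM_cons.fa | tr -cd 'Nn' | wc -c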

How can I identify which assembly is better?

Is there any other step that I should have followed but didn't?

Since I am really new to this field, I would appreciate any help.

Thank you very much in advance,

Vasilis.

assembly mitochondria samtools consensus
9.4 years ago
Brice Sarver ★ 3.8k

You often see this with different mappers, and it really depends on the settings you used for the mapping itself. In other words, what did you do to generate the .bam? I assembled some mitochondrial genomes several years ago using a mapping/consensus approach in Bowtie2, much the same way you are doing, and I noticed that the 'quality' of the final result depends on whether or not reads were soft-clipped, i.e., whether you used --local or --end-to-end. Remember that you're basically stacking your reads up on your reference, and how many reads stack, and where, depends on your settings.
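For concreteness, the two modes differ by a single flag (a sketch; the index name and read files are placeholders):

# end-to-end (the default): every read base must align; no soft-clipping
bowtie2 --end-to-end -x oviAri3 -1 reads_R1.fq -2 reads_R2.fq -S end_to_end.sam
# local: read ends may be soft-clipped, which often rescues reads with messy tails
bowtie2 --local -x oviAri3 -1 reads_R1.fq -2 reads_R2.fq -S local.sam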

I'm going to give a shameless plug here and recommend Assembly by Reduced Complexity (ARC), a currently available project that I am involved with. It does a better job than any existing application I am aware of at iteratively assembling sequences, in parallel, starting from a (possibly very divergent) reference. It also does an amazing job with circular genomes like yours, requiring little or no trimming of overlapping circular ends after the fact. I used it to assemble ~60 chipmunk mitochondrial genomes starting from the squirrel (and it works with monotremes and marsupials, too!) with high accuracy and effectively no reference bias. In contrast, Bowtie2/samtools introduced Ns at the ends or internally wherever coverage dipped, just like you are seeing.

Another, less accurate option is to do a de novo assembly using, say, Newbler and BLAST the contigs against your reference. This is more computationally and informatically intensive.
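If you try that route, matching the contigs back to the reference could look like this (a sketch; contigs.fa is a placeholder for the de novo assembly output):

# build a nucleotide BLAST database from the reference
makeblastdb -in oviAri3.fa -dbtype nucl
# tabular report of where each contig hits the reference
blastn -query contigs.fa -db oviAri3.fa -outfmt 6 -out contigs_vs_ref.tsv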


Thank you very much for your answer; it is very helpful. So it seems that, to produce a more accurate assembly with Bowtie2, I should use the --end-to-end parameter, which will exclude reads that map only partially. Am I right? Also, I used -X 1800 for the insert size of the paired-end reads, and maybe that is too long. Stampy estimates the insert size by itself, and I believe that is why its consensus has no Ns.

Thank you very much for suggesting ARC; it seems very promising. At the moment I'm installing Biopython and the other components it needs so I can test it.


Depending on your data, --end-to-end may be better. Just be aware that since your genome is circular you may experience reduced coverage on the ends, because you provide a linear reference. I'd try both and compare.
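One way to check for end effects is to re-map against a rotated copy of the reference so the original junction sits mid-sequence (a sketch; the contig name chrM, the 16,616 bp length, and the 8,000 bp split point are illustrative assumptions):

samtools faidx oviAri3.fa   # index the reference so subregions can be extracted
# stitch the two halves back together in swapped order under a new header
(echo ">chrM_rotated"
 samtools faidx oviAri3.fa chrM:8001-16616 | grep -v '^>'
 samtools faidx oviAri3.fa chrM:1-8000 | grep -v '^>') > chrM_rotated.fa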

An insert size of 1800 bp seems a bit large, but perhaps this is correct for your data. Remember, the insert size distribution is the range of sizes you fragmented your library to when you generated it. Setting this value helps the mapper make decisions about the placement of reads (e.g., if your average insert size is 300 bp and you have a pair of reads that map 10,000 bp apart on your reference, it's probably not correct).
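If you want to sanity-check that value against your data, recent samtools versions can summarize the observed distribution from an existing BAM (a sketch):

# the SN section reports "insert size average" and "insert size standard deviation"
samtools stats B2_ACAGTG_L001_PE_bwt.sorted.chrM.bam | grep '^SN' | grep 'insert size'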

If you have any questions about ARC, please post on the usergroup and one of us will get back to you.


Thank you very much for your support! :)

