Hello everybody,
I'm interesting for the mitochondrial assembly from several samples of sheep reads.
Based on the post "Tutorial: Reference Assembly - Mapping Reads To A Reference Genome" I have mapped my reads with the sheep reference by using bowtie2 and stampy.
I have generated the bam files, I extracted the reads that are mapped onto the mitochondrial by using samtools and after that again with samtools and the following command I generated the consensus (fq)
samtools mpileup \
-E \
-uf oviAri3.fa \
B2_ACAGTG_L001_PE_stampy.sorted.chrM.bam | \
bcftools view -cg - | \
vcfutils.pl vcf2fq > B2_ACAGTG_L001_PE_stampy_chrM_cons.fq
or
samtools mpileup \
-E \
-uf oviAri3.fa \
B2_ACAGTG_L001_PE_bwt.sorted.chrM.bam | \
bcftools view -cg - | \
vcfutils.pl vcf2fq > B2_ACAGTG_L001_PE_bwt_chrM_cons.fq
for bowtie output.
After that, by using the seqret from emboss package I converted the fq to fa.
Comparing the two fa files with the reference, Im seeing that there are some differences. Bowtie assembly has some regions with n's, stampy has only 10 n's at the beginning and a little bit more differences in some bases. Of course the most of the differences with the reference are the same.
How can I identify which assembly is better?
Is there any other step that I have to follow and I didn't do it?
Since I am really new in that field, I will appreciate for any help.
Thank you very much in advance,
Vasilis.
Thank you very much for your answer. It is very helpful. So, as it seems in order to produce more accurate assembly by using bowtie2 I believe that I have to use the
--end-to-end
parameter. By using that I will exclude all the reads that partially map. Am I right? Also I have used-X 1800
for the insert size of the pair end reads, and maybe that its too long for internal length. Stampy adjusts the intermediate length by itself and I believe that this is the reason of the absence of n's.Thank you very much for the tool (ARC) that you proposed me, it seems very promising. At the moment I'm trying to install biopython and the other components that it needs in order to test it.
Depending on your data, --end-to-end may be better. Just be aware that since your genome is circular you may experience reduced coverage on the ends, because you provide a linear reference. I'd try both and compare.
An insert size of 1800 bp seems a bit large, but perhaps this is correct for your data. Remember, the insert size distribution is the range of sizes you fragmented your library to when you generated it. Setting this value helps the mapper make decisions about the location of reads (e.g., if your average insert size is 300bp and you have a pair of reads that map 10,000bp apart in your reference, its probably not correct).
If you have any questions about ARC, please post on the usergroup and one of us will get back to you.
Thank you very much for your support! :)