Theoretical question regarding reads and assemblies - sources of bias
5.1 years ago
yairgatt ▴ 10

Hello all, I have come across an interesting issue on which I would love to hear other people's comments. I assembled Illumina short paired-end reads from bacterial WGS data into an assembly using SPAdes. I then broke that assembly up into fictitious reads (taking a 75-nucleotide read starting at every 5th bp of the assembly, so that each position is covered roughly 15 times). I took my real reads and my fictitious reads and mapped them both to a reference genome for that species. I got quite different results, with 70% mapping percentage in the real reads and 90% mapping percentage in the fictitious reads. When I called variants against the reference genome (using breseq), there were about 200 differences between the variant calls produced from the real reads and those produced from the fictitious reads.
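For concreteness, the tiling step described above could be sketched roughly as follows. This is only an illustration of the idea, not the exact script used: the function name `tile_contig` and the toy contig are hypothetical, and only the 75 nt read length and 5 bp step come from the description.

```python
def tile_contig(sequence, read_length=75, step=5):
    """Cut fixed-length reads from one contig with a sliding window.

    Returns a list of (start, read) pairs, with 0-based start positions.
    Contigs shorter than read_length yield no reads at all.
    """
    reads = []
    # Stop at the last window that still fits entirely inside the contig.
    for start in range(0, len(sequence) - read_length + 1, step):
        reads.append((start, sequence[start:start + read_length]))
    return reads


# Hypothetical 200 bp contig, used only to show the behaviour.
contig = "ACGT" * 50
reads = tile_contig(contig)
print(len(reads), "reads of length", len(reads[0][1]))
```

Note that with this kind of tiling, positions near contig ends are covered by fewer than 15 reads, and nothing outside the assembled contigs is represented at all.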

What do you think the sources of those differences might be?

Many thanks for any comments, Yair

Assembly SNP sequencing • 730 views

I got quite different results, with 70% mapping percentage in the real reads and 90% mapping percentage in the fictitious reads.

Is it possible that many of the real reads were simply never used in the initial assembly step? That could explain why the alignment percentages differ.
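One way to check this might be to map the real reads back to the assembly itself and see what fraction aligns. As a minimal sketch, assuming the aligner output is available as SAM text, the mapped fraction can be computed from the FLAG column (0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary). The function name `mapping_rate` and the toy records are hypothetical.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Ignore secondary and supplementary records so that each
        # read is counted only once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy records: two mapped reads, one unmapped read, one secondary hit.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tcontig1\t10\t60\t75M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tcontig1\t40\t60\t75M\t*\t0\t0\t*\t*",
    "r3\t256\tcontig2\t5\t0\t75M\t*\t0\t0\t*\t*",
]
rate = mapping_rate(toy_sam)
print(f"{rate:.1%} of reads mapped")
```

Running the same calculation on the real reads against the assembly and against the reference would show whether the unmapped reads are also the ones missing from the assembly.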

For the variant calling part, I suspect that read depth (broadly speaking, 'depth of coverage') could be an issue. The sensitivity and specificity of variant calling are affected by the read depth at each position.
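To illustrate the depth point, here is a rough sketch of the per-position depth that the tiling scheme from the question (75 nt reads starting every 5 bp) would give on a single contig. The 200 bp contig length and the function name `depth_per_position` are assumptions made only for the example.

```python
def depth_per_position(contig_length, read_starts, read_length=75):
    """Count how many reads cover each 0-based position of a contig."""
    depth = [0] * contig_length
    for start in read_starts:
        # Clip the read at the contig end, just in case.
        for pos in range(start, min(start + read_length, contig_length)):
            depth[pos] += 1
    return depth


contig_length = 200  # hypothetical contig length
starts = range(0, contig_length - 75 + 1, 5)
depth = depth_per_position(contig_length, starts)
print("first base:", depth[0], "middle:", depth[100], "last base:", depth[-1])
```

The tiled reads give a flat depth of 15 in contig interiors and taper to 1 at contig ends, whereas real read depth typically varies from position to position, so a variant caller may behave differently on the two data sets even where the underlying sequence agrees.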


Thank you for your reply! I also suspect that many reads were not assembled during the initial assembly phase (perhaps lower-quality reads that also do not map to the reference). Regarding the variant calling, I will read up on it. I had assumed, though, that read depth would not have a major effect when looking only at the major variant at each position (as I thought the major variant would be the one occurring in the assembly).


On the variant calling, do you have BAM files that you could, for example, load into IGV (Integrative Genomics Viewer)? Then you could see what may be causing the discrepancy.
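To decide which positions to look at, it may help to first list the calls that differ between the two runs. A minimal sketch, assuming each call has already been reduced to a hypothetical `(contig, position, ref, alt)` tuple (this is not breseq's own output format), with a helper that builds a `contig:start-end` locus string to paste into the IGV locus box:

```python
def compare_calls(real_calls, tiled_calls):
    """Split two call sets into shared and run-specific variants."""
    real, tiled = set(real_calls), set(tiled_calls)
    return {
        "shared": sorted(real & tiled),
        "real_only": sorted(real - tiled),
        "tiled_only": sorted(tiled - real),
    }


def igv_locus(contig, position, flank=50):
    """Build a 'contig:start-end' window around a 1-based position."""
    return f"{contig}:{max(1, position - flank)}-{position + flank}"


# Toy calls, invented purely for illustration.
real_calls = [("chr", 120, "A", "G"), ("chr", 900, "C", "T")]
tiled_calls = [("chr", 120, "A", "G"), ("chr", 30, "G", "A")]

result = compare_calls(real_calls, tiled_calls)
for contig, pos, ref, alt in result["real_only"] + result["tiled_only"]:
    print(igv_locus(contig, pos), ref, ">", alt)
```

Loading both BAM files side by side at those loci should make it clearer whether a discordant call comes from low depth, a misassembly, or reads that never made it into the assembly.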
