We are sequencing a bacterial genome with a Gridion machine from ONT. As already expected, the error rate was quite high and I noticed lots of insertions and deletions compared to the reference genome.
Although I reckon the sequencing and the mapping to the reference genome both went well, I was wondering if 'polishing' the fastq files could improve the mapping stats. For instance, I checked in this post that the quality of the first 40-50 nucleotides in the reads tend to be low. Also I wanted to evaluate if/ to which extent selecting reads of of higher quality (e.g. 12 (phred scale), which is actually the median of reads quality) would enhance alignment.
I am now wondering how could I evaluate the mapping of the reads after polishing/ filtering the fastq files described above. I initially checked the alignment with tablet, but I am more after a quantitative (other than visual) assessment.
So far I have used
samtools flagstat which gave me:
7014141 + 0 in total (QC-passed reads + QC-failed reads) 781294 + 0 secondary 409167 + 0 supplementary 0 + 0 duplicates 6836429 + 0 mapped (97.47% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
I guess the percentage of mapped reads (97.47%) can be useful, but it is already a high mapping rate and I am not really expecting that trimming the first nucleotides will increase the mapping rate (please correct me in case I am interpreting the
samtools flagstat output wrongly).
I also found a tool called Qualimap, though it appears to be computationally very expensive and the command line tool does not appear to work on linux.
Has anyone already carried out this analysis i.e. how can one assess the improvement in the mapping to the reference genome after polishing the fastq files?
Sequence alignment was done with
minimap2 and indexing, sorting with