How To Handle Ns In The Middle Of Reads
9.2 years ago
kautilya ▴ 430

In my Illumina data, FastQC shows N's at positions 13, 14, and 15 of 101 bp reads. Cropping the first 15 bases with Trimmomatic solves the problem, but I lose a lot of data. If I retain the N's, what problems would they cause during alignment (BWA + Stampy) and variant calling (UnifiedGenotyper), and how can I handle those problems?

If anybody has faced a similar problem, how did you handle it?

Similar questions have been asked on other forums, but none has been answered.

I could not find a resource on how variant-calling programs handle N's. Do they ignore them, or treat them as variants with low confidence scores?

The following image shows the per-base N content from FastQC:

fastqc qc • 3.1k views

Shouldn't you first investigate why you got those weird Ns at these positions?


They are possibly due to machine read errors during sequencing, and they occur in only 1 of 3 runs. I am looking for a way to handle them without losing a lot of sequence data.

7.9 years ago
Gabriel R. ★ 2.9k

If you want to run BWA followed by GATK, I would use your reads as is. The N's likely have a base quality of 0, so GATK overlooks them. BWA will substitute random bases for them, but spurious alignments induced by those bases will be rare.
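To check whether the N's really carry a base quality of 0 (or another very low value) before relying on that, you could tally the quality scores of N calls per read position. The following is only a rough sketch: it assumes plain four-line FASTQ records and Phred+33 encoding, and the function name `n_quality_by_position` and the demo reads are made up for illustration.

```python
import io
from collections import Counter, defaultdict


def n_quality_by_position(handle, offset=33):
    """Tally Phred qualities of N calls per 1-based read position.

    Assumes four-line FASTQ records and Phred+33 quality encoding.
    """
    tallies = defaultdict(Counter)
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        for pos, (base, q) in enumerate(zip(seq, qual), start=1):
            if base in "Nn":
                tallies[pos][ord(q) - offset] += 1
    return {pos: dict(counts) for pos, counts in tallies.items()}


# Two made-up reads: N's at positions 5-7 with quality '!' (Phred 0),
# and one N at position 5 with quality '#' (Phred 2).
demo = io.StringIO(
    "@read1\nACGTNNNACGT\n+\nIIII!!!IIII\n"
    "@read2\nACGTNACGTAC\n+\nIIII#IIIIII\n"
)
result = n_quality_by_position(demo)
for pos in sorted(result):
    print(pos, result[pos])
```

For real data you would pass an open file handle instead of the demo string, for example `open("reads.fastq")` (a hypothetical file name), or `gzip.open(..., "rt")` for compressed input. If the N's at positions 13 to 15 all show quality 0 or 2, that supports leaving the reads untrimmed.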

The cause? If this is Illumina, sometimes the reagents do not make it to the flowcell for a cycle or two due to pump problems or air bubbles.

