Question: SOAPaligner 2.21 - does it replace all Ns by Gs in reads?
6
gravatar for Philipp Bayer
5.1 years ago by
Philipp Bayer6.3k
Australia/Perth/UWA
Philipp Bayer6.3k wrote:

We've run into an interesting problem with SOAPaligner 2.21, it seems to replace all Ns in query reads by Gs, has anyone ever seen this?

Here's a simple test, first the reference:

>First
ATCGTCGATCGATCATCATCGATGCGAGCTAGCTAGCTACTAGCAT
ATCGTCGATCGATCATCATCGATGCTAGCTAGCTAGCTACTAGCAT
GTCAGTCGATGCATGCTAGCTGATCGATCGATGCTAGTCAGTCAGC
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTGACTAGCTAGCTAGT
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTGACTAGCTAGCTAGT

The first read:

@First/1
ATCGTCGATCGATCATCATCGATGNNAGCTAGCTAGCTACTAGCAT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

As you can see, two Ns (in bold). The second read:

@First/2
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTGACTAGCTAGCTAGT
+
##############################################

No Ns, just as a test.

Running soap:

soap -a first.fastq -b second.fastq -D reference.fasta.index -o paired.soap -2 unpaired.soap -u unmapped.soap -m 0 -x 1000

Total Pairs: 1 PE
Paired:      0 ( 0.00%) PE
Singled:     2 (100.00%) SE

And the output:

First/1 ATCGTCGATCGATCATCATCGATGGGAGCTAGCTAGCTACTAGCAT  BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  2       a       46      +       First   47      2       C->24G2 T->25G2 46M     24CT20
First/2 GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTGACTAGCTAGCTAGT  ##############################################  6       b       46      +       First   323     0       46M     46

 

As you can see, it aligns the read 'First/1' to the second line in the reference. There are two differences between the reference and the read in bold - here soap says that the reference's C differs from a G in the read, and a T differs from the read's G. But where do these Gs come from? The read has Ns there!

We've played around a bit with it and it seems to silently replace Ns with Gs everywhere, has anyone else seen this before? We're a bit concerned this might accidentally introduce G SNPs.

 

alignment soapaligner • 2.0k views
ADD COMMENTlink modified 5.0 years ago by orchid000 • written 5.1 years ago by Philipp Bayer6.3k
3

this is one of those things where you say to yourself: nah that can't be true,

but then it is bioinformatics where everything is possible and the points don't matter

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Istvan Albert ♦♦ 80k

Hahahah xD

ADD REPLYlink written 3.3 years ago by John12k

#SOAPdenovo, all readears please spread the word, people need to know about this.

 

ADD REPLYlink written 5.0 years ago by orchid000

If you carefully look the SOAPaligner usage, you will find out that the reason of this replacement is the -n option:

-n <int> filter low-quality reads containing >n Ns before alignment, [5]

So the default value for Ns is 5. You can set it to 0 or whatever you want.

As for the N to G replacement, I think that it could cause some problems, if you are carrying and using the transformed mapped read sequence in the next steps of your analysis. Otherwise, you shouldn't be concerned for the actual mapping of such reads , as the -n Ns are just ignored ( and replaced with Gs as you observed! ).

ADD REPLYlink written 3.7 years ago by mitsias0

Did you determine how widespread this is and if there was a fix? A little worried to see this now, 20 months later!

ADD REPLYlink written 3.3 years ago by Daniel3.7k

I've sent an email to them but since it's not really developed I never heard back. Most SNP calling pipelines should remove this SNP as the base quality to call it is usually the worst possible, but it doesn't hurt to check!

ADD REPLYlink written 3.3 years ago by Philipp Bayer6.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1659 users visited in the last hour