What Does The Sequence NNNN... Mean In A SAM File?
10.9 years ago
gabrieledcjr ▴ 10

I am a programmer and I just got this position to help on a research project. However, I don't have a background in DNA sequencing, although I am reading and learning more about it. The only way to get this project running is to understand the data well.

This is part of a SAM file. What does NNN... mean? As I understand it, when a FASTQ file is aligned to a reference genome, the output is a BAM file, which can be converted to a SAM file. The goal is to extract the unmapped reads from the SAM file, convert them back to FASTQ, and align them against another reference genome. I'm on the first step, extracting the unmapped reads from the SAM file, but I've noticed the sequence below and I don't understand it. Is this something I can delete, or does it mean anything?

chr17_77390200_77390383_0:0:0_1:0:0_54817    117    chr17    77379231    0    *    =    77379231    0    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    22222222222222222222222222222222222222222222222222
10.9 years ago
skymningen ▴ 330

An N means that no base was assigned at that position in your sequence. It is the IUPAC ambiguity code for "any of the four bases". See the IUPAC ambiguity codes for the full list.

More importantly, there are tools that will do your task directly on the original BAM file, for example bamtools.

bamtools filter will do the trick for separating mapped from unmapped reads. If you want only the unmapped ones, use -isMapped false; otherwise, use -isMapped true.

Then there is also bedtools.

It includes bamToFastq, which will convert your filtered BAM file to FASTQ, including paired-end output if the original input was paired-end.
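To make the conversion step less of a black box, here is a minimal Python sketch of what turning one SAM record into a FASTQ record involves. It is not how bamToFastq is implemented; the function name `sam_line_to_fastq`, the `/1` and `/2` name suffixes, and the example record are all assumptions made for illustration. It also assumes real tab-separated SAM lines (the record pasted in the question shows spaces only because of how the forum renders it) and that SEQ and QUAL are stored in reference orientation whenever flag bit 0x10 is set.

```python
# Complement table for DNA; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(line):
    """Convert one tab-separated SAM alignment line to a 4-line FASTQ record."""
    fields = line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]

    if flag & 0x10:
        # Assumption: SEQ/QUAL are in reference orientation, so restore the
        # original read by reverse-complementing SEQ and reversing QUAL.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]

    # Assumption: label the mates with /1 and /2 suffixes.
    if flag & 0x40:
        name += "/1"
    elif flag & 0x80:
        name += "/2"

    return f"@{name}\n{seq}\n+\n{qual}\n"


# Hypothetical record: paired, unmapped, mate unmapped, first in pair (flag 77).
example = "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGTN\tIIIII"
print(sam_line_to_fastq(example), end="")
```

For real data the dedicated tools are the better choice, since they also handle BAM decompression and keeping the two mates of a pair in sync.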

Hope this helped!


Thanks a lot. I will try them. I was able to get the desired output, but the runtime is very slow and Picard seems to eat a lot of memory. I will try the tools you mentioned.

10.9 years ago
Irsan ★ 7.8k

In your example, the sequence of the read is NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN. That means that for each base position in this read, the sequencer was not able to determine whether the base was A, T, C or G, so this read is useless. Because there are parts of the reference genome where the sequence is not known yet, parts of the reference also consist of contiguous Ns, and a read like this can match those locations perfectly. I would say that reads full of Ns are of no use. You may want to consider quality filtering your reads before doing any analysis, with something like FastQC. Watch their video tutorial and you will be up to speed in no time.


Thanks for the insight. It's very helpful. Just wondering about quality filtering: does it also filter out other reads whose quality is not good, and how does it decide whether quality is good or not?


FastQC does not filter reads (I put that wrongly in my answer above); it only diagnoses quality issues in your reads. After running FastQC, you can decide for yourself whether and how to filter.
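On the "how is quality measured" part: each character in the quality string encodes a Phred score Q for one base, where the estimated error probability is 10^(-Q/10). The sketch below decodes a quality string under the assumption of the Phred+33 encoding that the SAM specification uses for QUAL. The function names are made up for illustration, and what counts as "good" is a choice you make, not something the format defines.

```python
def phred_scores(qual, offset=33):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in qual]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def mean_quality(qual, offset=33):
    """Average Phred score of a read; 0.0 for an empty quality string."""
    scores = phred_scores(qual, offset)
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


# The quality string from the question, assuming it is Phred+33 encoded.
qual = "2" * 50
print(mean_quality(qual))
print(error_probability(mean_quality(qual)))
```

A filter built on this would simply drop reads whose mean (or minimum) score falls below a threshold you pick. Older Illumina pipelines used different offsets, so check which encoding your data uses before trusting the decoded numbers.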

10.9 years ago
Ido Tamir 5.2k

I would make a note that the read has only Ns (i.e. the sequencer could not decide which base each position really is), but would not try to align it to a different genome. Depending on the aligner and settings, it might actually align somewhere (e.g. at the centromeric regions of the mouse genome, which are only Ns). The second read of this pair actually did align, so I would not count it as "unaligned".
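The information about the pair comes from the FLAG column (117 in the record from the question), which is a bit field. As a rough sketch, it can be decoded like this; the bit meanings follow the FLAG table in the SAM specification, while the short names and the `decode_flag` helper are just labels chosen for this example.

```python
# Bit values from the SAM specification; the names are shorthand for this sketch.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(117))
# ['paired', 'unmapped', 'reverse_strand', 'mate_reverse_strand', 'first_in_pair']
```

So this read is itself unmapped (0x4), but the mate-unmapped bit (0x8) is not set. That is also why the record still carries chr17 and a position: unmapped reads with a mapped mate are conventionally given the mate's coordinates.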

Simply taking everything that does not align and trying to align it somewhere else might not give you satisfactory answers. Look for low-quality reads, reads with a high proportion of Ns, and contaminants (adaptors, ...).
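Checking for reads dominated by Ns is straightforward to do yourself. A minimal sketch, where the helper names and the 10% cutoff are arbitrary choices for illustration rather than any recommended standard:

```python
def n_fraction(seq):
    """Fraction of positions in a read that are N (no base call)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def is_mostly_n(seq, max_n_fraction=0.1):
    """True if the read has more Ns than the chosen cutoff allows."""
    return n_fraction(seq) > max_n_fraction


# The read from the question consists of 50 Ns.
print(n_fraction("N" * 50))   # 1.0
print(is_mostly_n("N" * 50))  # True
```

Reads flagged this way can be set aside before re-alignment, so they do not end up "aligning" to stretches of N in the other reference.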


Thanks as well for your answer. I really like this forum. It's been very helpful.

10.9 years ago

The NNNNNNN here is the original sequence of the read. The sequencer sometimes can't determine with enough confidence which base was sequenced, and assigns the letter 'N' instead of A, T, C or G. In this case, the read was part of a pair in which the other mate was mapped by the aligner. Check the SAM format specification for more information: http://samtools.sourceforge.net/SAM1.pdf

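To extract all the unmapped reads from the BAM or SAM file, check the FLAG field: unmapped reads have bit 0x4 set. The mapping quality will also be zero for unaligned reads, but a mapping quality of zero on its own is not enough, because reads that map equally well to several locations can be given zero too.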
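For plain SAM text, one way to pick out unmapped records is to test bit 0x4 of the FLAG column, which the SAM specification defines as "segment unmapped". The sketch below also shows why mapping quality alone can over-select. The three records are invented for this example, and the helper names are not from any library.

```python
def is_unmapped(line):
    """True if the SAM record has the 'segment unmapped' bit (0x4) set."""
    flag = int(line.split("\t")[1])
    return bool(flag & 0x4)


def unmapped_records(lines):
    """Yield alignment lines for unmapped reads, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        if is_unmapped(line):
            yield line


# Hypothetical records: r1 is unmapped, r2 is mapped with MAPQ 0,
# r3 is mapped with MAPQ 60.
records = [
    "@HD\tVN:1.6",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

by_flag = [line.split("\t")[0] for line in unmapped_records(records)]
by_mapq = [
    line.split("\t")[0]
    for line in records
    if not line.startswith("@") and int(line.split("\t")[4]) == 0
]
print(by_flag)  # ['r1']
print(by_mapq)  # ['r1', 'r2']
```

Selecting on MAPQ 0 would pull in r2 even though it has an alignment, so the flag is the safer criterion.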


Thank you for taking the time to answer my question. It's been very helpful.
