Number of unmapped reads after filtering issue
1
1
Entering edit mode
4.1 years ago
max_19 ▴ 170

Hi all,

I have a SAM file from paired end sequencing data, I wrote a short script (below) to filter some reads (reads that map to bacteria):

awk 'FNR==NR { a[$1]; next } !($3 in a)' contigs_bacteria.txt <(samtools view -h aligned_reads.sam) > aligned_reads_no_bacteria.sam


This works fine, however when I check count of unmapped reads in the original sam file (aligned_reads.sam) vs. the filtered one, the # of unmapped reads are different:

samtools view -c -f 4 aligned_reads.sam
316971

samtools view -c -f 4 aligned_reads_no_bacteria.sam
293621


Contigs_bacteria.txt is just a text file that looks like this, and I use these values to filter the sam file, as I know reads with these values come from bacteria.

201
908
925
881
895
921
889
913
875
922
923
915
878
910


I guess it is also filtering out some unmapped reads in the process. or maybe this is due to the fact that they are paired and if only one mate meets the criteria, that pair is lost. I do not want to lose the unmapped reads as they can potentially be reassembled. Any ideas/help is appreciated.

sequencing sam samtools • 1.1k views
4
Entering edit mode
4.1 years ago

Hello max_19 ,

if one read of a pair can be mapped and aligned but the other one not, than the one that can not be mapped is assigned to the same chromosome/sequence name like the one that can be mapped, but its flagged as unmapped. So if you exclude chromosomes/sequence names it is normal, that the number of unmapped reads decreases.

fin swimmer

0
Entering edit mode

Thanks for your response, it helps a lot! Just a small follow-up, what if one read in a pair is mapped to one chromosome/sequence name, and the other read is mapped to a different chromosome for example...Would this ever happen? and in general should two pairs in a read always remain together when we filter these files? Thank you.