Hi all,
I have a SAM file from paired end sequencing data, I wrote a short script (below) to filter some reads (reads that map to bacteria):
awk 'FNR==NR { a[$1]; next } !($3 in a)' contigs_bacteria.txt <(samtools view -h aligned_reads.sam) > aligned_reads_no_bacteria.sam
This works fine, however when I check count of unmapped reads in the original sam file (aligned_reads.sam) vs. the filtered one, the # of unmapped reads are different:
samtools view -c -f 4 aligned_reads.sam
316971
samtools view -c -f 4 aligned_reads_no_bacteria.sam
293621
Contigs_bacteria.txt is just a text file that looks like this, and I use these values to filter the sam file, as I know reads with these values come from bacteria.
201
908
925
881
895
921
889
913
875
922
923
915
878
910
I guess it is also filtering out some unmapped reads in the process. or maybe this is due to the fact that they are paired and if only one mate meets the criteria, that pair is lost. I do not want to lose the unmapped reads as they can potentially be reassembled. Any ideas/help is appreciated.
Thanks for your response, it helps a lot! Just a small follow-up, what if one read in a pair is mapped to one chromosome/sequence name, and the other read is mapped to a different chromosome for example...Would this ever happen? and in general should two pairs in a read always remain together when we filter these files? Thank you.