Question

Remove specific mapped reads from SAM file

0

Entering edit mode

10.5 years ago

Jennifer • 0

Hi everyone,

I did the alignment of RNA seq reads to a number of reference genomes at once, which include the human genome and some bacterial genomes, and now I have one SAM file of all the mapped reads. What I need to do next is to count the number of reads that mapped to only bacterial genomes (regardless of them being multimapped reads or unique mapped reads) and discard all the reads that mapped to both bacterial genomes and human genome. I am not sure if there's any tool or any other ways of alignment to do it, but what I have right now is a bash script that takes a long time because my SAM file is ~ 66gb.

What I did was to extract QNAME(column 1) and SEQ(column 10) that contain "chr" (for human genomes) and lines that contain "NC_" (for bacterial genomes) into separate files:

QnameSeqMappedToHg19.txt
Bacteria_mapped.sam

cat QnameSeqMappedToHg19.txt|while read Qname Seq
do
    cat Bacteria_mapped.sam|while read ID Flag RName Pos MapQ Cigar MRNM MPos ISize SeqB Qual Tags
    do
        if [ "${ID}" == "${Qname}" ]
        then
            if [ "${SeqB}" == "${Seq}" ]
            then
                sed -i '/${SeqV}/d' Bacteria_mapped.sam
            fi
        fi
    done
done

I have to include sequence and qnames because the reads are paired-end, so I might have two same qnames but different sequences that mapped to the genomes. I am trying to remove all the lines in my bacteria sam files that contain the same qnames and sequence mapped to the human genome. Is there any other way that I can do it?

Thank you,
JenniferT

alignment • 3.3k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.5 years ago by Jennifer • 0

Ram · Answer 1 · 2015-05-15

Hi Jennifer,

I suggest that the best way to accomplish this is to use BBSplit. This allows you to map to multiple references simultaneously with multiple output files, one per reference; reads that map best to a given reference will go to that output file. You can control what happens to reads that map equally well to multiple references with the ambig2 flag. For example, concatenate all your bacterial references into bacteria.fa, and then run like this:

bbsplit.sh in=reads.fq ref=bacteria.fa,hg19.fa basename=out_%.fq ambig2=toss

You can change the extension of the output to .sam for sam output if you want. This command also will only map reads with around 95% identity; if you want low-identity reads then add the flag minid=0.75.