Question: Remove specific mapped reads from SAM file
gravatar for Jennifer
3.9 years ago by
United States
Jennifer0 wrote:

Hi everyone, 

I did the alignment of RNA seq reads to a number of reference genomes at once, which include the human genome and some bacterial genomes, and now I have one SAM file of all the mapped reads. What I need to do next is to count the number of reads that mapped to only bacterial genomes (regardless of them being multimapped reads or unique mapped reads) and discard all the reads that mapped to both bacterial genomes and human genome. I am not sure if there's any tool or any other ways of alignment to do it, but what I have right now is a bash script that takes a long time because my SAM file is ~ 66gb.

What I did was to extract QNAME(column 1) and SEQ(column 10) that contain "chr" (for human genomes) and lines that contain "NC_" (for bacterial genomes) into separate files:

1) QnameSeqMappedToHg19.txt

2) Bacteria_mapped.sam

cat QnameSeqMappedToHg19.txt|while read Qname Seq
    cat Bacteria_mapped.sam|while read ID Flag RName Pos MapQ Cigar MRNM MPos ISize SeqB Qual Tags
        if [ "${ID}" == "${Qname}" ]
            if [ "${SeqB}" == "${Seq}" ]
                sed -i '/${SeqV}/d' Bacteria_mapped.sam

I have to include sequence and qnames because the reads are paired-end, so I might have two same qnames but different sequences that mapped to the genomes. I am trying to remove all the lines in my bacteria sam files that contain the same qnames and sequence mapped to the human genome. Is there any other way that I can do it? 

Thank you,

alignment • 1.6k views
ADD COMMENTlink modified 3.9 years ago by Brian Bushnell16k • written 3.9 years ago by Jennifer0
gravatar for Brian Bushnell
3.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

Hi Jennifer,

I suggest that the best way to accomplish this is to use BBSplit.  This allows you to map to multiple references simultaneously with multiple output files, one per reference; reads that map best to a given reference will go to that output file.  You can control what happens to reads that map equally well to multiple references with the ambig2 flag.  For example, concatenate all your bacterial references into bacteria.fa, and then run like this: in=reads.fq ref=bacteria.fa,hg19.fa basename=out_%.fq ambig2=toss

You can change the extension of the output to .sam for sam output if you want.  This command also will only map reads with around 95% identity; if you want low-identity reads then add the flag "minid=0.75".

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Brian Bushnell16k

I will try that out, thank you so much! 

ADD REPLYlink written 3.9 years ago by Jennifer0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 773 users visited in the last hour