Hi everyone,
I did the alignment of RNA seq reads to a number of reference genomes at once, which include the human genome and some bacterial genomes, and now I have one SAM file of all the mapped reads. What I need to do next is to count the number of reads that mapped to only bacterial genomes (regardless of them being multimapped reads or unique mapped reads) and discard all the reads that mapped to both bacterial genomes and human genome. I am not sure if there's any tool or any other ways of alignment to do it, but what I have right now is a bash script that takes a long time because my SAM file is ~ 66gb.
What I did was to extract QNAME(column 1) and SEQ(column 10) that contain "chr" (for human genomes) and lines that contain "NC_" (for bacterial genomes) into separate files:
- QnameSeqMappedToHg19.txt
- Bacteria_mapped.sam
cat QnameSeqMappedToHg19.txt|while read Qname Seq
do
cat Bacteria_mapped.sam|while read ID Flag RName Pos MapQ Cigar MRNM MPos ISize SeqB Qual Tags
do
if [ "${ID}" == "${Qname}" ]
then
if [ "${SeqB}" == "${Seq}" ]
then
sed -i '/${SeqV}/d' Bacteria_mapped.sam
fi
fi
done
done
I have to include sequence and qnames because the reads are paired-end, so I might have two same qnames but different sequences that mapped to the genomes. I am trying to remove all the lines in my bacteria sam files that contain the same qnames and sequence mapped to the human genome. Is there any other way that I can do it?
Thank you,
JenniferT
I will try that out, thank you so much!