Question: Analyzing Unaligned Sequence (From Bowtie)
gravatar for newDNASeqer
7.3 years ago by
United States
newDNASeqer710 wrote:

From my RNASeq data, Bowtie aligned 80-90% of the total reads using human genome as the reference, and I am interested in taking a look at the unaligned sequences. Because the RNASeq samples were extracted from humanized mice that has human tumor transplanted, the tissue could be mixed with mice mRNA.

I want to see how much of the unaligned sequences derived from mice and how much of them is from human. Is there any tools that I can use? or I will need to create my own code to read each unaligned sequence and do sequence blast?

bowtie • 4.1k views
ADD COMMENTlink modified 7.3 years ago by swbarnes29.2k • written 7.3 years ago by newDNASeqer710

doesn't bowtie have a command to output unaligned sequences? that should make it easy.

also, are you using bowtie directly or bowtie via tophat?

ADD REPLYlink written 7.3 years ago by brentp23k

I used bowtied via tophat, and it generated a file named "unmapped.bam" file. Within this unmapped.bam file, I want to get information what percentage of the unaligned reads are human and how many are mice. Does bowtie have this functionality? thanks

ADD REPLYlink written 7.3 years ago by newDNASeqer710
gravatar for swbarnes2
7.3 years ago by
United States
swbarnes29.2k wrote:

If you knew the samples were mixed, you could have aligned to a combination of mouse and human reference. That's the best solution; if a read matches mouse perfectly, and human with 2 mismatches, you want it to align to mouse, not to be forced to align to human.

I'd start by doing a simple count of each sequence represented in the unmapped reads. Something like

samtools view unmapped.bam | cut -f 10 | sort | uniq -c | sort -nr > sorted_reads.txt

Will make you a list of each read, and how often it turns up. Obviously it will separate reads that differ from each other by an error, which is not ideal, but this is at least a place to start. If 90% of the reads are the exact same sequence, you want to know that before you start blasting each one individually.

You could also try de novo assembly on the unmapped reads. It won't do much if your reads are scattered across a mammalian genome, but if they are something else, like mouse mitochondrial genes, it could help.

But before you make a program to BLAST each one, just spot-check a few. 80-90% aligned seems alright to me, you might just be looking at noise.

ADD COMMENTlink written 7.3 years ago by swbarnes29.2k

Thank you for the reply. Is the command "samtools view unmapped.bam | cut -f 10 | sort | uniq -c | sort -nr > sorted_reads.txt" outputing all the unmapped reads, only once for each read? I am new to this tool, how do I compare how often each read turns up?

ADD REPLYlink written 7.3 years ago by newDNASeqer710

What that command is is samtools at the head, and lots of unix commands daisy-chained afterwards, so that you don't have intermediate files.

samtools view converts the .bam into plain text. That gets piped into cut, which takes the contents of the 10th column, which is the sequence. The first sort sorts all those sequences alphabetically, uniq -c then compresses the list, so that you see each read, and how many times that exact read appears. The second sort sorts by that count number, so at the end, the top of your file contains the most abundant sequences, and how many times those sequences appears in your .bam

As written, that command will count up every read in the file you give it. If your file is a mix of mapped and unmapped reads, using "-f4" option in samtools view will cause samtools to skip the mapped reads.

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by swbarnes29.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1707 users visited in the last hour