How does one count unmapped transcripts ?
2
0
Entering edit mode
3.1 years ago
roy.granit ▴ 880

Hello all, I have a list of un-mapped reads, is there a way to generate a ranked list of transcripts, meaning count the most frequent ones?

Thanks!

RNA-Seq • 1.1k views
0
Entering edit mode

Well, if it is unmapped this suggests those reads do not originate from the (annotated) transcriptome. What makes you think they are meaningful?

0
Entering edit mode

Indeed, I'm aware that I cannot align these sequences to the transcriptome/genome, but these reads are derived from a certain cell line which might express some 'external' genes that were transfected and I wish to know how many of the reads map to these genes (before I generate a hybrid genome)

1
Entering edit mode

It sounds like you know where these "external" genes have come from. If you have a fasta file with the sequence of these "external" genes then use that for mapping and see how many of the unmapped reads map to this second fasta/genome file.

Depending on your mapping algorithm you might have an unmapped.bam file. You can run bamToFastq and then remap those unmapped reads to your "external" genome. From there you can determine if you want to remap all genes to a hybrid genome.

0
Entering edit mode
3.1 years ago

I guess you could try to assemble the unmapped reads, and blast the best looking contigs to nr, see if you can identify where they came from. Or just blast the most common unmapped reads.

0
Entering edit mode

But how do I find the most common reads to blast them?

2
Entering edit mode

If you have reads in a .sam file and you want to count their frequency you could make a quick python script:

inFile = open('unmapped.sam','r')
countDict = {}

#Remove header starting with @ symbol
while line.startswith('@'):

while line != '':
lineList = line.rstrip().split('\t')
if lineList[9] not in countDict: #If this is the first time this sequence is seen
countDict[lineList[9]] = 1
else:
countDict[lineList[9]] = countDict[lineList[9]] + 1 #If this sequence has been seen before add one to the number of times it's been seen

#Loop through dictionary and print tab separated read and number of counts

inFile.close()
outFile.close()


It's worth noting that this will count the number of occurrences for identical reads, if you have a sequence that you expect these reads to map to it's a much better idea to map these to that template like I suggested in the above comment.

2
Entering edit mode
samtools view -f 4 mydata.bam | cut -f 10 | sort | uniq -c | sort -nr | head

0
Entering edit mode

Thanks. Ended up using this code after converting my .mate1 files into bam using picardtools

0
Entering edit mode
3.1 years ago

Do you have a lot of unmapped reads?

Sometimes, there may really be something else contributing to true RNA-Seq sequencing in your sample (where there may actually be benefit to using a joint reference). For example, if you have a virus infection and there are a lot of virus reads (where you may want to use a human+virus reference).

However, I think there is always some amount of cross-contamination and other factors that are going to keep all 100% of your reads from being important to understand. For example, if you have filtered things like adapters and other non-biological sequences, and you still have <2% unexplained reads, I think limits to the genome assembly, purity of de-muliplexed samples, etc. could be a factor.

Not sure if this is relevant to you, but I think there are currently top PhiX BLAST hits that aren't accurate (because someone with a 16S study didn't understand that you can get PhiX sequence in your de-multiplexed samples, even though it wasn't supposed to be there, probably with a single-barcode):

C: Calling Single-Barcode Samples from Mixed Runs as Dual-Barcode Samples | Possibl

In other words, I think it makes a difference whether you have 30% unaligned reads or 5% unaligned reads.

0
Entering edit mode

After adaptor cleaning I have about 30% of unaligned reads therefore I wish to inspect these reads and see what is the source, ideally it would be great to blast these sequences since I have no clue if these originate due to bacterial or viral contamination.

1
Entering edit mode

That does seem pretty high.

I would probably recommend converting those reads to FASTQ (if not done already) and:

1) Look for Over-Represented Sequences using FastQC

2) Perform a de novo assembly and check the largest and most highly expressed unaligned contigs. While you may need to test multiple methods, one possibility is to use Oases for assembly, Bowtie2 + eXpress for alignment / quantification, and BLAST for annotation.

1
Entering edit mode

Of that 30% how many are categorized as "too short"? @Charles Warden makes a good point about FastQC, you want to be certain this isn't a sequencing/library prep artifact before investing too much time into determining the source of these reads.

0
Entering edit mode

Yes - good point. Thank You. FastQC will also give you other statistics as well, including read lengths (assuming you have trimmed them before your 1st alignment).

If that is the problem, your read trimmer probably has a parameter to require a minimum length of read (which is probably a good idea).

1
Entering edit mode

roy.granit : Have you taken a few of these and blasted them at NCBI? I would start there before you invest a lot of time. If these are contaminants then you would want to trace back to see where they originated.

30% reads mapping to an external gene seems like a lot ,unless you have insertions of the gene everywhere and all of them are being expressed. You have already received the suggestion of aligning to this "external gene" to see if they are indeed originating there.