Question: How does one count unmapped transcripts ?
0
gravatar for roy.granit
5 weeks ago by
roy.granit800
Israel/LabWorm
roy.granit800 wrote:

Hello all, I have a list of un-mapped reads, is there a way to generate a ranked list of transcripts, meaning count the most frequent ones?

Thanks!

rna-seq • 135 views
ADD COMMENTlink modified 5 weeks ago by Charles Warden7.3k • written 5 weeks ago by roy.granit800

Well, if it is unmapped this suggests those reads do not originate from the (annotated) transcriptome. What makes you think they are meaningful?

ADD REPLYlink written 5 weeks ago by ATpoint24k

Indeed, I'm aware that I cannot align these sequences to the transcriptome/genome, but these reads are derived from a certain cell line which might express some 'external' genes that were transfected and I wish to know how many of the reads map to these genes (before I generate a hybrid genome)

ADD REPLYlink written 5 weeks ago by roy.granit800
1

It sounds like you know where these "external" genes have come from. If you have a fasta file with the sequence of these "external" genes then use that for mapping and see how many of the unmapped reads map to this second fasta/genome file.

Depending on your mapping algorithm you might have an unmapped.bam file. You can run bamToFastq and then remap those unmapped reads to your "external" genome. From there you can determine if you want to remap all genes to a hybrid genome.

ADD REPLYlink written 5 weeks ago by shawn.w.foley1.1k
0
gravatar for swbarnes2
5 weeks ago by
swbarnes26.7k
United States
swbarnes26.7k wrote:

I guess you could try to assemble the unmapped reads, and blast the best looking contigs to nr, see if you can identify where they came from. Or just blast the most common unmapped reads.

ADD COMMENTlink written 5 weeks ago by swbarnes26.7k

But how do I find the most common reads to blast them?

ADD REPLYlink written 5 weeks ago by roy.granit800
2

If you have reads in a .sam file and you want to count their frequency you could make a quick python script:

inFile = open('unmapped.sam','r')
outFile = open('unmapped_read_count.txt','w')
countDict = {}
line = inFile.readline()

#Remove header starting with @ symbol
while line.startswith('@'):
    line = inFile.readline()

#Read through file
while line != '':
    lineList = line.rstrip().split('\t')
    if lineList[9] not in countDict: #If this is the first time this sequence is seen
        countDict[lineList[9]] = 1
    else:
        countDict[lineList[9]] = countDict[lineList[9]] + 1 #If this sequence has been seen before add one to the number of times it's been seen
    line = inFile.readline()

#Loop through dictionary and print tab separated read and number of counts
for read in countDict.keys():
    outFile.write(read+'\t'+str(countDict[read])+'\n')

inFile.close()
outFile.close()

It's worth noting that this will count the number of occurrences for identical reads, if you have a sequence that you expect these reads to map to it's a much better idea to map these to that template like I suggested in the above comment.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by shawn.w.foley1.1k
2
samtools view -f 4 mydata.bam | cut -f 10 | sort | uniq -c | sort -nr | head
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by swbarnes26.7k

Thanks. Ended up using this code after converting my .mate1 files into bam using picardtools

ADD REPLYlink written 5 weeks ago by roy.granit800
0
gravatar for Charles Warden
5 weeks ago by
Charles Warden7.3k
Duarte, CA
Charles Warden7.3k wrote:

Do you have a lot of unmapped reads?

Sometimes, there may really be something else contributing to true RNA-Seq sequencing in your sample (where there may actually be benefit to using a joint reference). For example, if you have a virus infection and there are a lot of virus reads (where you may want to use a human+virus reference).

However, I think there is always some amount of cross-contamination and other factors that are going to keep all 100% of your reads from being important to understand. For example, if you have filtered things like adapters and other non-biological sequences, and you still have <2% unexplained reads, I think limits to the genome assembly, purity of de-muliplexed samples, etc. could be a factor.

Not sure if this is relevant to you, but I think there are currently top PhiX BLAST hits that aren't accurate (because someone with a 16S study didn't understand that you can get PhiX sequence in your de-multiplexed samples, even though it wasn't supposed to be there, probably with a single-barcode):

C: Calling Single-Barcode Samples from Mixed Runs as Dual-Barcode Samples | Possibl

In other words, I think it makes a difference whether you have 30% unaligned reads or 5% unaligned reads.

ADD COMMENTlink written 5 weeks ago by Charles Warden7.3k

After adaptor cleaning I have about 30% of unaligned reads therefore I wish to inspect these reads and see what is the source, ideally it would be great to blast these sequences since I have no clue if these originate due to bacterial or viral contamination.

ADD REPLYlink written 5 weeks ago by roy.granit800
1

That does seem pretty high.

I would probably recommend converting those reads to FASTQ (if not done already) and:

1) Look for Over-Represented Sequences using FastQC

2) Perform a de novo assembly and check the largest and most highly expressed unaligned contigs. While you may need to test multiple methods, one possibility is to use Oases for assembly, Bowtie2 + eXpress for alignment / quantification, and BLAST for annotation.

ADD REPLYlink written 5 weeks ago by Charles Warden7.3k
1

Of that 30% how many are categorized as "too short"? @Charles Warden makes a good point about FastQC, you want to be certain this isn't a sequencing/library prep artifact before investing too much time into determining the source of these reads.

ADD REPLYlink written 5 weeks ago by shawn.w.foley1.1k

Yes - good point. Thank You. FastQC will also give you other statistics as well, including read lengths (assuming you have trimmed them before your 1st alignment).

If that is the problem, your read trimmer probably has a parameter to require a minimum length of read (which is probably a good idea).

ADD REPLYlink written 5 weeks ago by Charles Warden7.3k
1

roy.granit : Have you taken a few of these and blasted them at NCBI? I would start there before you invest a lot of time. If these are contaminants then you would want to trace back to see where they originated.

30% reads mapping to an external gene seems like a lot ,unless you have insertions of the gene everywhere and all of them are being expressed. You have already received the suggestion of aligning to this "external gene" to see if they are indeed originating there.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by genomax73k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1802 users visited in the last hour