Question

How To Quantify Unmapped Reads

6

11.9 years ago

Bioinfosm ▴ 620

I searched but did not find a relevant question.

The problem is simple: what are the unmapped reads, and how can they be quantified? Unmapped reads could be contamination, poly-A, viral or bacterial sequence, or something else!

I have usually seen around 5% of reads unmapped for DNA and up to 40% for RNA-seq. The numbers are high for ChIP-seq and miRNA-seq as well. Some of this could be due to inefficient mapping or low-quality data. One common approach is to BLAST all the unmapped reads against NR, but BLAST is terribly slow.

Either way, I am looking for any resources or papers on this.

Thanks!

More details

Organism: Human
Data type: DNA, RNA, ChIP-seq (I understand RNA will have more unmapped reads due to splice junction mapping, etc.)
Reference: hg19, all chromosomes (using TopHat for the RNA data)
Preprocessing: none

Most of you are listing individual steps, but I was hoping for a comprehensive solution that can be implemented, so that essentially all the sequenced reads are accounted for.

next-gen sequencing • 9.0k views

ADD COMMENT • link updated 11.9 years ago by swbarnes2 14k • written 11.9 years ago by Bioinfosm ▴ 620

0

What kind of data do you have? I am assuming RNA-seq since you mentioned poly-A. What are you aligning against? The reference you align against will affect how many unmapped reads you get. Do you use hg19 chr1-22, X, Y, M only, or do you also include the supercontigs? Do you do any preprocessing to remove artifacts before aligning? What is your organism? Your question is missing a lot of important details, and you should edit it to make these points clearer.

ADD REPLY • link 11.9 years ago by Ying W ★ 4.2k

score 3 · Answer 1 · 2012-05-24

What kind of data do you have? There can be many explanations for unmapped reads, and which ones matter depends on your analysis and what you are looking for.

As you said, there can be contamination, viral sequence, etc.
Or, if you have RNA-seq data, reads can span splice junctions.
Or there can be problems during sequencing (more errors).
....................

Now, if you are only interested in SNPs, I think you don't have to worry about unmapped reads.

But if you are interested in structural variation, you should keep your unmapped reads, because they may span a breakpoint.

If you want to find out whether there is any viral DNA, you can use BLAST as you said, or, if you want something faster than BLAST, you can use BLAT.
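
Both BLAST and BLAT take FASTA as query input, so the unmapped reads first have to be pulled out of the alignment file. A minimal sketch, assuming tab-separated SAM text on stdin; the function name `sam_unmapped_to_fasta` is made up for illustration, and the unmapped test is bit 0x4 of the FLAG column:

```shell
# Convert SAM records on stdin to FASTA, keeping only reads whose FLAG has
# the "unmapped" bit (0x4) set. Function name is hypothetical.
sam_unmapped_to_fasta() {
    awk -F'\t' '
        /^@/ { next }                 # skip SAM header lines
        int($2 / 4) % 2 == 1 {        # bit 0x4 set: the read is unmapped
            print ">" $1              # read name becomes the FASTA header
            print $10                 # column 10 holds the read sequence
        }'
}
```

Something like `samtools view whole.bam | sam_unmapped_to_fasta > unmapped.fa` would then give a query file for either tool. Note that both mates of a pair share the same read name, so you may want to add a suffix if the downstream tool needs unique identifiers.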

If you suspect a high sequencing error rate, you can allow more mismatches when aligning the reads.

For RNA-seq, unmapped reads can be aligned with a split-read approach such as the one TopHat or SOAPsplice uses.

So you have to ask yourself: what kind of data do you have? How was it generated? What are you interested in? Is the fraction of unmapped reads much higher than you were expecting? Different data, different analysis, different approaches. :)

EDIT: Found a paper on RNA-seq data in which they do something with unmapped reads (I have not read it).

score 2 · Answer 2 · 2012-05-24

It totally depends on what kind of experiment you are doing.

For starters, you can do de novo assembly on the unmapped reads and see if they form contigs. But that won't work unless you have roughly 10x or more coverage of whatever that sequence is.

You can do something like:

samtools view -f4 whole.bam | cut -f10 | sort | uniq -c | sort -nr > unmapped_unique.count

This will tell you how many copies of each unmapped read sequence you have. Poly-As will be pretty obvious by that route, as will primer dimers or adapter dimers.
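
If the count table is too long to eyeball, the poly-A part can be tallied automatically. A rough sketch, not a standard tool: the function name `tally_homopolymers` and the 0.9 threshold (a read counts as poly-X when one base makes up at least 90% of it) are arbitrary choices for illustration. It expects one read sequence per line on stdin:

```shell
# Tally reads dominated by a single base. Input: one sequence per line.
# ASSUMPTION: the 0.9 default threshold is arbitrary; pass another as $1.
tally_homopolymers() {
    awk -v min_frac="${1:-0.9}" '
        length($0) > 0 {
            total++
            seq = toupper($0)
            best = "other"
            n = split("A C G T", bases, " ")
            for (i = 1; i <= n; i++) {
                tmp = seq
                hits = gsub(bases[i], "", tmp)   # occurrences of this base
                if (hits / length(seq) >= min_frac) best = "poly" bases[i]
            }
            count[best]++
        }
        END {
            # one line per category: name, read count, percent of all reads
            for (k in count)
                printf "%s\t%d\t%.1f%%\n", k, count[k], 100 * count[k] / total
        }' | sort
}
```

It slots into the same pipeline as the one-liner, e.g. `samtools view -f4 whole.bam | cut -f10 | tally_homopolymers`. Primer and adapter dimers will not be caught this way; those still need the count table or a comparison against your actual adapter sequences.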

You can also check the quality of the unmapped reads; sometimes reads have poor base quality and won't map because the sequence is inaccurate.
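
One way to put a number on that is the mean base quality per unmapped read. This is only a sketch under two assumptions that you should check against your own data: qualities are Phred+33 encoded, and Q20 is a reasonable cutoff for "low quality". The function name `low_quality_unmapped` is made up:

```shell
# Count unmapped reads whose mean base quality falls below a cutoff.
# ASSUMPTIONS: SAM text on stdin, Phred+33 qualities, cutoff Q20 unless
# another value is passed as $1.
low_quality_unmapped() {
    awk -F'\t' -v cutoff="${1:-20}" '
        BEGIN {
            # lookup table: printable ASCII character -> numeric code
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        /^@/ { next }                            # skip header lines
        int($2 / 4) % 2 == 1 && $11 != "*" {     # unmapped, qualities present
            total++
            sum = 0
            len = length($11)
            for (i = 1; i <= len; i++) sum += ord[substr($11, i, 1)] - 33
            if (sum / len < cutoff) low++
        }
        END { printf "unmapped=%d low_quality=%d\n", total, low }'
}
```

Run it as `samtools view whole.bam | low_quality_unmapped`. If a large share of the unmapped reads is low quality, trimming or filtering before alignment is probably a better use of time than BLASTing them.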

score 1 · Answer 3 · 2012-05-24

1

11.9 years ago

JC 13k

Quantify or annotate unmapped reads? Quantifying is simple, just count them :)
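
For the counting part, a small sketch that reads SAM text and reports the unmapped fraction. The function name `unmapped_fraction` is made up; secondary (0x100) and supplementary (0x800) records are skipped so that a read with several alignments is not counted more than once:

```shell
# Report total reads, unmapped reads and percent unmapped from SAM on stdin.
unmapped_fraction() {
    awk -F'\t' '
        /^@/ { next }                                   # skip header lines
        int($2 / 256) % 2 == 1 || int($2 / 2048) % 2 == 1 { next }  # extra alignments
        { total++ }
        int($2 / 4) % 2 == 1 { unmapped++ }             # bit 0x4: unmapped
        END {
            if (total == 0) { print "total=0 unmapped=0"; exit }
            printf "total=%d unmapped=%d percent=%.2f\n",
                   total, unmapped, 100 * unmapped / total
        }'
}
```

Usage would be `samtools view whole.bam | unmapped_fraction`. If you only need the raw number, `samtools view -c -f 4 whole.bam` or `samtools flagstat whole.bam` gets you there without any awk.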

On the other hand, as you pointed out, the sources can be various: sequencing errors, contamination, unknown genes, ...

You can try to identify the sources step by step: filter out low-complexity reads with DUST and low-quality reads by simply counting bad positions, then check known contamination sources such as microorganisms expected to be there (for human there are several efforts to characterize our microbiota, like http://commonfund.nih.gov/hmp/), and finally try to align what is left to the reference with BLAT/BLAST to improve the mapping.

ADD COMMENT • link 11.9 years ago by JC 13k

0

By quantify I meant annotate and count: which reads go into which group/category.

ADD REPLY • link 11.9 years ago by Bioinfosm ▴ 620