I tried searching and did not find a relevant question.
The problem is simple: what are the unmapped reads, and how can they be quantified? They could be contamination, polyA, viral or bacterial sequence, or something else.
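For the quantification part, here is a minimal sketch of what I have in mind. It assumes the alignments are available as SAM text (for example from `samtools view -h`) and that the aligner keeps unmapped reads in that output with the 0x4 flag bit set, which may not hold for every aligner or configuration. The function names `count_mapping` and `unmapped_fraction` are my own, not from any library.

```python
from collections import Counter

UNMAPPED = 0x4          # SAM flag bit: read is unmapped
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def count_mapping(sam_lines):
    """Count mapped and unmapped records in an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip secondary/supplementary records so each read is counted once.
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        counts["unmapped" if flag & UNMAPPED else "mapped"] += 1
    return counts


def unmapped_fraction(counts):
    """Fraction of counted records that are unmapped (0.0 if none counted)."""
    total = counts["mapped"] + counts["unmapped"]
    return counts["unmapped"] / total if total else 0.0
```

With paired-end data each mate is a separate record, so this counts mates rather than fragments.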
I have usually seen around 5% of reads unmapped for DNA-seq and up to 40% for RNA-seq. The numbers are high for ChIP-seq and miRNA-seq as well. Some of this could be due to inefficient mapping or low-quality data. Running BLAST against NR for all the unmapped reads is one option, but BLAST is terribly slow.
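One way around the BLAST speed problem might be to BLAST only a random subset of the unmapped reads and extrapolate the composition from that. A rough sketch, again assuming SAM text input with unmapped reads flagged 0x4; `sample_unmapped` and `to_fasta` are hypothetical helper names, and the sample size `k` is something you would have to choose yourself.

```python
import random


def sample_unmapped(sam_lines, k, seed=0):
    """Reservoir-sample up to k (name, sequence) pairs from unmapped SAM records."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for line in sam_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only records with the 0x4 (unmapped) flag bit set.
        if not int(fields[1]) & 0x4:
            continue
        seen += 1
        record = (fields[0], fields[9])  # QNAME and SEQ columns
        if len(reservoir) < k:
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = record
    return reservoir


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

The resulting FASTA could then be used as the BLAST query. Reservoir sampling keeps memory use bounded by `k`, so the full set of unmapped reads never has to be loaded at once.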
Either way, I am looking for any resources or papers on this.
Organism: Human
Data type: DNA, RNA, ChIP-seq (I understand RNA will have more unmapped reads due to splice junction mapping, etc.)
Reference: hg19, all chromosomes (using TopHat for RNA data)
Preprocessing: none
I guess most of you are listing one step or another, but I was hoping for a comprehensive solution that can be implemented, so that essentially all the sequenced reads are accounted for.