Simulation of RNA-seq read mapping with reads having different polyA tail lengths
8.7 years ago

I want to estimate what fraction of reads with long polyA tails at their ends can still be mapped, since most mapping tools are likely to discard such reads. To do this, I would like to simulate read mapping by designing reads with polyA tails of different lengths and calculating their mappability index.

Can anyone help with how to calculate a "mappability index" as a function of the length of the polyA (mismatched) stretch at the end of the read? Is this the right approach, or should I first trim the polyA tails from the unmapped reads and then calculate the mappability of the trimmed reads near the ends of transcripts? I suspect a few such reads would still map in the initial mapping and so would not be captured in the pool of unmapped reads.
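To make the simulation idea concrete, here is a minimal sketch of how reads with tails of different lengths could be generated from the 3' end of a known transcript. The function names `simulate_polya_reads` and `to_fastq` are made up for illustration, and the constant quality string is an assumption. Mapping the resulting FASTQ with an aligner and counting, per tail length, how many reads land at the expected position would then give one possible definition of a mappability index.

```python
def simulate_polya_reads(transcript, read_len, tail_lengths):
    """Build one read per tail length from the 3' end of `transcript`.

    Each read has total length `read_len`: the last `tail` bases are A,
    the rest is the transcript sequence immediately upstream of its end.
    """
    reads = {}
    for tail in tail_lengths:
        body_len = read_len - tail
        if body_len < 0 or body_len > len(transcript):
            raise ValueError("tail length incompatible with read/transcript length")
        # Guard against transcript[-0:], which would return the whole string.
        body = transcript[len(transcript) - body_len:] if body_len else ""
        reads[tail] = body + "A" * tail
    return reads


def to_fastq(name, seq, qual_char="I"):
    """Format one FASTQ record with a constant (assumed) quality character."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


if __name__ == "__main__":
    # Toy transcript; a real run would loop over annotated transcript sequences.
    toy = "ACGTACGTAC"
    for tail, seq in simulate_polya_reads(toy, 8, [0, 3, 8]).items():
        print(to_fastq(f"toy_tail{tail}", seq), end="")
```

Encoding the tail length in the read name makes it easy to group the aligner output by tail length afterwards.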

My main goal is to find the different transcript ends of a single transcript in a single tissue/molecule/sample from its RNA-seq data.

rna-seq simulation • 2.7k views

A small remark. I used to preprocess reads by trimming polyA stretches (in addition to adapters and low-quality bases). However, the number of reads sequenced from fragments containing poly-A turned out to be very low. Might this have to do with the inherent GC bias of the sequencer? I just ran grep -ce "AAAAAAAAAA\$" on a FASTQ file from a HiSeq 2000 lane, and out of 41.7m reads, 5300 matched. That is about 0.013%. So, does it really matter?
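For anyone who wants to repeat this count, a rough Python equivalent of the grep is sketched below. The function name `count_polya_reads` is made up for illustration, and it assumes a standard four-line-per-record FASTQ. Unlike grep on the whole file, it only looks at the sequence line of each record, so quality lines that happen to end in the same characters are not counted.

```python
import re


def count_polya_reads(fastq_lines, min_run=10):
    """Count reads whose sequence ends with at least `min_run` A's.

    `fastq_lines` is any iterable of lines from a 4-line-per-record FASTQ.
    Returns (number of polyA-ending reads, total number of reads).
    """
    pattern = re.compile(r"A{%d}$" % min_run)
    total = 0
    tailed = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the second line of every record
            continue
        total += 1
        if pattern.search(line.rstrip("\r\n")):
            tailed += 1
    return tailed, total


if __name__ == "__main__":
    demo = [
        "@r1\n", "ACGTAAAAAAAAAA\n", "+\n", "IIIIIIIIIIIIII\n",
        "@r2\n", "ACGTACGTACGTAC\n", "+\n", "IIIIIIIIIIIIII\n",
    ]
    tailed, total = count_polya_reads(demo)
    print(f"{tailed} of {total} reads ({100 * tailed / total:.3f}%)")
```

For a real file, an open file handle (or `gzip.open(path, "rt")` for a compressed one) can be passed in place of the list.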


Yes. To track polyA sites we usually take reads ending with runs of 4 A/T bases, so you would get fewer reads when requiring a run of As as long as yours. In two of my samples, I got 276,079 and 311,972 reads ending with 4 As out of 29.6m and 30.8m reads in total (roughly 1% in each case). These will certainly include reads with sequencing errors and other false positives.
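As a sketch of the selection step described here, the helpers below strip a trailing run of As (or a leading run of Ts, for reads from the opposite strand) and report how long the run was, so the trimmed read can be remapped. The names `trim_polya_tail` and `trim_polyt_head` and the default threshold of 4 are illustrative assumptions, and no mismatches inside the run are tolerated.

```python
def trim_polya_tail(seq, min_run=4):
    """Remove a trailing run of A's if it is at least `min_run` long.

    Returns (trimmed sequence, length of the removed run); reads without
    a long enough run are returned unchanged with a run length of 0.
    """
    stripped = seq.rstrip("A")
    run = len(seq) - len(stripped)
    if run >= min_run:
        return stripped, run
    return seq, 0


def trim_polyt_head(seq, min_run=4):
    """Same idea for a leading run of T's (polyA seen from the other strand)."""
    stripped = seq.lstrip("T")
    run = len(seq) - len(stripped)
    if run >= min_run:
        return stripped, run
    return seq, 0


if __name__ == "__main__":
    for read in ["ACGTAAAA", "ACGTAAA", "TTTTTACG"]:
        print(read, trim_polya_tail(read), trim_polyt_head(read))
```

Keeping the removed run length alongside the trimmed read lets you check later whether the genomic sequence downstream of the mapped position is itself A-rich, which is one common source of the false positives mentioned above.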

The UCSC Genome Browser provides information for around 41-43 thousand RefSeq transcripts in the hg19 assembly, so I guess that if I track these reads I might get sufficient information for quite a few transcript ends.