Identify RNA-seq reads containing polyA sequence
2.7 years ago
goodez ▴ 510

This may seem like a weird question, but we need to filter our RNA-seq data for reads that contain polyA. The data is stranded RNA-seq with 50 bp reads. Would it be easier to find these reads before or after alignment? To be clear, the 50 bp read itself needs to contain a stretch of polyA, not just come from a transcript containing polyA. Has anyone done this type of analysis?

2.7 years ago
GenoMax 111k

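Use bbduk.sh from the BBMap suite in filter mode (don't specify the ktrim= or qtrim= options) with literal=AAAAA to filter the reads out (adjust the number of A's as needed). Set k= to no more than the length of the literal, since BBDuk does not match reference sequences shorter than k. Run it on the original (raw) reads.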
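For anyone who wants to see the filtering logic spelled out, here is a minimal Python sketch of the same idea. It is not what bbduk.sh does internally, just a plain substring test. It assumes uncompressed, unwrapped 4-line FASTQ records; the function names, the default `min_len` of 5, and the choice to test both A and T runs are illustrative assumptions.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# One FASTQ record: header, sequence, separator line, qualities.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records; a trailing incomplete record is dropped."""
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def has_homopolymer(seq: str, base: str = "A", min_len: int = 5) -> bool:
    """True if seq contains at least min_len consecutive copies of base."""
    return base * min_len in seq.upper()


def split_reads(records: Iterable[Record],
                bases: Sequence[str] = ("A", "T"),
                min_len: int = 5) -> Tuple[List[Record], List[Record]]:
    """Split records into (matched, unmatched) by homopolymer content.

    Both A and T are checked by default (an assumption), because which one
    appears in the read depends on the library's strandedness.
    """
    matched: List[Record] = []
    unmatched: List[Record] = []
    for rec in records:
        if any(has_homopolymer(rec[1], b, min_len) for b in bases):
            matched.append(rec)
        else:
            unmatched.append(rec)
    return matched, unmatched
```

With a real file you would pass an open file handle to `read_fastq`, for example `split_reads(read_fastq(open("reads.fastq")))`, where `reads.fastq` is a placeholder name.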

Thanks! I have some additional questions, then. Since it is stranded RNA-seq, the polyA will actually show up as stretches of TTTTTT, right?

Also, I used grep to look for reads containing this, and many of the TTTTTT stretches are in the middle of a read. It doesn't seem possible for the polyA to be surrounded by other sequence on both sides.

If you are capturing the second strand, then yes. Past the TTTTT, the sequence may be running into adapters. You can easily check that by adapter-trimming the reads you filtered out and seeing whether the sequence after the T's gets removed.
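A quick way to look at where the T stretches sit, and what follows them, is sketched below in Python. It reports each run's coordinates, whether it touches the start or end of the read, and the downstream sequence, which can then be compared by eye against the adapter used in your library prep. The `Run` tuple, the function name, and the default `min_len` of 6 are hypothetical choices for illustration.

```python
import re
from typing import List, NamedTuple


class Run(NamedTuple):
    start: int       # 0-based start of the run
    end: int         # 0-based end of the run (exclusive)
    position: str    # "start", "end", "internal", or "whole"
    downstream: str  # read sequence after the run (may be empty)


def find_runs(seq: str, base: str = "T", min_len: int = 6) -> List[Run]:
    """Locate every run of at least min_len copies of base in a read."""
    seq = seq.upper()
    pattern = re.compile(re.escape(base) + "{%d,}" % min_len)
    runs: List[Run] = []
    for m in pattern.finditer(seq):
        at_start = m.start() == 0
        at_end = m.end() == len(seq)
        if at_start and at_end:
            position = "whole"
        elif at_start:
            position = "start"
        elif at_end:
            position = "end"
        else:
            position = "internal"
        runs.append(Run(m.start(), m.end(), position, seq[m.end():]))
    return runs
```

If most "internal" runs are followed by the same downstream sequence across many reads, that points to adapter read-through rather than genuine transcript sequence.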

2.7 years ago
Buffo ★ 1.9k

If the data is stranded, polyA tails will always be at the end of the sequences. I have read a couple of papers where this information is important (mainly to define UTRs; if I remember the reference, I will post it). They usually consider 8-10 nucleotides as the minimum length.
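As a rough illustration of the minimum-length criterion, the Python sketch below measures the homopolymer run at one end of a read and applies a threshold. The default of 8 nt is taken from the lower bound mentioned above; the function names are hypothetical, and which base and which end to test are assumptions that depend on your library protocol.

```python
def terminal_run_length(seq: str, base: str = "A", from_end: bool = True) -> int:
    """Length of the uninterrupted run of base at the 3' end (from_end=True)
    or the 5' end (from_end=False) of the read. base is a single uppercase letter."""
    seq = seq.upper()
    if from_end:
        return len(seq) - len(seq.rstrip(base))
    return len(seq) - len(seq.lstrip(base))


def has_terminal_tail(seq: str, base: str = "A", min_len: int = 8,
                      from_end: bool = True) -> bool:
    """True if the terminal run of base is at least min_len long."""
    return terminal_run_length(seq, base, from_end) >= min_len
```

For reads where the tail shows up as T's at the start of the read, call it as `has_terminal_tail(seq, base="T", from_end=False)`. This strict version allows no mismatches inside the tail, so sequencing errors will shorten the measured run.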

Thanks, that is good to know. Please do share the reference if you find it again!