Question

All my reads fall in intergenic regions?

7.2 years ago

debitboro ▴ 270

Hi biostars,

I aligned short RNA-seq reads (22-50 nt) from a total RNA-seq experiment with Bowtie, and almost 90% of the reads are multi-mapped. I then counted the reads against the Ensembl GTF annotation with mmquant, which is designed for counting when the rate of multi-mapped reads is high (htseq-count and featureCounts do not take multi-mapped reads into account by default, which is why I used mmquant). From the resulting count matrix, I used a shell script to sum the reads per biotype class (protein_coding, lincRNA, rRNA, ...). About 80% of the alignments fall in intergenic regions (lincRNA), and only 6% of my reads correspond to protein_coding!
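
The per-biotype summary described above can be sketched in Python as follows. This is only a minimal illustration, not the script used in the post. It assumes an Ensembl-style GTF whose `gene` records carry `gene_id` and `gene_biotype` attributes, and it assumes that a feature shared by several genes is written as gene IDs joined by `--`. The function names, the example records, and the `ambiguous` label are hypothetical.

```python
import re
from collections import defaultdict

def gene_biotypes(gtf_lines):
    """Map gene_id -> gene_biotype using the 'gene' records of a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Attribute column looks like: gene_id "G1"; gene_biotype "lincRNA";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if "gene_id" in attrs:
            mapping[attrs["gene_id"]] = attrs.get("gene_biotype", "unknown")
    return mapping

def counts_per_biotype(counts, biotypes, sep="--"):
    """Sum counts per biotype; merged features with mixed biotypes go to 'ambiguous'."""
    totals = defaultdict(int)
    for feature, n in counts.items():
        types = {biotypes.get(gene, "unknown") for gene in feature.split(sep)}
        label = types.pop() if len(types) == 1 else "ambiguous"  # assumption: no splitting of counts
        totals[label] += n
    return dict(totals)

# Hypothetical example records, for illustration only.
EXAMPLE_GTF = [
    '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    '1\tensembl\tgene\t200\t300\t.\t+\t.\tgene_id "G2"; gene_biotype "lincRNA";',
    '1\tensembl\tgene\t400\t500\t.\t-\t.\tgene_id "G3"; gene_biotype "lincRNA";',
]
EXAMPLE_COUNTS = {"G1": 6, "G2": 50, "G2--G3": 30, "G1--G2": 14}

totals = counts_per_biotype(EXAMPLE_COUNTS, gene_biotypes(EXAMPLE_GTF))
grand_total = sum(totals.values())
for biotype, n in sorted(totals.items()):
    print(f"{biotype}\t{n}\t{100 * n / grand_total:.1f}%")
```

Keeping mixed-biotype features in a separate class, rather than splitting their counts, makes it visible how much of the signal cannot be assigned to a single biotype.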

Can I continue downstream analysis with such results?

Any ideas?

RNA-Seq count per biotype intergenic regions • 4.0k views

ADD COMMENT • link updated 7.2 years ago by Friederike 9.0k • written 7.2 years ago by debitboro ▴ 270

Did I understand correctly that you sequenced small RNAs such as miRNAs and expect protein-coding genes?

ADD REPLY • link 7.2 years ago by WouterDeCoster 48k

It is a total RNA-seq experiment. The sequencing was done (single-end) on degraded RNA samples with a particular library preparation protocol, which is why I got very short reads. We don't target any class of RNAs.

ADD REPLY • link 7.2 years ago by debitboro ▴ 270

If it's total RNA, I would expect you to have >80% rRNA.

ADD REPLY • link 7.2 years ago by Fabio Marroni ★ 3.0k

80% rRNA

Even if the rRNAs were removed during the experiment with an rRNA depletion kit?

ADD REPLY • link updated 7.2 years ago by Ram 45k • written 7.2 years ago by debitboro ▴ 270

You did not include that critical piece of information in the original post. If that is true (and if the depletion worked as expected), it is unclear why you have 90% multi-mapped reads (per featureCounts/htseq-count?).

ADD REPLY • link 7.2 years ago by GenoMax 152k

Since my read lengths range from 22 to 50 nt, I think a high rate of multi-mapped reads is to be expected. A very short read of 25 nt will align to more locations in the genome than a longer read. Am I right?
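
The effect of read length on multi-mapping can be illustrated with a toy simulation. This is a sketch under stated assumptions, not an analysis of real data: the "genome" is random sequence with a diverged repeat family inserted at regular intervals, only exact matches are considered (no aligner mismatch tolerance), and all names and parameter values (`toy_genome`, `multi_fraction`, 3% divergence, and so on) are hypothetical.

```python
import random
from collections import Counter

def toy_genome(n_copies=20, spacer_len=1000, repeat_len=300, divergence=0.03, seed=0):
    """Random spacers interleaved with slightly diverged copies of one repeat."""
    rng = random.Random(seed)

    def random_seq(n):
        return "".join(rng.choice("ACGT") for _ in range(n))

    repeat = random_seq(repeat_len)
    parts = []
    for _ in range(n_copies):
        parts.append(random_seq(spacer_len))
        # Each base of the copy is redrawn at random with probability `divergence`.
        copy = "".join(
            rng.choice("ACGT") if rng.random() < divergence else base
            for base in repeat
        )
        parts.append(copy)
    return "".join(parts)

def multi_fraction(genome, k):
    """Fraction of start positions whose k-mer occurs more than once in the genome."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    occurrences = Counter(kmers)
    return sum(1 for kmer in kmers if occurrences[kmer] > 1) / len(kmers)

GENOME = toy_genome()
for k in (22, 35, 50):
    print(f"read length {k}: {multi_fraction(GENOME, k):.3f} of positions are not unique")
```

In this toy setting a shorter window is less likely to contain a base that distinguishes one repeat copy from another, so a larger share of short reads has more than one exact match. Real genomes and aligners add further factors, such as allowed mismatches and both strands, that the sketch ignores.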

ADD REPLY • link 7.2 years ago by debitboro ▴ 270

No, not in that case. Sorry, I forgot about that option.

ADD REPLY • link 7.2 years ago by Fabio Marroni ★ 3.0k

Just to confirm: are you expecting to get small RNA reads from a total RNA-seq dataset only because you are aligning with Bowtie v1?

ADD REPLY • link 7.2 years ago by GenoMax 152k

score 1 · Answer 1 · 2018-04-30

I strongly recommend you do stringent quality controls with established tools such as QoRTs or RSeQC.

Things that often go wrong with RNA-seq and that you may want to look out for:

DNA contamination (many reads mapping to non-annotated loci)
lack of library diversity, i.e., you started with very few viable RNA molecules, ended up amplifying those, and then sequenced the same sequences over and over again
rRNA contamination -- your initial statement about the abundance of multiply aligned reads sounds like this may actually be the case for your data
3' bias -- this is often seen with highly degraded RNA
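
As a rough first look at the library diversity point above, the fraction of reads that are exact duplicates of another read can be computed directly from the read sequences. This is a minimal sketch, not a replacement for QoRTs or RSeQC. The function name and the example reads are hypothetical, and in RNA-seq a high value is not conclusive on its own, since highly expressed or very short transcripts naturally produce identical reads.

```python
from collections import Counter

def duplication_rate(read_sequences):
    """Fraction of reads that are redundant copies of an already-seen sequence."""
    counts = Counter(read_sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Distinct sequences are "kept"; everything else is a duplicate.
    return 1 - len(counts) / total

# Hypothetical reads, for illustration only.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTGACCAT"]
print(f"duplication rate: {duplication_rate(reads):.2f}")
```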

Why are you using Bowtie instead of an established splice-aware aligner such as STAR? Is there a reason for you to expect mostly multiply aligning reads, i.e., did you enrich for repetitive regions?

Can I continue downstream analysis with such results?

That depends on the questions you're interested in and the analyses you have in mind.