Filtering rRNA sequences in ribosome profiling: use an rRNA database or a GenBank file?
3
3
4.6 years ago
AlicePsyche ▴ 30

Hi, all

I recently started analyzing ribosome profiling data. Many papers recommend filtering out rRNA sequences before mapping to the genome. Which approach is better: using an rRNA database, or using only the annotation file (GenBank) for your organism? Many people seem to use the SILVA database: https://www.arb-silva.de/download/arb-files/

For the GenBank approach, I plan to download the files from Ensembl: ftp://ftp.ensembl.org/pub/release-79/genbank/danio_rerio/ and then use Biopython to pull out all rRNA-related sequences.

What is the usual way to remove rRNA in ribosome profiling data analysis? Any suggestions are welcome!
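For the GenBank route, here is a minimal Biopython sketch of the "pull out rRNA features" step. It assumes the rRNA genes are annotated with the feature type `rRNA`, which may not hold for every Ensembl GenBank dump (they can show up under other feature types or only in qualifiers), so check a few records and adjust `feature_types` accordingly. The function names and file paths are made up for illustration.

```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


def extract_rrna_records(records, feature_types=("rRNA",)):
    """Return one SeqRecord per annotated feature whose type is in feature_types.

    `records` is any iterable of SeqRecord objects, e.g. SeqIO.parse(path, "genbank").
    Assumption: rRNA genes carry the feature type "rRNA" in the input file.
    """
    extracted = []
    for record in records:
        for index, feature in enumerate(record.features):
            if feature.type not in feature_types:
                continue
            # extract() handles strand and compound locations for us
            sequence = feature.extract(record.seq)
            start = int(feature.location.start)
            end = int(feature.location.end)
            extracted.append(
                SeqRecord(
                    sequence,
                    id=f"{record.id}_{feature.type}_{index}",
                    description=f"{start}-{end}",
                )
            )
    return extracted


def write_rrna_fasta(genbank_path, fasta_path):
    """Hypothetical helper: read a GenBank file, write the rRNA features as FASTA."""
    records = SeqIO.parse(genbank_path, "genbank")
    return SeqIO.write(extract_rrna_records(records), fasta_path, "fasta")
```

The resulting FASTA could then be indexed and used as the rRNA reference for the filtering step, on its own or merged with sequences from another source.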

RNA-Seq sequencing Biopython next-gen genome • 3.0k views
2
4.6 years ago

If you happen to be working on an organism that has the full rRNA cassette sequence in its reference genome, then you won't need to additionally align against something from the SILVA database. If not, align against SILVA beforehand.

Note that this is probably not sufficient (at least, it hasn't been for me). You should additionally filter out anything that hits 5S rRNA or 5.8S rRNA or any rRNA repeat regions that RepeatMasker found. I would do the same for tRNA regions. You may also have issues with some piRNAs, but perhaps you'll have more luck there. Every time I work with RiboSeq data from a new organism I end up having to come up with a slightly modified filtering procedure :(

0

Thanks for your helpful reply! I am working on zebrafish, which does not have a full rRNA cassette sequence in its reference :( If I understand correctly, I should merge the SILVA database with the RepeatMasker (rRNA and tRNA) file and then filter by mapping against that, right?

0

Given the answer from Charles Plessy, you can probably just blacklist a bunch of regions. I don't know how you were planning on analysing the data. When I last did something like this, I used deepTools and a bit of Python for the final stuff, so I could trivially blacklist regions. If you plan to use something else, you'll want to make a BED file and reverse intersect with it (bedtools intersect with the -v option, which keeps only entries that do not overlap the BED file).
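If you end up doing the blacklisting in your own Python code rather than with bedtools, the idea can be sketched roughly like this. It is only an illustration of the "drop anything overlapping a BED region" logic, assuming 0-based half-open coordinates as in BED; the function names are hypothetical and a real pipeline would read alignments from a BAM file rather than plain tuples.

```python
from bisect import bisect_right
from collections import defaultdict


def load_blacklist(bed_lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}."""
    regions = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in regions.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlapping or touching: extend the previous interval
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def overlaps_blacklist(blacklist, chrom, start, end):
    """True if the half-open interval [start, end) overlaps a blacklisted region."""
    intervals = blacklist.get(chrom)
    if not intervals:
        return False
    # Last merged interval that begins before `end`
    idx = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return idx >= 0 and intervals[idx][1] > start


def filter_reads(blacklist, reads):
    """Keep (chrom, start, end) tuples that do not touch the blacklist."""
    return [r for r in reads if not overlaps_blacklist(blacklist, *r)]
```

Merging the regions up front means each lookup only has to inspect one interval, which matters when the RepeatMasker track contains many overlapping rRNA and tRNA hits.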

0

I plan to map the raw reads to the rRNA regions using bowtie and then map the unmapped reads (as a FASTQ file) to the genome using tophat, as this paper suggested: http://www.nature.com/nature/journal/v503/n7476/full/nature12632.html

By the way, I tested my data using the RepeatMasker file (rRNA and tRNA), and it turned out that only 2% of the raw reads mapped to rRNA regions. Is that normal?

I use deepTools a lot when dealing with ChIP-seq data ;) but have never tried it with RNA-related experiments. Maybe it's time... Thanks!
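The bowtie-then-tophat plan above could be scripted roughly as follows. This is a sketch only: the index names, output file names and thread count are placeholders, the bowtie call assumes Bowtie 1 with the index given as the first positional argument, and tophat may need a different index type depending on the version. The function only builds the command lines, so they can be inspected before anything is run.

```python
import subprocess


def build_commands(reads_fastq, rrna_index, genome_index,
                   unmapped_fastq="non_rrna.fastq",
                   rrna_sam="rrna_hits.sam",
                   tophat_dir="tophat_out",
                   threads=4):
    """Return [bowtie_cmd, tophat_cmd] as argument lists (nothing is executed).

    All file and index names here are hypothetical placeholders.
    """
    bowtie_cmd = [
        "bowtie",
        "-p", str(threads),
        "-S",                    # write hits in SAM format
        "--un", unmapped_fastq,  # reads that did NOT align to rRNA go here
        rrna_index,
        reads_fastq,
        rrna_sam,
    ]
    tophat_cmd = [
        "tophat",
        "-p", str(threads),
        "-o", tophat_dir,
        genome_index,
        unmapped_fastq,          # only the rRNA-depleted reads are mapped
    ]
    return [bowtie_cmd, tophat_cmd]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `build_commands("reads.fastq", "rrna_idx", "danRer_idx")` and printing the result is a cheap way to double-check the arguments before passing them to `run_pipeline`.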

0

2% is lower than what I got with human/mouse, but that likely just means your library prep was better than what I was given :)

You'll find the --Offset option in bamCoverage useful. I created that for RiboSeq and related datasets so I could quickly check for pausing with a bit of python.
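To make the offset idea concrete, here is a small sketch of how a single position inside a read can be picked out, taking the strand into account. This is not the deepTools implementation, just an illustration under simplifying assumptions: 0-based half-open alignment coordinates, no spliced alignments, positive offsets counted from the 5' end of the read (1 is the first base) and negative offsets counted from the 3' end. The function name is made up.

```python
def offset_position(start, end, strand, offset):
    """Return the genomic position (0-based) of the `offset`-th base of a read.

    start, end : alignment coordinates, 0-based half-open
    strand     : "+" or "-"
    offset     : 1 is the first base in read orientation, -1 the last; 0 is invalid
    Returns None if the offset falls outside the read.
    """
    if offset == 0:
        raise ValueError("offset must be non-zero")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if abs(offset) > end - start:
        return None

    if strand == "+":
        # 5' end of the read is at `start`
        return start + offset - 1 if offset > 0 else end + offset
    # Reverse strand: 5' end of the read is at `end - 1`
    return end - offset if offset > 0 else start - offset - 1
```

Counting these single positions per transcript coordinate, instead of whole-read coverage, is what makes pausing sites stand out as sharp peaks.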

1
4.6 years ago
Charles Plessy ★ 2.8k

Some years ago I searched for the rRNA sequences in the zebrafish genome version 9. You can find my notes on GitHub (charles-plessy/zebrafish_rRNA). I hope they can be useful to you. Comments are welcome.

0
4.6 years ago
h.mon 33k

What was used in the papers you read? That would be a good start.

Or use SortMeRNA or BBDuk; both have been mentioned plenty of times on this forum.

0

Oh, the papers do not specify an rRNA database... I searched and found that there may be many options.

Thanks! I am not familiar with rRNA, so I will study the tools you mentioned carefully.