How do you find the percentage of reads contaminated by rRNA?
7.3 years ago
espop23 ▴ 60

I have a file of the intersecting regions of TSS and rRNA. How do I find the percentage of reads contaminated by rRNA?

RNA R rRNA • 4.0k views
7.3 years ago
Gjain 5.7k

Hi,

You can:

  1. Download the ncRNA FASTA sequences (here) and the gene set annotation file (GTF).
  2. Extract the rRNA sequences from the FASTA file.
  3. Map your reads to these rRNA sequences and count how many of them align.

In more detail:

The fasta file will look like:

head Homo_sapiens.GRCh38.ncrna.fa
>ENST00000629478 proj_ncrna:known chromosome:GRCh38:CHR_HG1832_PATCH:210374154:210374267:-1 gene:ENSG00000281499 gene_biotype:snRNA transcript_biotype:snRNA
ACACTGGTTTCTCTTCAGATCGAATAAATCTTTCGCCTTTTACTAAAGATTTCCGTGGAG
AGAAACAAATCAGTTATAAGCTAATTTTTTGTAAGCCTTGCCCTGGGGAGGCAG
>ENST00000516494 proj_ncrna:known chromosome:GRCh38:CHR_HG2128_PATCH:67546651:67546754:1 gene:ENSG00000252303 gene_biotype:snRNA transcript_biotype:snRNA
GTGCTCACTTTGGCAACATACATACTAAAATTGGACGGATACAGACATAAACATGGCCCC
TGCACAAGGATGACATGCAAATTCATGAAGCATTCCATATTTTT
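
A rough sketch of the extraction step, assuming the headers carry a `gene_biotype:` token as in the output above. The function name `extract_rrna_records` and the toy records are made up for illustration; mitochondrial rRNA may carry a different biotype label (for example `Mt_rRNA`), so check the biotypes present in your own file.

```python
import io

def extract_rrna_records(fasta_lines, biotype="rRNA"):
    """Yield (header, sequence) for FASTA records whose header
    contains the token gene_biotype:<biotype>."""
    wanted = f"gene_biotype:{biotype}"
    header, seq_parts = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Emit the previous record if it matched the biotype.
            if header is not None and wanted in header.split():
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif line:
            seq_parts.append(line)
    # Do not forget the last record in the file.
    if header is not None and wanted in header.split():
        yield header, "".join(seq_parts)

# Toy input (hypothetical IDs and sequences, same header layout as above).
toy = io.StringIO(
    ">T1 proj_ncrna:known gene:G1 gene_biotype:snRNA transcript_biotype:snRNA\n"
    "ACGT\nACGT\n"
    ">T2 proj_ncrna:known gene:G2 gene_biotype:rRNA transcript_biotype:rRNA\n"
    "GGCC\nTTAA\n"
)
records = list(extract_rrna_records(toy))
for header, seq in records:
    print(header.split()[0], seq)
```

For the real file you would pass `open("Homo_sapiens.GRCh38.ncrna.fa")` instead of the toy input.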

And the GTF file:

head rRNA_Homo_sapiens.GRCh38.81.gtf 
1    ensembl    gene    9437669    9437778    .    -    .    gene_id "ENSG00000252956"; gene_version "1"; gene_name "RNA5SP40"; gene_source "ensembl"; gene_biotype "rRNA";
1    ensembl    gene    13623184    13623284    .    -    .    gene_id "ENSG00000222952"; gene_version "1"; gene_name "RNA5SP41"; gene_source "ensembl"; gene_biotype "rRNA";
1    ensembl    gene    34112949    34113063    .    +    .    gene_id "ENSG00000201148"; gene_version "1"; gene_name "RNA5SP42"; gene_source "ensembl"; gene_biotype "rRNA";
1    ensembl    gene    37264677    37264786    .    -    .    gene_id "ENSG00000252368"; gene_version "1"; gene_name "RNA5SP43"; gene_source "ensembl"; gene_biotype "rRNA";
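
If you want to pull gene names out of a GTF like this yourself, something along these lines should work. It is only a sketch: `parse_gtf_line` and `rrna_gene_names` are illustrative names, and it assumes the Ensembl-style attributes (`gene_id`, `gene_name`, `gene_biotype`) shown above.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000252956";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_line(line):
    """Split one GTF line into its 9 columns and parse the attribute column."""
    (chrom, source, feature, start, end,
     score, strand, frame, attrs) = line.rstrip("\n").split(None, 8)
    return {
        "chrom": chrom,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTR_RE.findall(attrs)),
    }

def rrna_gene_names(gtf_lines):
    """Return {gene_id: gene_name} for gene features with biotype rRNA."""
    names = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        rec = parse_gtf_line(line)
        attrs = rec["attributes"]
        if rec["feature"] == "gene" and attrs.get("gene_biotype") == "rRNA":
            names[attrs["gene_id"]] = attrs.get("gene_name", attrs["gene_id"])
    return names

example = [
    '1\tensembl\tgene\t9437669\t9437778\t.\t-\t.\t'
    'gene_id "ENSG00000252956"; gene_version "1"; gene_name "RNA5SP40"; '
    'gene_source "ensembl"; gene_biotype "rRNA";'
]
names = rrna_gene_names(example)
print(names)
```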

You can parse these files to build an rRNA reference FASTA, which will look like:

>RNA5SP40|ENSG00000252956
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT
>RNA5SP41|ENSG00000222952
GCCTACGGCCATACCATTCTGGATGCGTCTCAGAAGCTAAGCAGGGTCAGACCTGGCTGGTACTTGGATGGGAGTATATCAGCCACTGGGTGCTGTGGTGC
>RNA5SP42|ENSG00000201148

Then map your reads to this rRNA reference. The number of reads that align to it, divided by the total number of reads, gives you the rRNA percentage.
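
As a hedged example of the counting, here is a small sketch that reads SAM records from an alignment against the rRNA reference. It assumes the SAM output also contains the unmapped reads (not every aligner writes them to the same file) and skips secondary and supplementary alignments so each read is counted once. `rrna_percentage` and the toy records are made up for illustration.

```python
def rrna_percentage(sam_lines):
    """Percentage of reads with a primary alignment to the rRNA reference."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return 100.0 * mapped / total if total else 0.0

# Toy SAM records: one mapped read, three unmapped, one secondary alignment.
toy_sam = [
    "@SQ\tSN:RNA5SP40|ENSG00000252956\tLN:110",
    "r1\t0\tRNA5SP40|ENSG00000252956\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tRNA5SP40|ENSG00000252956\t20\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
percent = rrna_percentage(toy_sam)
print(percent)  # 25.0
```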

Another way is to map your reads to the whole genome, annotate the alignments with the GTF file, and compute the percentage of reads that fall in rRNA genes. Both approaches work fine, with only minor differences in the result.
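
For the genome-based route, the overlap test can be sketched as below. This is only an illustration under simplifying assumptions: reads are reduced to (chromosome, start, end) with 1-based inclusive coordinates like the GTF, strand is ignored, and `build_index` and `overlaps_rrna` are hypothetical helper names. Chromosome names must match between the alignments and the GTF (Ensembl uses `1`, UCSC uses `chr1`).

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(intervals):
    """Group (chrom, start, end) intervals by chromosome, then sort and merge them."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = [ivs[0]]
        for start, end in ivs[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:  # overlapping: extend the previous interval
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index

def overlaps_rrna(index, chrom, start, end):
    """True if [start, end] overlaps any rRNA interval on chrom."""
    ivs = index.get(chrom, [])
    # Last merged interval whose start is <= the read end.
    i = bisect_right(ivs, (end, float("inf"))) - 1
    return i >= 0 and ivs[i][1] >= start

# rRNA gene coordinates taken from the GTF lines above.
rrna = [("1", 9437669, 9437778), ("1", 13623184, 13623284)]
index = build_index(rrna)

# Toy alignments (made-up positions).
reads = [
    ("1", 9437700, 9437750),    # inside RNA5SP40
    ("1", 10000000, 10000050),  # no rRNA gene here
    ("2", 9437700, 9437750),    # different chromosome
    ("1", 13623280, 13623330),  # overlaps the end of RNA5SP41
]
hits = sum(overlaps_rrna(index, *r) for r in reads)
percent = 100.0 * hits / len(reads)
print(hits, percent)  # 2 50.0
```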

I hope this helps.


Dear Gjain,

I aligned my reads with TopHat against the human reference from UCSC. Then I downloaded the GTF file from UCSC and used it to annotate the reads.

How do I find the percentage of reads mapped to rRNA?

I am searching for the human 28S, 18S, 5.8S and 5S rRNA genes (RNA28S, RNA18S, RNA5-8S and RNA5S) and also the 12S and 16S mitochondrial rRNA genes (MT-RNR1 and MT-RNR2) in the human_genes.gtf file from UCSC.

I am unable to find these genes (RNA28S, RNA18S, RNA5-8S, RNA5S, MT-RNR1 and MT-RNR2) in the human_genes.gtf file. I am looking at the gene_id and gene_name fields in the GTF file.

How do I look up these genes in the GTF file or in any other resource?

Are they stored under different names?
