How to calculate distribution of mapped read over classes of RNA?
2
0
Entering edit mode
6.0 years ago
biplab ▴ 110

I am interested to find the percent of reads falls into mRNAs, snoRNAs, intergenic region etc. What is the best way to calculate percent of mapped reads falls into various classes of RNA as well as percent of reads that did not falls into any known features? Thank you so much

RNA-Seq ChIP-Seq • 3.6k views
ADD COMMENT
0
Entering edit mode

I believe RseQC has a script for that, see http://rseqc.sourceforge.net/#read-distribution-py

ADD REPLY
2
Entering edit mode
6.0 years ago

There are a few ways to do this, but first you should download the GENCODE 'Comprehensive gene annotation' reference transcriptome file for either GRCh38 / hg38 or GRCh37 / hg19. Obviously, the version will depend on the reference genome build to which you have aligned your reads.

When you download that, take a look at it in order to understand how the data is formatted. Specifically, the gene biotype and the HUGO gene names are encoded in the following attributes: gene_type gene_name

To see all of the different gene biotypes available in these, see here: Gene/Transcript Biotypes in GENCODE & Ensembl

--------------------------------

With this information under your belt, you can then count reads over each transcript:

1 - BEDTools

bedtools coverage -a gencode.v28.annotation.gtf -b MyReads.bam

Further information: http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

2 - featureCounts

featureCounts -a gencode.v28.annotation.gtf -o MyReads.counts.txt --largestOverlap -t exon -g gene_name -s 0 -p -T 8 MyReads.bam

Further information: http://subread.sourceforge.net/


In both cases, the original entries from the GTF should be included in the output. So, you will still require some scripting skills in order to parse out what you need.

Kevin

ADD COMMENT
0
Entering edit mode

Thank you so much. I used bedtools to calculate coverage over features. I got output like below for one feature in my gtf file.

VII ensembl exon    437467  437934  .   -   .   gene_id "YGL031C"; transcript_id "YGL031C"; exon_number "1"; gene_name "RPL24A"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "RPL24A"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "YGL031C.1"; 36787   468 468 1.0000000

The way to interpret the results is Read counts in for the features = 36787 Now I can use information in gene_biotype "protein_coding" to add all the read count to find number of reads falls into different type of features. In order to calculate number of reads does not mapped to any features, do I just need calculate by subtracting total numbers of reads falls into different features from number of mapped read?

ADD REPLY
1
Entering edit mode

I don't think that'd work unless each read mapped uniquely to one feature and one feature only. Is that the case? How would you say for sure from aggregate numbers?

ADD REPLY
0
Entering edit mode

That's true. Is there any option in bedtools to count only uniquely mapped reads? I did not find any. Then what is the best way to calculate percent of reads that aligned to genome but did not falls into any known features?

ADD REPLY
0
Entering edit mode

By default, I believe featureCounts does not count multi-mapped reads.

ADD REPLY
0
Entering edit mode
6.0 years ago
debitboro ▴ 260

If you get a high rate of multiple mapped reads and a low rate of uniquely mapped reads then the process to find the percentage of mapped reads by biotype will be slightly difficult. A classical way to do that is to perform the mapping and then counting the number of reads per features (featureCounts, HTSeq count), but those tools don't take into account the multiple mapped reads.

A laborious solution will be to perform the mapping against the complete genome, and against databases of each class too (against rRNA, ncRNA, tRNA, piRNA, ...). You have to download those databases and perform the mapping. After getting the different .sam files, then you can perform a kind of intersection between the sam file obtained from the mapping against the complete genome and the sam file obtained from the mapping against the database of a given class, then a simple wc -l linux command of the resulted file gives you approximately the number of reads mapped to the corresponding class. You can repeat this process for each biotype class.

ADD COMMENT

Login before adding your answer.

Traffic: 2998 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6