1 - BEDTools

Question

How to calculate distribution of mapped read over classes of RNA?

0

Entering edit mode

6.0 years ago

biplab ▴ 110

I am interested to find the percent of reads falls into mRNAs, snoRNAs, intergenic region etc. What is the best way to calculate percent of mapped reads falls into various classes of RNA as well as percent of reads that did not falls into any known features? Thank you so much

RNA-Seq ChIP-Seq • 3.6k views

ADD COMMENT • link updated 6.0 years ago by debitboro ▴ 260 • written 6.0 years ago by biplab ▴ 110

0

Entering edit mode

I believe RseQC has a script for that, see http://rseqc.sourceforge.net/#read-distribution-py

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

0

Entering edit mode

6.0 years ago

debitboro ▴ 260

If you get a high rate of multiple mapped reads and a low rate of uniquely mapped reads then the process to find the percentage of mapped reads by biotype will be slightly difficult. A classical way to do that is to perform the mapping and then counting the number of reads per features (featureCounts, HTSeq count), but those tools don't take into account the multiple mapped reads.

A laborious solution will be to perform the mapping against the complete genome, and against databases of each class too (against rRNA, ncRNA, tRNA, piRNA, ...). You have to download those databases and perform the mapping. After getting the different .sam files, then you can perform a kind of intersection between the sam file obtained from the mapping against the complete genome and the sam file obtained from the mapping against the database of a given class, then a simple wc -l linux command of the resulted file gives you approximately the number of reads mapped to the corresponding class. You can repeat this process for each biotype class.

ADD COMMENT • link 6.0 years ago by debitboro ▴ 260

GenoMax · Accepted Answer · 2018-04-19

2

Entering edit mode

6.0 years ago

Kevin Blighe 87k

There are a few ways to do this, but first you should download the GENCODE 'Comprehensive gene annotation' reference transcriptome file for either GRCh38 / hg38 or GRCh37 / hg19. Obviously, the version will depend on the reference genome build to which you have aligned your reads.

GRCh38 / hg38 (Ensembl 'current release' as of 19th April 2018): https://www.gencodegenes.org/releases/current.html
GRCh37 / hg19: https://www.gencodegenes.org/releases/grch37_mapped_releases.html

When you download that, take a look at it in order to understand how the data is formatted. Specifically, the gene biotype and the HUGO gene names are encoded in the following attributes: gene_type gene_name

To see all of the different gene biotypes available in these, see here: Gene/Transcript Biotypes in GENCODE & Ensembl

--------------------------------

With this information under your belt, you can then count reads over each transcript:

1 - BEDTools

bedtools coverage -a gencode.v28.annotation.gtf -b MyReads.bam

Further information: http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

2 - featureCounts

featureCounts -a gencode.v28.annotation.gtf -o MyReads.counts.txt --largestOverlap -t exon -g gene_name -s 0 -p -T 8 MyReads.bam

Further information: http://subread.sourceforge.net/

In both cases, the original entries from the GTF should be included in the output. So, you will still require some scripting skills in order to parse out what you need.

Kevin

ADD COMMENT • link 6.0 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you so much. I used bedtools to calculate coverage over features. I got output like below for one feature in my gtf file.

VII ensembl exon    437467  437934  .   -   .   gene_id "YGL031C"; transcript_id "YGL031C"; exon_number "1"; gene_name "RPL24A"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "RPL24A"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "YGL031C.1"; 36787   468 468 1.0000000

The way to interpret the results is Read counts in for the features = 36787 Now I can use information in gene_biotype "protein_coding" to add all the read count to find number of reads falls into different type of features. In order to calculate number of reads does not mapped to any features, do I just need calculate by subtracting total numbers of reads falls into different features from number of mapped read?

ADD REPLY • link updated 6.0 years ago by GenoMax 141k • written 6.0 years ago by biplab ▴ 110

1

Entering edit mode

I don't think that'd work unless each read mapped uniquely to one feature and one feature only. Is that the case? How would you say for sure from aggregate numbers?

ADD REPLY • link 6.0 years ago by Ram 43k

0

Entering edit mode

That's true. Is there any option in bedtools to count only uniquely mapped reads? I did not find any. Then what is the best way to calculate percent of reads that aligned to genome but did not falls into any known features?