Question

Existing software to calculate RNASeq read distribution by "gene_type" (Gencode)

2

Entering edit mode

7.1 years ago

mark.ebbert ▴ 20

Hi,

I'm looking at some total RNA-Seq data and I'd like to get a quick look at the read distribution across "gene_type", according to Gencode annotations. Briefly, Gencode GTF files (at least for hg38) include annotations for the "gene_type", which identifies whether the region is protein coding, lincRNA, miRNA, etc. I'd like to know what percentage of reads originate from each type. RSeQC and RNA-SeQC seem to offer similar features (e.g., by exon, intron, UTR, etc.), but I don't think either provides by "gene_type".

Anyone know of existing software that will do this? I was planning to modify RSeQC read_distribution.py, but that works from BED file annotations (i.e., doesn't have "gene_type" annotations). RNA-SeQC doesn't provide source code. I could de-compile it and modify, but I don't see a specific license.

I appreciate your help.

RNA-Seq gencode rna-seqc RSeQC • 2.3k views

ADD COMMENT • link updated 7.1 years ago by i.sudbery 19k • written 7.1 years ago by mark.ebbert ▴ 20

0

Entering edit mode

Not sure about existing software, but couldn't you just transform your GTF file into a BED file and use that with RSeQC?

ADD REPLY • link 7.1 years ago by James Ashmore ★ 3.4k

score 3 · Accepted Answer · 2017-03-26

3

Entering edit mode

7.1 years ago

i.sudbery 19k

You should be able to do this with CGAT.

Briefly you want to prepare a bed file where the name column of the bed is taken from the gene_type attribute of the GTF. You can do this with

cgat gtf2bed --set-name=gene_biotype -I gencode.gtf.gz -S gencode_gene_type.bed.gz

You can then use cgat bam_vs_bed to count the number of reads that fall into each bed category.

cgat bam_vs_bed -a my_bam.bam -b genecode_gene_type.bed.gz > biotype_counts.tsv

ADD COMMENT • link 7.1 years ago by i.sudbery 19k

0

Entering edit mode

I appreciate your help. I've been tinkering with cgat off and on and finally got it working. Thought I'd share some "gotchas":

This is a little one, but the first command should be cgat gff2bed --is-gtf.
You have to either use the Ensembl gtf, or you need to modify the gencode gtf to have a field named gene_biotype. gff2bed will not let you specify any arbitrary field ID (unless I missed something)
If I remember correctly, the second command (bam_vs_bed) should also specify the output using -S rather than redirecting stdout to a file.
bam_vs_bed should also specify --assume-sorted, if you know your .bed and .bam are sorted the same way. I wasted a bunch of time trying to figure out what was "wrong" with either my .bed or .bam because I didn't realize bam_vs_bed sorts your .bed by default using the standard unix sort method (lexigraphic rather than "natural" or historic karyotypic sorting).
I still haven't figured this one out, but somehow, the "total" number of alignments are fewer than those for "protein_coding" (by an order of magnitude), rather than being the sum.
I also need to remove secondary alignments from the .bam since there doesn't seem to be an option for bam_vs_bed to ignore secondary alignments. Perhaps this issue will resolve #5 as well, if "total" somehow ignores the secondary alignments.

Really appreciate your help!

ADD REPLY • link 7.0 years ago by mark.ebbert ▴ 20

0

Entering edit mode

@mark.ebbert - I am experiencing the same issue as you described in your #5. Just wondering, have you find the answer for it?

ADD REPLY • link 6.4 years ago by tushardave26 ▴ 10

0

Entering edit mode

I wonder if some protein coding things are getting counted multiple times because there are multiple transcripts? I.e. multiple entries in the bed covering the same part of the sequence?

ADD REPLY • link 6.4 years ago by i.sudbery 19k