Question: Existing software to calculate RNASeq read distribution by "gene_type" (Gencode)
1
gravatar for mark.ebbert
21 months ago by
mark.ebbert0 wrote:

Hi,

I'm looking at some total RNA-Seq data and I'd like to get a quick look at the read distribution across "gene_type", according to Gencode annotations. Briefly, Gencode GTF files (at least for hg38) include annotations for the "gene_type", which identifies whether the region is protein coding, lincRNA, miRNA, etc. I'd like to know what percentage of reads originate from each type. RSeQC and RNA-SeQC seem to offer similar features (e.g., by exon, intron, UTR, etc.), but I don't think either provides by "gene_type".

Anyone know of existing software that will do this? I was planning to modify RSeQC read_distribution.py, but that works from BED file annotations (i.e., doesn't have "gene_type" annotations). RNA-SeQC doesn't provide source code. I could de-compile it and modify, but I don't see a specific license.

I appreciate your help.

rseqc gencode rna-seq rna-seqc • 639 views
ADD COMMENTlink modified 21 months ago by i.sudbery2.8k • written 21 months ago by mark.ebbert0

Not sure about existing software, but couldn't you just transform your GTF file into a BED file and use that with RSeQC?

ADD REPLYlink written 21 months ago by James Ashmore2.5k
3
gravatar for i.sudbery
21 months ago by
i.sudbery2.8k
Sheffield, UK
i.sudbery2.8k wrote:

You should be able to do this with CGAT.

Briefly you want to prepare a bed file where the name column of the bed is taken from the gene_type attribute of the GTF. You can do this with

cgat gtf2bed --set-name=gene_biotype -I gencode.gtf.gz -S gencode_gene_type.bed.gz

You can then use cgat bam_vs_bed to count the number of reads that fall into each bed category.

cgat bam_vs_bed -a my_bam.bam -b genecode_gene_type.bed.gz > biotype_counts.tsv
ADD COMMENTlink written 21 months ago by i.sudbery2.8k

I appreciate your help. I've been tinkering with cgat off and on and finally got it working. Thought I'd share some "gotchas":

  1. This is a little one, but the first command should be cgat gff2bed --is-gtf.
  2. You have to either use the Ensembl gtf, or you need to modify the gencode gtf to have a field named gene_biotype. gff2bed will not let you specify any arbitrary field ID (unless I missed something)
  3. If I remember correctly, the second command (bam_vs_bed) should also specify the output using -S rather than redirecting stdout to a file.
  4. bam_vs_bed should also specify --assume-sorted, if you know your .bed and .bam are sorted the same way. I wasted a bunch of time trying to figure out what was "wrong" with either my .bed or .bam because I didn't realize bam_vs_bed sorts your .bed by default using the standard unix sort method (lexigraphic rather than "natural" or historic karyotypic sorting).
  5. I still haven't figured this one out, but somehow, the "total" number of alignments are fewer than those for "protein_coding" (by an order of magnitude), rather than being the sum.
  6. I also need to remove secondary alignments from the .bam since there doesn't seem to be an option for bam_vs_bed to ignore secondary alignments. Perhaps this issue will resolve #5 as well, if "total" somehow ignores the secondary alignments.

Really appreciate your help!

ADD REPLYlink written 20 months ago by mark.ebbert0

@mark.ebbert - I am experiencing the same issue as you described in your #5. Just wondering, have you find the answer for it?

ADD REPLYlink written 12 months ago by tushardave2610

I wonder if some protein coding things are getting counted multiple times because there are multiple transcripts? I.e. multiple entries in the bed covering the same part of the sequence?

ADD REPLYlink written 12 months ago by i.sudbery2.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1461 users visited in the last hour