Existing software to calculate RNASeq read distribution by "gene_type" (Gencode)
1
2
Entering edit mode
7.1 years ago
mark.ebbert ▴ 20

Hi,

I'm looking at some total RNA-Seq data and I'd like to get a quick look at the read distribution across "gene_type", according to Gencode annotations. Briefly, Gencode GTF files (at least for hg38) include annotations for the "gene_type", which identifies whether the region is protein coding, lincRNA, miRNA, etc. I'd like to know what percentage of reads originate from each type. RSeQC and RNA-SeQC seem to offer similar features (e.g., by exon, intron, UTR, etc.), but I don't think either provides by "gene_type".

Anyone know of existing software that will do this? I was planning to modify RSeQC read_distribution.py, but that works from BED file annotations (i.e., doesn't have "gene_type" annotations). RNA-SeQC doesn't provide source code. I could de-compile it and modify, but I don't see a specific license.

I appreciate your help.

RNA-Seq gencode rna-seqc RSeQC • 2.3k views
ADD COMMENT
0
Entering edit mode

Not sure about existing software, but couldn't you just transform your GTF file into a BED file and use that with RSeQC?

ADD REPLY
3
Entering edit mode
7.1 years ago

You should be able to do this with CGAT.

Briefly you want to prepare a bed file where the name column of the bed is taken from the gene_type attribute of the GTF. You can do this with

cgat gtf2bed --set-name=gene_biotype -I gencode.gtf.gz -S gencode_gene_type.bed.gz

You can then use cgat bam_vs_bed to count the number of reads that fall into each bed category.

cgat bam_vs_bed -a my_bam.bam -b genecode_gene_type.bed.gz > biotype_counts.tsv
ADD COMMENT
0
Entering edit mode

I appreciate your help. I've been tinkering with cgat off and on and finally got it working. Thought I'd share some "gotchas":

  1. This is a little one, but the first command should be cgat gff2bed --is-gtf.
  2. You have to either use the Ensembl gtf, or you need to modify the gencode gtf to have a field named gene_biotype. gff2bed will not let you specify any arbitrary field ID (unless I missed something)
  3. If I remember correctly, the second command (bam_vs_bed) should also specify the output using -S rather than redirecting stdout to a file.
  4. bam_vs_bed should also specify --assume-sorted, if you know your .bed and .bam are sorted the same way. I wasted a bunch of time trying to figure out what was "wrong" with either my .bed or .bam because I didn't realize bam_vs_bed sorts your .bed by default using the standard unix sort method (lexigraphic rather than "natural" or historic karyotypic sorting).
  5. I still haven't figured this one out, but somehow, the "total" number of alignments are fewer than those for "protein_coding" (by an order of magnitude), rather than being the sum.
  6. I also need to remove secondary alignments from the .bam since there doesn't seem to be an option for bam_vs_bed to ignore secondary alignments. Perhaps this issue will resolve #5 as well, if "total" somehow ignores the secondary alignments.

Really appreciate your help!

ADD REPLY
0
Entering edit mode

@mark.ebbert - I am experiencing the same issue as you described in your #5. Just wondering, have you find the answer for it?

ADD REPLY
0
Entering edit mode

I wonder if some protein coding things are getting counted multiple times because there are multiple transcripts? I.e. multiple entries in the bed covering the same part of the sequence?

ADD REPLY

Login before adding your answer.

Traffic: 2517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6