Tutorial: Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor
33
gravatar for Irsan
4.7 years ago by
Irsan6.6k
Amsterdam
Irsan6.6k wrote:

Hi all,

It took me a while to figure this out so I thought it might be useful to a few other people.

When you have used htseq-count on each of your RNA-seq'ed samples and have combined all of your samples in a read count matrix, you might want to turn the absolute read counts into FPKM at some time. Therefore you need the total amount (maybe TMM/quartile-normalized) of mapped reads for each sample (easy) and the total exonic length for each gene (tricky).

The problem is that a gene can have multiple transcripts where the exons of transcript 1 might be partially overlapping with the exons of transcript2. So simply adding up the length of all exons annotated for a specific gene overestimates the gene size while taking the length of just 1 of the transcripts underestimates. Using the GenomicFeatures package from Bioconductor you can easily handle this issue.

# First, import the GTF-file that you have also used as input for htseq-count
library(GenomicFeatures)
txdb <- makeTranscriptDbFromGFF("yourFile.gtf",format="gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})

It takes about half an hour if you do it for a GTF-file from ensGene table genome hg19 but it gets the job done :-)

ADD COMMENTlink modified 4 months ago by lhdcsu10 • written 4.7 years ago by Irsan6.6k
2

Thanks a lot, save my time writing a perl script. :)

ADD REPLYlink written 4.6 years ago by yyaobo20

Nice to hear that it's useful to others

ADD REPLYlink written 4.6 years ago by Irsan6.6k
1

Thank you for this script !

ADD REPLYlink written 4.4 years ago by variantcalling.biosolution10
1

Really Really tanks!!

ADD REPLYlink written 4.2 years ago by fusion.slope150

you're welcome. maybe you should make your post a comment instead of an answer

ADD REPLYlink written 4.2 years ago by Irsan6.6k
1

Thanks so much, this saves me a lot of time!

ADD REPLYlink written 4.2 years ago by mararie10

Can different genes that share the same exons cause similar problem?

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by PoGibas4.7k

Because you reduce overlapping exons for each individual gene seperately, they will contribute to both gene sizes like the way you want

ADD REPLYlink written 4.7 years ago by Irsan6.6k

How about total non-overlapping introns per gene lengths? Introns are grouped by transcripts.How could I group them by genes? Thanks!

ADD REPLYlink written 4.5 years ago by liux.bio140

Oef, I have to look that up in the GenomicRanges manual. Their is an introns() accessor, that might bring you somewhere

ADD REPLYlink written 4.5 years ago by Irsan6.6k

Hi Irsan, How can you collect exons per gene_name, instead of gene id? I am using a gtf file Thank you

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by hoangtv20

Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?

ADD REPLYlink written 4.4 years ago by Irsan6.6k

Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?

ADD REPLYlink written 4.4 years ago by Irsan6.6k

Yes, i am using annotation gtf file from Ensembl. I want to get gene symbol, instead of Ensembl gene ID. Do you know how? Thanks

ADD REPLYlink written 4.4 years ago by hoangtv20

Many ensembl gene ids do not have an official HGNC symbol (that is what you are looking for). I think the gene symbols are not even in the gtf-file. So you have to work with Ensembl gene IDs and at the end annotate them with gene symbol with for example bioconductor package org.Hs.eg.db

ADD REPLYlink written 4.4 years ago by Irsan6.6k

I used gtf file for human from here ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/. but the

makeTranscriptDbFromGFF gave me the following error,

   error in evaluating the argument 'x' in selecting a method for function 'unique': Error in subs[[i]] : subscript out of bounds

is there any suggestions ? probably a point to another gtf file ?

 

ADD REPLYlink written 4.1 years ago by Quak270
1

Download GTF-file from UCSC table browser

ADD REPLYlink written 4.1 years ago by Irsan6.6k

didin't work - I decided using http://bioconductor.org/packages/2.10/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html

ADD REPLYlink written 4.1 years ago by Quak270
Nice it works for you right now. I still wonder what the problem is with the ensembl 75 gtf file. Have you tried removing the first few comment lines (starting with #) in the gtf?
ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Irsan6.6k

I've tried this and it might not be the best idea. When creating a GTF-file, the UCSC table browser fills in the gene_ID field with the transcript_ID, no matter whether the UCSC, RefSeq or Ensembl track is used, which defeats the purpose of your script. I have no idea why UCSC would do this, seems like a very dangerous oversight on their part.

Thanks for the script though, it's saved me a lot of time.

ADD REPLYlink written 2.1 years ago by Matt Miossec320

The last part:

exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})

Seems to be taking extraordinarily long to run. Any faster method?

ADD REPLYlink written 4.0 years ago by essell2210
1

sum(width(reduce(exons.list.per.gene))) works directly and should be faster

ADD REPLYlink written 2.6 years ago by pgp.martin10
it takes 30 minutes for ensGene table with one 2.67 Ghz cpu
ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Irsan6.6k

Hi Irsan,

Thanks for a very helpful chunk of code! It solved two of my problems. First I got the length for calculating FPKM which I need later in my analysis, secondly it gave an idea how to quickly count the number of exons per gene which I also needed.

Just a quick comment. Instead of using lapply I simply run:

exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Which took only about 2 seconds and returned a data.frame which I could easly merge with the rest of my data.

ADD REPLYlink written 8 months ago by ggusiak0

Thank you so much for sharing the code! I am a beginner, just wondering if you could help with how to input the length information into the dds data. Many thanks!

ADD REPLYlink written 18 days ago by shenwei13760
5
gravatar for Kamil
3.2 years ago by
Kamil1.8k
Boston
Kamil1.8k wrote:

Here's another way to do it with Python instead of R:

ADD COMMENTlink written 3.2 years ago by Kamil1.8k

Thanks for this. Can I just triple-check that the code "coding_lengths.py" is returning the total length of all of the exons (I am trying to calculate RPKM, so I think this is perfect to calculate gene lengths for that purpose).

And then just as a side note, is this data available for other species? I'm on the FTP site for "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz" now, and I can seeing anything resembling a folder for other species? Similar for the Sanger FTP site, I can see human and mouse; my work involves human mouse and rat, I'm wondering if this script would work on all of these species, and if not, if someone could suggest an alternative method?

ADD REPLYlink written 17 months ago by Tom10
1

I updated the comment in count_bp with a visual example of how exons are counted.

You can find gene annotations in GTF format for many species here:

ftp://ftp.ensembl.org/pub/release-87/gtf/

Note that the version (87 at the time of this post) is continually updated over time.

Regarding converting Ensembl ID to Entrez ID or vice versa, I recommend you use Ensembl Biomart.

ADD REPLYlink modified 17 months ago • written 17 months ago by Kamil1.8k

Thanks. I actually tried to run the script using exactly as above and I ran into an error. In the attached picture: picture, in the purple section I have shown the first few files of the .gtf file, and in the orange section the gene2ensembl file, just to show that I've just downloaded these exactly as you said. Then the green section I have run the code.

The code is also here:

python coding_lengths.py -g gencode.v19.annotation.gtf -n gene2ensembl -o output

Wed Jan 18 11:30:46 2017 # Reading the Gencode annotation file: gencode.v19.annotation.gtf
Wed Jan 18 11:34:36 2017 # done in 230 s
coding_lengths.py:81: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  exon.sort(['seqname','start','end'], inplace=True)
Wed Jan 18 11:34:39 2017 # Calculating coding region (exonic) length for each gene...
Wed Jan 18 11:38:18 2017 # done in 219 s
Wed Jan 18 11:38:18 2017 # Reading NCBI mapping of Entrez GeneID to Ensembl gene identifier: gene2ensembl
Traceback (most recent call last):
  File "coding_lengths.py", line 163, in <module>
    main(args)
  File "coding_lengths.py", line 99, in main
    'Ensembl_protein_identifier'])
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4948)
  File "pandas/parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas/parser.c:7386)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

I know it is not your job to solve my system errors. My only question is; to a more experienced coder, does this look like a problem with the script itself, or does it look like a problem on my end (i.e. with the way packages are installed or something)?

Thanks

ADD REPLYlink modified 17 months ago • written 17 months ago by Tom10
1

My script expects to read a file compressed with gzip, but you provided a decompressed file.

Try running gzip gene2ensembl and then run the script again.

ADD REPLYlink written 17 months ago by Kamil1.8k

Thank you, you were so helpful, the script is absolutely fantastic.

ADD REPLYlink written 17 months ago by Tom10

Just a small note to everyone, the Ensembl GTF files are not the same as the NCBI Gencode GTF files. For this script, the only difference you seem to need to make is in the coding_lengths.py script, change line 73 to:

idx = (gc.feature == 'exon') & (gc.transcript_biotype == 'protein_coding')

(so you're changing gc.transcript_type to gc.transcript.biotype).

ADD REPLYlink written 17 months ago by Tom10
1
gravatar for lhdcsu
4 months ago by
lhdcsu10
lhdcsu10 wrote:

Recommend the GTFtools package at: http://www.genemine.org/gtftools.php.

Using the '-l' options, four types of gene lengths will be calculated. One of them is the non-overlapping exon length (merge all exons of multiple isoforms of the same gene).

ADD COMMENTlink modified 4 months ago • written 4 months ago by lhdcsu10
0
gravatar for yc1123
3.2 years ago by
yc11230
United States
yc11230 wrote:

Thanks for the scripts. But when I run the 2nd line "exons.list.per.gene", I get "error, RS-DBI driver: (expired SQLiteConnection)". Dose anyone know why ?

ADD COMMENTlink written 3.2 years ago by yc11230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1057 users visited in the last hour