Question

Question: Calculating each gene's mean of intron length

0

Entering edit mode

4.6 years ago

fabregas23213 • 0

Hi, I want to calculate the mean of intron length of each gene in mouse(mm10) or human(hg19 or hg38). The rationale of doing this is trying to categorize genes into 4 or 5 bins based on their intron length. As the previously published paper "https://www.ncbi.nlm.nih.gov/pubmed/21358643", in Table 1, they classified genes into 4 groups depending on their intron length(0-1kb, 1-10kb,10-100kb,>100kb) and count the gene number in each class. Could anyone give me any suggestion to do this?

genome sequencing • 2.9k views

ADD COMMENT • link updated 4.6 years ago by shawn.w.foley ★ 1.3k • written 4.6 years ago by fabregas23213 • 0

1

Entering edit mode

This question appears to be identical to: Calculating each gene's mean of intron length. It is not a good practice to post the same question using two different accounts.

ADD REPLY • link 4.6 years ago by GenoMax 141k

score 1 · Answer 1 · 2019-09-04

You can obtain a .bed file with gene-body coordinates and a .bed file with exon coordinates from UCSC or Ensembl. After that you can use bedtools subtract to subtract the exons from the gene bodies - this will give you an output file with all of the intron coordinates.

bedtools subtract -a gene_body.bed -b exons.bed -s > introns.bed

After that you can calculate the length and average the size across genes using whichever language you prefer. Personally, I'd perform this in R:

Input format (introns.bed):

chr1    11873   12226   DDX11L1.1   .   +
chr1    12612   12720   DDX11L1.1   .   +
chr1    13220   14408   DDX11L1.1   .   +

I'd run the code:

R --vanilla
inFile <- read.table('introns.bed',header=F,as.is=T,sep='\t')
colnames(inFile) <- c('chr','start','stop','gene','score','strand')
inFile$intronLength <- inFile$stop - inFile$start
meanIntronLength <- aggregate(inFile$intronLength,list(inFile$gene),mean)
write.table(meanIntronLength,'output_meanIntronLength.txt',row.names=F,quote=F,sep='\t')
q()

This will read in the file, calculate the intron length, then aggregate via gene name by taking the mean of the intron lengths. The output file will be a two-column file with gene name and intron length.

GenoMax · Answer 2 · 2019-09-04

using BioAlcidaeJdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

I'm treating each intron of each transcript as a distinct entity.

$ wget -O - -q "http://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz" | gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().forEach(G->println(G.getId()+"\t"+G.getGeneName()+"\t"+G.getTranscripts().stream().flatMap(T->T.getIntrons().stream()).mapToInt(I->I.getLengthOnReference()).average().orElse(-99) ));' 

ENSMUSG00000029836  Cbx3    2392.5
ENSMUSG00000029837  1700111E14Rik   11957.25
ENSMUSG00000029830  Svopl   3663.7948717948716
ENSMUSG00000005864  Cnga2   3509.5714285714284
ENSMUSG00000029831  Npvf    1478.0
ENSMUSG00000029832  Nfe2l3  8307.166666666666
ENSMUSG00000029833  Trim24  5817.84
ENSMUSG00000029838  Ptn 23120.14285714286
ENSMUSG00000030827  Fgf21   327.0
ENSMUSG00000030825  Hsd17b14    1523.090909090909
ENSMUSG00000030826  Bcat2   1856.8392857142858
ENSMUSG00000030823  9130019O22Rik   533.0
ENSMUSG00000030824  Nucb1   1569.7358490566037
ENSMUSG00000030822  Prr14   622.8695652173913
ENSMUSG00000005871  Apc 6496.068493150685
ENSMUSG00000029823  Luc7l2  5344.425
ENSMUSG00000029826  Zc3hav1 3994.5384615384614
ENSMUSG00000029821  Gsdme   6530.684210526316
ENSMUSG00000029822  Osbpl3  5717.4789915966385
ENSMUSG00000005873  Reep5   8005.071428571428
(...)