Question: Calculate the fraction of genome that is feature X
1
gravatar for nr23
2.5 years ago by
nr2380
nr2380 wrote:

I am trying to calculate enrichment of Structural variant breakpoints and SNV locations in genomic features (exon, intron, 5'UTR...) in a non-human genome. To know whether a feature is over/underrepresented in a SV/SNV dataset, I first need to know the fraction of the genome that is feature X. For example the total length of all exons across the genome.

Is there an existing resource that has this sort of information (for annotated genomes like Drosophila)? If not, is there an R package that can calculate this?

Secondly, it's not clear to me how this is calculated in the first place. If a gene has 10 transcript variants, how do we calculate that gene's total exotic region? Total exon length / 10? Add up all the exons from the longest transcript variant? Take the longest possible transcript length (longest exon1 + longest exon2)? I'd love to know how this is usually calculated.

I've had a look at GenomicFeatures, but it's not clear to me how I would go about doing this using this package. Any suggestions/advice would be very welcome.

R genome • 1.3k views
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by nr2380
5
gravatar for James Ashmore
2.5 years ago by
James Ashmore2.8k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.8k wrote:

Here's an example to calculate the percentage of exonic nucleotides in the Drosophila genome. For this answer you'll need the relevant TxDb package installed:

# Load relevant package
library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")

# Create a GRanges object containing the chromosome ranges
chromInfo <- seqinfo(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
chromRanges <- GRanges(seqnames(chromInfo), IRanges(start = 1, end = seqlengths(chromInfo)))

# Create a GRanges object containing the merged exon ranges
exonRanges <- exons(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
exonRanges <- reduce(exonRanges)

# Get the widths of the exon and chromosome ranges
exonWidths <- width(exonRanges)
chromWidths <- width(chromRanges)

# Calculate the total width of the exon and chromosome ranges
exonTotal <- sum(exonWidths)
chromTotal <- sum(chromWidths)

# Calculate the percentage of exonic nucleotides
pctExon <- (exonTotal / chromTotal) * 100

To retrieve different features you can use the various functions available in the GenomicFeatures package. To save you some time here are the main ones:

# Transcript database
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

# Transcript regions
transcripts <- transcripts(txdb)

# Exon regions
exons <- exons(txdb)

# Intron regions
introns <- intronsByTranscript(txdb)
introns <- unlist(introns)

# 5 prime UTR
utr5 <- fiveUTRsByTranscript(txdb)
utr5 <- unlist(utr5)

# 3 prime UTR
utr3 <- threeUTRsByTranscript(txdb)
utr3 <- unlist(utr3)

# CDS regions
cds <- cds(txdb)

# Gene regions
genes <- genes(txdb)

# Promoter regions
promoters <- promoters(txdb, upstream = 500, downstream = 100)

# Intergenic regions (Credit to Vince Buffalo and his book for this one)
tx <- reduce(transcripts)
strand(tx) <- "*"
intergenic <- setdiff(chromRanges, tx)
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by James Ashmore2.8k

This looks like it will work for exons but not other features (5'UTR, start codon, intron) etc. Is this the case?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by nr2380
1

I've updated my answer with methods to get different features.

ADD REPLYlink written 2.5 years ago by James Ashmore2.8k

Thanks for taking the time to spell this out!

ADD REPLYlink written 2.5 years ago by nr2380

I have a follow up question: I would need to calculate the fraction of several features c("intergenic","promoter","exon","intron"). How would you do this, if the fraction has sum up to 1. With your proposed method you only end up with fractions, which sum up with values above 1, because multiple features are assigned to one region. Thanks! Any advice appreciated!

ADD REPLYlink written 2.3 years ago by br.m0
0
gravatar for Alex Reynolds
2.5 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

To do this on the command line, get your exons into a sorted BED file, using Unix tools and BEDOPS convert2bed:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "exon"' \
    | convert2bed -i gff - \
    > exons.bed

Get your chromosome bounds into a sorted BED file, using Kent tools fetchChromSizes and BEDOPS sort-bed:

$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' | sort-bed - > hg38.bed

Use BEDOPS bedmap to map the sizes of merged (overlapping) exons over the hg38 genomic space. You can calculate the fraction (percentage/100) of space taken up by exons by dividing by the size of the genome build:

$ hg38Size=`awk '{s+=$3;}END{print s;}' hg38.bed`
$ bedops --merge exons.bed | bedmap --echo-ref-size - hg38.bed | awk -vhg38=${hg38Size} '{s+=$1;}END{print s/hg38;}'

This can be repeated for most other annotation categories.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 976 users visited in the last hour