Calculate the fraction of genome that is feature X
2
1
Entering edit mode
5.4 years ago
nanana ▴ 110

I am trying to calculate enrichment of Structural variant breakpoints and SNV locations in genomic features (exon, intron, 5'UTR...) in a non-human genome. To know whether a feature is over/underrepresented in a SV/SNV dataset, I first need to know the fraction of the genome that is feature X. For example the total length of all exons across the genome.

Is there an existing resource that has this sort of information (for annotated genomes like Drosophila)? If not, is there an R package that can calculate this?

Secondly, it's not clear to me how this is calculated in the first place. If a gene has 10 transcript variants, how do we calculate that gene's total exotic region? Total exon length / 10? Add up all the exons from the longest transcript variant? Take the longest possible transcript length (longest exon1 + longest exon2)? I'd love to know how this is usually calculated.

I've had a look at GenomicFeatures, but it's not clear to me how I would go about doing this using this package. Any suggestions/advice would be very welcome.

R genome • 3.0k views
6
Entering edit mode
5.4 years ago
James Ashmore ★ 3.3k

Here's an example to calculate the percentage of exonic nucleotides in the Drosophila genome. For this answer you'll need the relevant TxDb package installed:

# Load relevant package
library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")

# Create a GRanges object containing the chromosome ranges
chromInfo <- seqinfo(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
chromRanges <- GRanges(seqnames(chromInfo), IRanges(start = 1, end = seqlengths(chromInfo)))

# Create a GRanges object containing the merged exon ranges
exonRanges <- exons(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
exonRanges <- reduce(exonRanges)

# Get the widths of the exon and chromosome ranges
exonWidths <- width(exonRanges)
chromWidths <- width(chromRanges)

# Calculate the total width of the exon and chromosome ranges
exonTotal <- sum(exonWidths)
chromTotal <- sum(chromWidths)

# Calculate the percentage of exonic nucleotides
pctExon <- (exonTotal / chromTotal) * 100


To retrieve different features you can use the various functions available in the GenomicFeatures package. To save you some time here are the main ones:

# Transcript database
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

# Transcript regions
transcripts <- transcripts(txdb)

# Exon regions
exons <- exons(txdb)

# Intron regions
introns <- intronsByTranscript(txdb)
introns <- unlist(introns)

# 5 prime UTR
utr5 <- fiveUTRsByTranscript(txdb)
utr5 <- unlist(utr5)

# 3 prime UTR
utr3 <- threeUTRsByTranscript(txdb)
utr3 <- unlist(utr3)

# CDS regions
cds <- cds(txdb)

# Gene regions
genes <- genes(txdb)

# Promoter regions
promoters <- promoters(txdb, upstream = 500, downstream = 100)

# Intergenic regions (Credit to Vince Buffalo and his book for this one)
tx <- reduce(transcripts)
strand(tx) <- "*"
intergenic <- setdiff(chromRanges, tx)

0
Entering edit mode

This looks like it will work for exons but not other features (5'UTR, start codon, intron) etc. Is this the case?

1
Entering edit mode

I've updated my answer with methods to get different features.

0
Entering edit mode

Thanks for taking the time to spell this out!

0
Entering edit mode

I have a follow up question: I would need to calculate the fraction of several features c("intergenic","promoter","exon","intron"). How would you do this, if the fraction has sum up to 1. With your proposed method you only end up with fractions, which sum up with values above 1, because multiple features are assigned to one region. Thanks! Any advice appreciated!

0
Entering edit mode

Hello everyone,

I think that if we don't mind strands, we should tell reduce() using ignore.strand=TRUE:

exonRanges <- reduce(exonRanges, ignore.strand=TRUE)


If we don't do that we'll have a result slightly higher because of those positions where we have exons regions overlapped on strands.

Also, if what we want to know is the percentage of nucleotides in a chromosome or the entire genome, then we should consider both strands when merging exons ranges and double the length of the chromosome or genome (because we want to know the percentage of nucleotides covered by exons):

# We set ignore.strand=FALSE which is the default option
exonRanges <- reduce(exonRanges, ignore.strand=FALSE)
chr1_nucleotides <- seqlengths(txdb)["chr1"]*2
# We double the length of the genome to count all nucleotides
chromRanges <- GRanges(seqnames(chromInfo), IRanges(start = 1, end = seqlengths(chromInfo)*2))


The rest will be the same as commented.

You can check: https://github.com/tmontserrat/proportion_exon_regions

0
Entering edit mode
5.4 years ago

To do this on the command line, get your exons into a sorted BED file, using Unix tools and BEDOPS convert2bed:

$wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gff3.gz \ | gunzip --stdout - \ | awk '$3 == "exon"' \
| convert2bed -i gff - \
> exons.bed


Get your chromosome bounds into a sorted BED file, using Kent tools fetchChromSizes and BEDOPS sort-bed:

$fetchChromSizes hg38 | awk '{ print$1"\t0\t"$2; }' | sort-bed - > hg38.bed  Use BEDOPS bedmap to map the sizes of merged (overlapping) exons over the hg38 genomic space. You can calculate the fraction (percentage/100) of space taken up by exons by dividing by the size of the genome build: $ hg38Size=awk '{s+=$3;}END{print s;}' hg38.bed$ bedops --merge exons.bed | bedmap --echo-ref-size - hg38.bed | awk -vhg38=${hg38Size} '{s+=$1;}END{print s/hg38;}'


This can be repeated for most other annotation categories.