How to add gene annotation to a UCSC assembly hub?
9.1 years ago
Ian 6.0k

I am making my first UCSC assembly hub to display a genome that UCSC does not host or annotate. All is well except that I cannot work out how to add the gene annotation, which is currently in GFF3 format. I am aware that track hubs only accept the "big" file formats, so presumably a bigBed version of the annotation is required. Does anyone know of a handy method of converting GFF3 to BED/bigBed? I think BED12 is required to retain the distinction between CDS, UTRs and introns...

Thank you!

P.S. I have Googled this! The thread "Convert .Gff3 File To 12-Column .Bed File" is a help, but I would be interested to know if there have been developments since then.

EDIT: GTF or GFF2 can be used for gene annotation!
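
For readers unfamiliar with BED12, the following is a minimal sketch of why that format can keep CDS, UTRs and introns apart: columns 7-8 (thickStart/thickEnd) mark the coding region, and columns 10-12 (blockCount, blockSizes, blockStarts) describe the exons, so introns are the gaps between blocks and UTRs are the exonic parts outside the thick region. The transcript `tx1` and its coordinates are made up for illustration, as are the file names `example.bed` and `example.features.txt`.

```shell
# One hypothetical BED12 record: 3 exons, CDS from 1200 to 4800
printf 'chr1\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t3\t500,400,1000,\t0,1500,3000,\n' > example.bed

# Decode the CDS range and the exon coordinates from the BED12 columns
awk -F'\t' '{
    split($11, sz, ",")   # blockSizes: exon lengths
    split($12, st, ",")   # blockStarts: exon offsets relative to chromStart
    printf "%s CDS %d-%d\n", $4, $7, $8
    for (i = 1; i <= $10; i++) {
        s = $2 + st[i]
        printf "%s exon%d %d-%d\n", $4, i, s, s + sz[i]
    }
}' example.bed > example.features.txt

cat example.features.txt
```

In this example the first exon (1000-1500) starts before the CDS (1200), so the stretch 1000-1200 would be drawn as UTR, and the gap 1500-2500 between the first two exons would be drawn as an intron.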

9.1 years ago
Ian 6.0k

In the end I contacted the UCSC Genome Browser team directly. I got a helpful and detailed reply, which I have edited to make it clearer how the necessary programs can be obtained. The commands below were run on 64-bit Linux. IMPORTANT NOTE: my question specified GFF3 as the starting format for the annotation, but it turned out to be much easier to start from GTF / GFF2.

Fetch the programs

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitInfo
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/extractGtf.pl
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredCheck
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ixIxx
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
chmod +x gtfToGenePred genePredToBed genePredCheck bedToBigBed faToTwoBit twoBitInfo ixIxx

Download the Perl scripts from the UCSC Git repository

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=tree;f=src/hg/utils/automation

extractGtf.pl
ensemblInfo.pl

Method

# Create twoBit version of genome
faToTwoBit genome.fa genome.2bit

# Get chromosome length from twoBit genome
twoBitInfo genome.2bit stdout | sort -k2rn > genome.chrom.sizes

# Convert GTF annotation to genePred format
gtfToGenePred -infoOut=infoOut.txt -genePredExt genome.gtf genome.gp

# Check the genePred output is valid
genePredCheck genome.gp

# Convert genePred format to BED format
genePredToBed genome.gp stdout | sort -k1,1 -k2,2n > genome.bed

# Convert BED to bigBed
# extraIndex required for position/search
bedToBigBed -type=bed12 -extraIndex=name genome.bed genome.chrom.sizes genome.bb

# Required for indexing step
grep -v "^#" infoOut.txt | awk '{printf "%s\t%s,%s,%s,%s,%s\n", $1,$2,$3,$8,$9,$10}' > genome.nameIndex.txt

# Create index for position/search function in browser
ixIxx genome.nameIndex.txt genome.nameIndex.ix genome.nameIndex.ixx
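
To make the last two steps more concrete, here is a rough sketch of what the name index line looks like for a single transcript, and of how the resulting files might be referenced from the hub. The sample record (`tx1`, `gene1`, `ABC1` and so on) is invented, and the column order of infoOut.txt shown in the header is my assumption about what `gtfToGenePred -infoOut` writes. The track name, labels and the output file name `trackDb.example.txt` are placeholders; `searchIndex` and `searchTrix` are the trackDb settings that point the browser at the `-extraIndex` field and the ixIxx index. The paths assume genome.bb and genome.nameIndex.ix sit next to trackDb.txt, with genome.nameIndex.ixx alongside the .ix file.

```shell
# Hypothetical two-line infoOut file: header plus one transcript
printf '#transId\tgeneId\tsource\tchrom\tstart\tend\tstrand\tproteinId\tgeneName\ttranscriptName\n' > sample.infoOut.txt
printf 'tx1\tgene1\tensembl\tchr1\t1000\t5000\t+\tprot1\tABC1\tABC1-201\n' >> sample.infoOut.txt

# Same awk command as in the method: transcript ID, then the searchable names
grep -v "^#" sample.infoOut.txt \
  | awk '{printf "%s\t%s,%s,%s,%s,%s\n", $1,$2,$3,$8,$9,$10}' > sample.nameIndex.txt
cat sample.nameIndex.txt

# Example trackDb stanza (placeholder names) that uses the bigBed and the index
cat > trackDb.example.txt <<'EOF'
track genes
shortLabel Genes
longLabel Gene annotation converted from GTF
type bigBed 12
bigDataUrl genome.bb
searchIndex name
searchTrix genome.nameIndex.ix
visibility pack
EOF
```

The first column of the index file is the transcript ID, which matches the `name` field indexed by `-extraIndex=name`, so a search for a gene name can be resolved to the corresponding item in the bigBed.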