Obtaining UCSC tables via FTP and converting them to proper GFF3 via genePredToGtf?
12.2 years ago
user ▴ 950

My goal is to get a UCSC table in GTF format from the FTP database and convert it to GFF3 format. My strategy is to convert the UCSC table to GTF and then to GFF3 - unless there is an easier way?

Through UCSC's Tables website it's possible to obtain tables like the Ensembl table in GTF format. I'd like to get the same table via FTP "Annotation" download, but I do not see these tables there. For example for mm9: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/

The table I'm interested in is the Ensembl gene table, ensGene.txt, which is not provided in GTF format. How can it be converted to GTF?

I'd like a GTF file where the root nodes are the Ensembl gene entries (ENSMUSG...) and the transcripts are child nodes. I think this might be possible with genePredToGtf, but I cannot get it to work. The following command fails:

cat ensGene.txt | cut -f2-11 | genePredToGtf file stdin foo.gtf

Anyone know how this can be corrected?

Also, the UCSC table format seems to use 0-based start coordinates. Does genePredToGtf take care of making the resulting GTF 1-based?
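
To make the coordinate question concrete: genePred intervals are 0-based and end-exclusive, while GTF intervals are 1-based and inclusive, so a correct conversion adds 1 to every start and leaves every end unchanged. The awk sketch below shows only that arithmetic on an invented row; the identifier, the "example" source name and the reduced column layout are assumptions for illustration, not output of genePredToGtf.

```shell
# Invented row with five genePred-style columns:
# name, chrom, strand, txStart (0-based), txEnd (end-exclusive)
row=$(printf 'ENSMUST_EXAMPLE\tchr1\t+\t100\t500')

# Emit one GTF-style line: start becomes txStart + 1, end stays as txEnd.
gtf_line=$(printf '%s\n' "$row" | awk -F'\t' -v OFS='\t' '{
  print $2, "example", "transcript", $4 + 1, $5, ".", $3, ".", "transcript_id \"" $1 "\";"
}')

printf '%s\n' "$gtf_line"
```

Comparing the first transcript of a converted file against its row in the original table in the same way is a quick check that a converter handled the offset.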

Once I have a GTF, I can convert it to GFF3 format. Is there a utility that goes directly from genePred to GFF3, which would save this headache? I tried GBrowse's ucsc_genes2gff.pl (available here: http://search.cpan.org/~lds/GBrowse-2.52/bin/bed2gff3.pl) but it does not generate gene entries, only mRNA/children entries and ignores the ENSMUSG identifier.

Thanks

10.5 years ago
Kamil ★ 2.3k

I copied this answer directly from:

http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format


UCSC keeps gene structures in a text format with all information about a gene in one line: GenePred format.

Convert genePred to GTF with genePredToGtf, one of the kent command-line utilities.

At this time, genePredToGtf provides better GTF files than available from the table browser.

To use the kent commands with the public database server, create ".hg.conf" in your home directory:

$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral

And set the permissions:

$ chmod 600 .hg.conf
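
If the file does not exist yet, both steps can be done in one go. This is a minimal sketch that reuses the four settings shown above; the refusal to overwrite an existing file is an added safety choice, not part of the original instructions.

```shell
conf="$HOME/.hg.conf"

if [ -e "$conf" ]; then
  # Do not clobber settings that are already in place.
  echo "$conf already exists; leaving it untouched" >&2
else
  # Write the public-server settings quoted above.
  cat > "$conf" <<'EOF'
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
EOF
  # Restrict the file to the owner, as in the chmod step above.
  chmod 600 "$conf"
fi
```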

Now use the command to get GTF directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf:

$ genePredToGtf hg19 knownGene knownGene.gtf

Note the usage message from the command:

genePredToGtf - Convert genePred table or file to gtf.
usage:
    genePredToGtf database genePredTable output.gtf
If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr            Add 5UTR and 3UTR features.
   -honorCdsStat   Use cdsStartStat/cdsEndStat when defining start/end codon records.
   -source=src     Set source name to use.
   -addComments    Add comments before each set of transcript records.
                   Allows for easier visual inspection.
Note: Use refFlat or extended genePred table to include geneName
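
The 'file' mode from the usage message can also be applied to a table dump downloaded from the goldenPath database directory, which avoids the database connection. The sketch below assumes the dump has a leading bin column followed by extended genePred columns, with the gene ID in name2; the row, the identifiers and the file names are invented for illustration. Dropping only the first column with cut -f2- keeps name2, which the note above suggests is needed for gene-level names. The genePredToGtf call is guarded so the script still runs where the tool is not installed.

```shell
# Invented one-row stand-in for a decompressed ensGene.txt dump. Columns:
# bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
# exonStarts, exonEnds, score, name2, cdsStartStat, cdsEndStat, exonFrames
printf '585\tENSMUST_EXAMPLE\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t201,500,\t0\tENSMUSG_EXAMPLE\tcmpl\tcmpl\t0,0,\n' > ensGene.sample.txt

# Drop only the leading bin column; keep the rest, including name2.
cut -f2- ensGene.sample.txt > ensGene.sample.genePred

# Convert using the 'file' mode from the usage message, if the tool is available.
if command -v genePredToGtf >/dev/null 2>&1; then
  genePredToGtf file ensGene.sample.genePred ensGene.sample.gtf
else
  echo "genePredToGtf not found on PATH; skipping conversion" >&2
fi
```

With a real dump, the printf line would be replaced by the decompressed table file.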
12.2 years ago

You can use the Table Browser to get the Ensembl annotations in GTF format: http://genome.ucsc.edu/cgi-bin/hgTables?command=start

Just set the track to Ensembl Genes, select the ensGene table, and choose GTF as the output format.
