Question: Obtaining UCSC Tables Via FTP And Converting Them To Proper GFF3 Via genePredToGtf?
1
7.1 years ago by
user820
United States
user820 wrote:

My goal is to get a UCSC table in GTF format from the FTP database and convert it to GFF3 format. My strategy is to convert the UCSC table to GTF and then to GFF3 - unless there is an easier way?

Through UCSC's Tables website it's possible to obtain tables like the Ensembl table in GTF format. I'd like to get the same table via FTP "Annotation" download, but I do not see these tables there. For example for mm9:

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/

The table I'm interested in is the Ensembl gene table, ensGene.txt, which is not available there in GTF format. How can it be converted to GTF format?

I'd like a GTF file in which the root nodes are the Ensembl gene entries (ENSMUSG...) and the transcripts are child nodes. I think this might be possible with genePredToGtf, but I cannot get it to work. The following command fails:

cat ensGene.txt | cut -f2-11 | genePredToGtf file stdin foo.gtf

Anyone know how this can be corrected?

Also, the UCSC tables format seems to be a 0-based start. Does genePredToGtf take care of making the resulting GTF 1-based?

Once I have a GTF, I can convert it to GFF3 format. Is there a utility that goes directly from genePred to GFF3, which would save this headache? I tried GBrowse's ucsc_genes2gff.pl (available here http://search.cpan.org/~lds/GBrowse-2.52/bin/bed2gff3.pl) but it does not generate gene entries, only mRNA/children entries and ignores the ENSMUSG identifier.

thanks.

bioinformatics genes ucsc • 7.0k views
modified 5.5 years ago by Kamil • written 7.1 years ago by user820
3
5.5 years ago by
Kamil (1.9k)
Boston
Kamil wrote:

I copied this answer directly from:

http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format


UCSC keeps gene structures in a text format with all information about a gene in one line: GenePred format.

Convert genePred to GTF with genePredToGtf, one of the kent command-line utilities.

At this time, genePredToGtf provides better GTF files than those available from the Table Browser.

To use the kent commands with the public database server, create ".hg.conf" in your home directory:

$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral

And set the permissions:

$ chmod 600 $HOME/.hg.conf
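
The two steps above only display the file and set its permissions. A minimal sketch of creating it in one go, assuming no ~/.hg.conf exists yet and reusing the public-server values quoted above (the variable name conf is just for illustration):

```shell
# Location the kent tools are described above as reading their settings from.
conf="$HOME/.hg.conf"

# Only write the file if it is not already there, to avoid clobbering
# an existing configuration.
if [ ! -e "$conf" ]; then
    cat > "$conf" <<'EOF'
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
EOF
fi

# Restrict the file to its owner, as in the chmod step above.
chmod 600 "$conf"
```

If the file already exists, the sketch leaves its contents alone and only tightens the permissions.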

Now use the command to get GTF directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf:

$ genePredToGtf hg19 knownGene knownGene.gtf

Note the usage message from the command:

genePredToGtf - Convert genePred table or file to gtf.
usage:
    genePredToGtf database genePredTable output.gtf
If database is 'file' then track is interpreted as a file
rather than a table in database.
options:
   -utr            Add 5UTR and 3UTR features.
   -honorCdsStat   Use cdsStartStat/cdsEndStat when defining start/end codon records.
   -source=src     Set source name to use.
   -addComments    Add comments before each set of transcript records.
                   Allows for easier visual inspection.
Note: Use refFlat or extended genePred table to include geneName
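
For a table dump downloaded as a file, such as ensGene.txt, the "file" mode in the usage message can be sketched as below. It assumes the dump starts with a numeric bin column followed by the extended genePred columns, including name2, which holds the gene ID in ensGene. Only the first column is dropped, since cutting to columns 2-11 would also discard name2. The sample row and its ENSMUST_EXAMPLE / ENSMUSG_EXAMPLE identifiers are made up for illustration. The awk step is not part of genePredToGtf; it only illustrates the coordinate arithmetic (genePred starts are 0-based, GTF starts are 1-based, ends are unchanged), which to my understanding the tool applies itself.

```shell
# Made-up ensGene-style row (illustrative values only): a leading bin column
# followed by the 15 extended genePred columns.
sample="$(mktemp)"
printf '585\tENSMUST_EXAMPLE\tchr1\t+\t1000\t2000\t1100\t1901\t2\t1000,1500,\t1200,2000,\t0\tENSMUSG_EXAMPLE\tcmpl\tcmpl\t0,1,\n' > "$sample"

# Drop only the bin column; name2 (the gene ID) stays in column 12.
genepred="$(mktemp)"
cut -f 2- "$sample" > "$genepred"

# Coordinate arithmetic for the first exon: 0-based start + 1, end unchanged.
first_exon="$(awk -F'\t' '{ split($9, s, ","); split($10, e, ","); printf "%d-%d\n", s[1] + 1, e[1] }' "$genepred")"
echo "first exon in 1-based coordinates: $first_exon"

# Run the conversion only if the kent utility is installed and on PATH.
if command -v genePredToGtf >/dev/null 2>&1; then
    genePredToGtf file "$genepred" ensGene.gtf
fi
```

With a real dump, replace the sample file with the downloaded ensGene.txt. Because name2 is kept, the resulting GTF should carry the ENSMUSG identifier as gene_id and the ENSMUST identifier as transcript_id.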
1
7.1 years ago by
Damian Kao (15k)
USA
Damian Kao wrote:

You can use the Table Browser to get the Ensembl annotations in GTF format: http://genome.ucsc.edu/cgi-bin/hgTables?command=start

Just set the track to Ensembl Genes, use the ensGene table, and set the output format to GTF.
