Hi All, I have a list of AGI loci and want to get their gene structures in GenBank or EMBL format. Since TAIR only provides GFF3, I want either a method to convert GFF3 to GenBank/EMBL or a method to get the NCBI accession numbers for those AGI loci. I have found one file, ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_NCBI_GENEID_mapping, at TAIR, but it doesn't seem entirely right (or I didn't fully understand it).
The file that you describe contains two columns: the first is the NCBI Entrez Gene database ID and the second is the TAIR locus tag. A Gene ID is not the same as a GenBank accession number, but it will get you there.
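As a rough sketch of how you might pull the Gene IDs for your loci out of that file: the snippet below assumes the columns are whitespace-separated and that any header line starts with something non-numeric (I haven't checked the exact layout of the TAIR file, so adjust as needed). The function name `read_geneid_mapping` and the made-up locus `AT9G99999` are just for illustration.

```python
import io

def read_geneid_mapping(handle):
    """Return {locus_tag: gene_id} from a two-column mapping file.

    Assumes column 1 is the Entrez Gene ID and column 2 the TAIR locus,
    separated by whitespace (an assumption about the file layout).
    """
    mapping = {}
    for line in handle:
        fields = line.split()
        # Skip blank lines, headers, or anything that is not "GeneID locus".
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        gene_id, locus = fields[0], fields[1].upper()
        mapping[locus] = gene_id
    return mapping

# Tiny in-memory stand-in for the real file; the Gene ID / locus pair
# is the one quoted further down in this answer.
example = io.StringIO("GeneID\tLocus\n2745418\tAT2G01175\n")
mapping = read_geneid_mapping(example)

wanted = ["AT2G01175", "AT9G99999"]  # second locus is made up, to show a miss
gene_ids = [mapping[locus] for locus in wanted if locus in mapping]
print("\n".join(gene_ids))
```

With the real file you would pass `open("TAIR9_NCBI_GENEID_mapping")` instead of the `StringIO` object, and the printed list is what you would paste or upload in the BioMart step.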
There may well be a file, at the Arabidopsis site or elsewhere, which links Gene ID to GenBank accession. If not, you can use BioMart, something like this:
- Click MARTVIEW (top menu)
- Choose "ENSEMBL PLANT 6 (EBI UK)" as database
- Choose "Arabidopsis thaliana genes (TAIR9)" as dataset
- Click "Filters" (left menu); expand GENE; check "ID list limit" and choose "Entrez Gene ID(s)"
- Either paste or upload Gene IDs (column 1 in your file)
- Click "Attributes" (left menu); expand EXTERNAL; check "RefSeq DNA ID"
- Click "Results" (top left menu)
After some time, this should return results that you can download as plain ASCII text. For example, using Gene ID 2745418 (AT2G01175), I get back "NM_201659".
You can now take your new list of accessions off to Batch Entrez, upload them and retrieve the results in GenBank format.
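If you would rather script this last step than use the Batch Entrez web form, NCBI's E-utilities `efetch` endpoint can return the same GenBank flat files. This is only a minimal sketch: the helper names `efetch_genbank_url` and `fetch_genbank` are mine, it uses the `nuccore` database on the assumption that your accessions are nucleotide records like the one above, and for long lists you would want to send the IDs in batches and follow NCBI's usage guidelines.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_genbank_url(accessions):
    """Build an efetch URL asking for GenBank flat-file text."""
    params = {
        "db": "nuccore",              # nucleotide database (assumed here)
        "id": ",".join(accessions),   # comma-separated accession list
        "rettype": "gb",              # GenBank format
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)

def fetch_genbank(accessions):
    """Download the records; needs network access to NCBI."""
    with urlopen(efetch_genbank_url(accessions)) as response:
        return response.read().decode("utf-8")

# Only the URL is built here, so nothing is downloaded until you call fetch_genbank().
url = efetch_genbank_url(["NM_201659"])
print(url)
```

To save the records you could then do something like `open("records.gb", "w").write(fetch_genbank(my_accessions))`, where `my_accessions` is the list you got back from BioMart.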
This is just one solution (relying on both BioMart and Batch Entrez working well); there are plenty of other potential ways to convert between IDs, including programmatic methods.
To put your question a bit more generally, what you are asking for is a way to make a GenBank (or EMBL) file from a GFF file and its associated FASTA sequence file. Solutions to that can be found here: http://biostar.stackexchange.com/questions/2494/gff3-fasta-to-genbank-augustus-training-set