Determining LOC coordinate from GFF3 start column
2.7 years ago
Simon • 0

Hi all, total noob question:

I have a GFF3 file of a pepper (C. annuum) plant genome that looks like this:

seqid   src     type    start   end     score   strand  phase   attributes
chr01   PROTEIN gene    29119   37617   .       -       .       ID=CA.PGAv.1.6.scaffold567.122
chr01   PROTEIN mRNA    29119   37617   .       -       .       ID=TC.CA.PGAv.1.6.scaffold567.122;Parent=CA.PGAv.1.6.scaffold567.122
chr01   PROTEIN exon    29119   29457   .       -       0       Parent=TC.CA.PGAv.1.6.scaffold567.122
...
chr02   ABINITI gene    157637  159805  0.22    -       .       ID=CA.PGAv.1.6.scaffold1545.2
...
chr04   ISGAP   gene    11689   14256   1096    +       .       ID=CA.PGAv.1.6.scaffold638.93
...

I am trying to cross-reference the features in the GFF3 with the genes from this paper, which identifies the locations with numbers such as "LOC107867643", "LOC107868281", etc. I'm assuming these are absolute coordinates in their aligned sequence.

Going by the spec, and because chr04 for example has a start less than chr02, I'm assuming the "start" column is relative to the beginning of its seqid.

My question is: how then do I translate the chr02 start 157637, for example, into an absolute coordinate I can match against the LOC numbers published in the paper?

For example, if the last feature for chr01 has an "end" of 309042759 and the first feature for chr02 has a "start" of 157637, can I just do 309042759 + 157637 = 309200396 to get the whole-genome coordinate for that feature?

I found this Biostars question, which noted that if the chromosome were listed in the file it would start at 1, but I do not have any such entries in this file.

Any help would be great, thanks.


because chr04 for example has a start less than chr02

Numbering restarts at 1 for each chromosome, so that is not a problem. It looks like chr04 simply has a feature annotated closer to its start than chr02 does.
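
To check this against the sample lines from the question, here is a minimal Python sketch. The helper name `first_gene_start_per_seqid` is made up for illustration, and it assumes a tab-separated, nine-column GFF3. It records the smallest gene start per seqid; each chromosome has its own coordinate axis, so these values are not comparable across seqids.

```python
# Sample gene lines copied from the question, written with explicit tabs.
SAMPLE = """\
chr01\tPROTEIN\tgene\t29119\t37617\t.\t-\t.\tID=CA.PGAv.1.6.scaffold567.122
chr02\tABINITI\tgene\t157637\t159805\t0.22\t-\t.\tID=CA.PGAv.1.6.scaffold1545.2
chr04\tISGAP\tgene\t11689\t14256\t1096\t+\t.\tID=CA.PGAv.1.6.scaffold638.93
"""


def first_gene_start_per_seqid(gff3_text):
    """Return {seqid: smallest gene start} (hypothetical helper)."""
    starts = {}
    for line in gff3_text.splitlines():
        # Skip blank lines and comments/directives such as ##gff-version.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A feature line has exactly nine columns; only look at genes.
        if len(fields) != 9 or fields[2] != "gene":
            continue
        seqid, start = fields[0], int(fields[3])
        starts[seqid] = min(start, starts.get(seqid, start))
    return starts


first_starts = first_gene_start_per_seqid(SAMPLE)
print(first_starts)
```

If a single genome-wide coordinate were ever needed, the offset for a chromosome would have to come from the full lengths of the chromosomes before it (for example from the assembly itself), not from the "end" of the last annotated feature, since features rarely extend to the very end of a chromosome.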


Thanks for the reply, Max. So when a paper quotes a gene at "LOC107867643", for example, is that usually the coordinate from the beginning of the entire alignment, i.e. position 1 of chr01? Or is it from the beginning of a chromosome, so that I also need to know which chromosome it is on?


AFAIK LOC IDs at NCBI have no relation to the chromosome or to any coordinate. They are placeholder symbols assigned to genes that do not have a published symbol yet, typically genes of unknown function.

When a published symbol is not available, and orthologs have not yet been determined, Gene will provide a symbol that is constructed as 'LOC' + the GeneID. This is not retained when a replacement symbol has been identified, although queries by the LOC term are still supported. In other words, a record with the symbol LOC12345 is equivalent to GeneID = 12345. So if the symbol changes, the record can still be retrieved on the web using LOC12345 as a query, or from any file using GeneID = 12345.
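
A minimal Python sketch of that equivalence, under stated assumptions: the function names are made up for illustration, and the attributes string is an invented example in the style of NCBI annotation files, which carry the GeneID as a `Dbxref=GeneID:...` entry. The file in the question uses its own IDs (CA.PGAv...) and shows no such entry, so matching against it would need a different annotation or an ID mapping.

```python
import re


def loc_to_gene_id(symbol):
    """Convert a placeholder symbol such as 'LOC107867643' to its integer GeneID."""
    match = re.fullmatch(r"LOC(\d+)", symbol)
    if match is None:
        raise ValueError(f"not a LOC symbol: {symbol!r}")
    return int(match.group(1))


def gene_ids_in_attributes(attributes):
    """Collect GeneIDs from a GFF3 attributes column (column 9).

    Assumes NCBI-style cross-references, e.g. 'Dbxref=GeneID:12345'.
    """
    ids = set()
    for pair in attributes.split(";"):
        key, _, value = pair.partition("=")
        if key != "Dbxref":
            continue
        # Dbxref may hold several comma-separated 'database:accession' entries.
        for xref in value.split(","):
            db, _, accession = xref.partition(":")
            if db == "GeneID" and accession.isdigit():
                ids.add(int(accession))
    return ids


# Invented attributes column for illustration only, not taken from a real file.
example_attributes = "ID=gene0;Dbxref=GeneID:107867643;Name=LOC107867643"
wanted = loc_to_gene_id("LOC107867643")
print(wanted in gene_ids_in_attributes(example_attributes))
```

The point is that the lookup is by identifier, not by position: the chromosome and start/end only come into play after the matching gene record has been found.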


Ahhh, I don't know how I missed that. Thanks Max, this is what I needed to know.
