Question: GFF3 to 12-column BED
New2R wrote (2.6 years ago):

I am trying to get RefSeq annotation (txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds, etc.) from NCBI in the same format that can be downloaded from the UCSC Table Browser. The UCSC data is in a tabular format (12-column BED) that is easy to parse and manipulate. The NCBI data is in GFF3, a format I am not familiar with, and I do not know how to extract the annotation from it.
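For reference, GFF3 is a 9-column tab-separated format (seqid, source, type, start, end, score, strand, phase, attributes) with 1-based inclusive coordinates, and coding regions appear as rows of type CDS. A minimal sketch of collecting the coding extent per transcript is below. It assumes every CDS row carries a Parent attribute naming its transcript; the example lines and the `rna1` identifier are made up for illustration, and the function names are hypothetical.

```python
def parse_attributes(text):
    """Parse a GFF3 column-9 string such as 'ID=cds1;Parent=rna1' into a dict."""
    attrs = {}
    for pair in text.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def cds_extents(gff3_lines):
    """Return {parent_id: (cdsStart, cdsEnd)} in BED-style 0-based, half-open coordinates."""
    extents = {}
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])  # GFF3: 1-based, inclusive
        parents = parse_attributes(cols[8]).get("Parent")
        if parents is None:
            continue  # assumption: CDS rows without a Parent are ignored
        bed_start = start - 1  # BED/UCSC: 0-based start, end unchanged
        for parent in parents.split(","):  # Parent may list several IDs
            if parent in extents:
                lo, hi = extents[parent]
                extents[parent] = (min(lo, bed_start), max(hi, end))
            else:
                extents[parent] = (bed_start, end)
    return extents


# Hypothetical example rows, not taken from a real NCBI file.
EXAMPLE_GFF3 = [
    "##gff-version 3",
    "NC_000001.10\tRefSeq\texon\t101\t300\t.\t+\t.\tID=exon1;Parent=rna1",
    "NC_000001.10\tRefSeq\tCDS\t201\t300\t.\t+\t0\tID=cds1;Parent=rna1",
    "NC_000001.10\tRefSeq\tCDS\t501\t650\t.\t+\t2\tID=cds1;Parent=rna1",
]

print(cds_extents(EXAMPLE_GFF3))  # {'rna1': (200, 650)}
```

The start is shifted by one because BED and the UCSC tables use 0-based starts, while the end value is the same number in both conventions.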

Galaxy is able to convert GFF3 to a 12-column BED file, shared here: Galaxy_refSeq_GFF3_to_BED. The 12-column BED output from Galaxy is similar to the UCSC format, but the data in some columns are not clear.

The columns in the GFF3 -> BED output are:

  1. Chromosome name (NC_00000) - similar to UCSC hg19.refGene.chrom (can easily be converted to "chr" format)
  2. Start - same as UCSC hg19.refGene.txStart
  3. End - same as UCSC hg19.refGene.txEnd
  4. NM number with version - similar to UCSC hg19.refGene.name, which lacks the version number
  5. Score - similar to UCSC hg19.refGene.score
  6. Strand - same as UCSC hg19.refGene.strand
  7. Start - not the same as UCSC hg19.refGene.cdsStart (but same as values in GFF3 column 2)
  8. End - not the same as UCSC hg19.refGene.cdsEnd (but same as values in GFF3 column 3)
  9. Unknown column
  10. Number of exons - similar to UCSC hg19.refGene.exonCount (but seems to be UCSC hg19.refGene.exonCount + 1)
  11. Exon sizes in bp - the size of each exon matches that in UCSC, but there is one extra number at the front
  12. Unknown set of numbers - there are UCSC hg19.refGene.exonCount + 1 of them, i.e., if a gene has 79 exons, there are 80 numbers in this column (as in column 11).
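For comparison, the standard UCSC BED12 column names are chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes and blockStarts. A minimal sketch of reading one line into named fields follows, assuming the Galaxy output uses this standard layout (which the extra numbers described above put in doubt). The example line and the `parse_bed12` helper are hypothetical.

```python
from typing import List, NamedTuple


class Bed12(NamedTuple):
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: str
    strand: str
    thick_start: int   # column 7
    thick_end: int     # column 8
    item_rgb: str      # column 9
    block_count: int   # column 10
    block_sizes: List[int]   # column 11
    block_starts: List[int]  # column 12, offsets relative to chrom_start


def parse_int_list(text):
    """Parse '50,100,200,' into [50, 100, 200]; a trailing comma is tolerated."""
    return [int(x) for x in text.split(",") if x != ""]


def parse_bed12(line):
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError(f"expected 12 columns, found {len(f)}")
    return Bed12(f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
                 int(f[6]), int(f[7]), f[8], int(f[9]),
                 parse_int_list(f[10]), parse_int_list(f[11]))


# Made-up line for illustration only.
EXAMPLE = "NC_000023.10\t100\t1000\tNM_000000.1\t0\t-\t100\t1000\t0\t3\t50,100,200,\t0,300,700,"

rec = parse_bed12(EXAMPLE)
print(rec.block_count, rec.block_sizes, rec.block_starts)
# 3 [50, 100, 200] [0, 300, 700]
```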

My questions about the following columns (numbered as in the list above) are:

  7. Is cdsStart available in GFF3?
  8. Is cdsEnd available in GFF3?
  9. What is coded here?
  10. Why are there always hg19.refGene.exonCount + 1 numbers? For example, DMD, which has 79 exons, shows up with 80 in the GFF3 conversion.
  11. The exon sizes match those in UCSC, except for the presence of a large first number. What is this first number?
  12. A decreasing series of numbers is found. It is not clear what this is. If we know the size of each exon (column 11), then the only items needed to define the exon structure are the start positions, but I can't make sense of the numbers I am seeing in column 12.
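In the standard BED12 definition, column 12 (blockStarts) holds each block's start as an offset from column 2 (chromStart), so sizes plus offsets fully define the exon structure. A minimal sketch of turning the two lists into absolute coordinates and UCSC refGene-style exonStarts/exonEnds strings is below. It assumes a well-formed standard BED12 record; the numbers and function names are made up for illustration.

```python
def blocks_to_exons(chrom_start, block_sizes, block_starts):
    """Return absolute (start, end) pairs, 0-based and half-open."""
    if len(block_sizes) != len(block_starts):
        raise ValueError("blockSizes and blockStarts differ in length")
    exons = []
    for size, offset in zip(block_sizes, block_starts):
        start = chrom_start + offset  # offsets are relative to chromStart
        exons.append((start, start + size))
    return exons


def to_refgene_lists(exons):
    """Format exons as comma-terminated exonStarts / exonEnds strings."""
    exon_starts = "".join(f"{start}," for start, _ in exons)
    exon_ends = "".join(f"{end}," for _, end in exons)
    return exon_starts, exon_ends


# Hypothetical record: chromStart=100, three blocks.
exons = blocks_to_exons(100, [50, 100, 200], [0, 300, 700])
print(exons)                    # [(100, 150), (400, 500), (800, 1000)]
print(to_refgene_lists(exons))  # ('100,400,800,', '150,500,1000,')
```

If the converted file really has one more number than there are exons, the length check or the resulting coordinates should make that visible.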

Thanks in advance for any help.

Tags: bed, gff3
t_pod wrote (21 months ago):

Hello, I have a similar issue. As the annotation file was not available from UCSC, I downloaded the GFF3 file from NCBI and converted it to a BED12 file via Galaxy.

However, I am finding many discrepancies in my converted BED12 file when I compare it to a "correct" BED12 file from UCSC:

The 5th and 9th columns are always "0" (that might not be an issue).

In total, 50% of the lines have only the first 6 columns filled, and they are just named "CDS" in the 4th column without further specification (should I discard all of them?).

Sometimes column 11 displays n+1 items, where the additional item is "000" and n = block count (= column 10). I expect columns 11 and 12 to have the same number of items.
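A quick way to see how many lines are affected is to classify each line by these symptoms. The sketch below is a minimal check, assuming tab-separated input and the standard BED12 meaning of columns 10 to 12; `check_bed12_line` and the sample lines are hypothetical.

```python
def split_list(text):
    """Split a comma-separated column, ignoring a trailing comma."""
    return [x for x in text.split(",") if x != ""]


def check_bed12_line(line):
    """Return 'ok' or a short description of the first problem found."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        return "fewer than 12 columns"
    try:
        block_count = int(fields[9])
    except ValueError:
        return "blockCount is not an integer"
    n_sizes = len(split_list(fields[10]))
    n_starts = len(split_list(fields[11]))
    if n_sizes != block_count or n_starts != block_count:
        return f"blockCount={block_count} but {n_sizes} sizes and {n_starts} starts"
    return "ok"


# Made-up lines that mimic the three situations described above.
GOOD = "chr1\t100\t1000\tNM_000000.1\t0\t+\t100\t1000\t0\t3\t50,100,200\t0,300,700"
SHORT = "chr1\t100\t1000\tCDS\t0\t+"
EXTRA = "chr1\t100\t1000\tNM_000000.1\t0\t+\t100\t1000\t0\t3\t000,50,100,200\t0,300,700"

for line in (GOOD, SHORT, EXTRA):
    print(check_bed12_line(line))
# ok
# fewer than 12 columns
# blockCount=3 but 4 sizes and 3 starts
```

Counting the results over the whole file (for example with `collections.Counter`) would show whether the 6-column "CDS" lines and the n+1 lines are separate groups.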

Any advice?

Thanks
