Arabidopsis GFF files
9.4 years ago
biolab ★ 1.4k

Dear all,

Does anyone who works on Arabidopsis know what the differences are between TAIR10_gff3_genes.gff and Spliced_Junctions_clustered.gff? Both were downloaded from ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/

I found that the former has 217,183 exon records, while the latter has 406,034 exon lines. Is the first experimentally confirmed and the second predicted in silico?

Thanks for your advice!
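
For anyone wanting to reproduce exon counts like the ones quoted above, here is a minimal sketch. It assumes both files follow the usual nine-column, tab-separated GFF layout with the feature type in column 3; the function name and the example records are made up for illustration and are not copied from the TAIR files.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF records by feature type (column 3).

    Comment lines, blank lines and lines with fewer than nine
    tab-separated columns are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed nine-column record
        counts[fields[2]] += 1
    return counts


# Illustrative records only (hypothetical IDs and coordinates).
example = [
    "##gff-version 3",
    "Chr1\tdemo\tgene\t100\t400\t.\t+\t.\tID=geneA",
    "Chr1\tdemo\texon\t100\t200\t.\t+\t.\tParent=geneA.1",
    "Chr1\tdemo\texon\t301\t400\t.\t+\t.\tParent=geneA.1",
]
print(count_feature_types(example)["exon"])
```

On a real file the same function can be fed an open file handle, e.g. `with open("TAIR10_gff3_genes.gff") as fh: print(count_feature_types(fh))`.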

Arabidopsis gff • 5.3k views
8.6 years ago
zoo741 ▴ 70

Although this question was asked a while ago, I thought it was worth giving an update. Both of these GFF3 files are based on experimental evidence, but they were produced by different routes. The details are summarized below.

Spliced_Junctions_Clustered.gff

We used high-throughput RNA sequencing data (RNA-seq data) from the Ecker and Mockler labs, and used alignment tools called Tophat and Supersplat to align these sequences to the Arabidopsis genome, resulting in 203,000 clustered spliced RNA-seq junctions.

Cited from ref1

TAIR10_gff3_genes.gff

This annotation used RNA-seq and proteomic datasets, gene models provided by NCBI, and manually curated gene models from Swiss-Prot. It went through multiple steps, including mapping, assembly, and gene model construction (ref2).

In summary, one could view Spliced_Junctions_Clustered.gff as close to the direct output of the spliced aligners on the RNA-seq data sets (for TopHat, likely junctions.bed). TAIR10_gff3_genes.gff, by contrast, used the data sets described above to update the gene models. It covers the junctions of all annotated transcripts, many of which may not be detected in the tissues used to generate Spliced_Junctions_Clustered.gff. Conversely, there should be a number of junctions unique to Spliced_Junctions_Clustered.gff, because junctions that did not assemble into a transcript would not have made it through the TAIR10_gff3_genes.gff pipeline. As for the raw counts in the question, the 406,034 exon lines in the junction file are almost exactly twice the roughly 203,000 clustered junctions, which would be consistent with each junction being recorded as a pair of flanking exon lines, so the exon line counts of the two files are not directly comparable.
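
To make the "junctions unique to each file" point concrete, here is a rough sketch of how annotated junctions could be derived from a gene-model GFF3 and compared with an RNA-seq junction set. It assumes exon records carry a `Parent=` attribute naming their transcript; the function name, the example records and the `rnaseq` set are hypothetical, and no assumption is made about the layout of Spliced_Junctions_Clustered.gff, which would need its own parser.

```python
from collections import defaultdict


def junctions_from_exons(lines):
    """Derive intron coordinates from exon records grouped by Parent.

    Returns a set of (seqid, intron_start, intron_end, strand) tuples,
    using 1-based inclusive coordinates as in GFF.
    """
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in f[8].split(";") if "=" in kv)
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[(f[0], f[6], parent)].append((int(f[3]), int(f[4])))

    junctions = set()
    for (seqid, strand, _parent), spans in exons.items():
        spans.sort()
        # The gap between consecutive exons of one transcript is an intron.
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            if right_start > left_end + 1:
                junctions.add((seqid, left_end + 1, right_start - 1, strand))
    return junctions


# Hypothetical gene models: tx1 and tx2 share one junction, tx2 has one more.
model_lines = [
    "Chr1\tdemo\texon\t100\t200\t.\t+\t.\tParent=tx1",
    "Chr1\tdemo\texon\t301\t400\t.\t+\t.\tParent=tx1",
    "Chr1\tdemo\texon\t100\t200\t.\t+\t.\tParent=tx2",
    "Chr1\tdemo\texon\t301\t400\t.\t+\t.\tParent=tx2",
    "Chr1\tdemo\texon\t501\t600\t.\t+\t.\tParent=tx2",
]
annotated = junctions_from_exons(model_lines)

# Hypothetical RNA-seq junctions, written by hand for the comparison.
rnaseq = {("Chr1", 201, 300, "+"), ("Chr1", 701, 800, "+")}

only_annotated = annotated - rnaseq  # in gene models, not seen in RNA-seq
only_rnaseq = rnaseq - annotated     # seen in RNA-seq, not in any gene model
print(sorted(only_annotated), sorted(only_rnaseq))
```

Because shared junctions collapse into one set entry, this compares distinct junctions and not exon lines, which is the more meaningful comparison between the two files.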

9.4 years ago
biogirl ▴ 210

This webpage tells you how the annotation was done. Therefore, I imagine the TAIR10_gff3_genes.gff contains a mixture of models and experimental data. The Spliced_Junctions_clustered.gff file will contain the splice variants as predicted by gene models.


Thanks a lot for your comments!

