Question

Which Ensembl chromosome sequence file is used for the gff3 files?


5.9 years ago

O.rka ▴ 740

I can't tell which of the following files was used to create the GFF3 files:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

or an entirely different one from the ftp: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

The gff3 file I am using is the following: ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

ensembl gff3 genome chromosome human • 3.6k views


5.9 years ago

O.rka ▴ 740

Note for anyone trying to do this in the future. From the answers and very helpful conversation above, I got this combination to work:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

However, the toplevel file contains a lot of extra bottom-drawer scaffolds, so I ended up downloading only the per-chromosome files instead:

wget 'ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.*.fa.gz'
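If you have already downloaded the toplevel file, a rough alternative is to filter it down to the main chromosomes yourself. The sketch below is only an illustration: the function names and the `PRIMARY` set (1 to 22, X, Y, MT) are my own assumptions about which records you want, and it relies on the record name being the first token after `>` in each header.

```python
import gzip
from typing import Iterable, Iterator, Set

# Assumption: these are the record names you consider "main chromosomes".
PRIMARY: Set[str] = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def keep_primary(lines: Iterable[str], keep: Set[str] = PRIMARY) -> Iterator[str]:
    """Yield only the FASTA lines of records whose name is in `keep`."""
    wanted = False
    for line in lines:
        if line.startswith(">"):
            # Record name = first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            wanted = bool(tokens) and tokens[0] in keep
        if wanted:
            yield line


def filter_fasta_gz(src: str, dst: str) -> None:
    """Hypothetical helper: stream a gzipped FASTA and write the kept records."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(keep_primary(fin))
```

Calling `filter_fasta_gz("Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "chromosomes_only.fa.gz")` would stream the file without loading it into memory; the output file name is just a placeholder.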


score 2 · Accepted Answer · 2018-12-19


5.9 years ago

Ben Moore ★ 2.4k

Hi O.rka,

The GFF3 file contains all annotated features on all sequences, including chromosomes, regions not assembled into chromosomes, and haplotype/patch regions.

Therefore, the corresponding FASTA files will be one of the toplevel files here: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

Toplevel sequences unmasked: Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft-masked sequences: Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz

Toplevel hard-masked sequences: Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

There are further details and descriptions of the file names in the README: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/README
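One way to confirm that a FASTA file matches a GFF3 file is to check that every sequence ID in column 1 of the GFF3 also appears as a record name in the FASTA. The following is a minimal sketch of that check, not an official Ensembl tool; the function names are made up for illustration, and it assumes the FASTA record name is the first token after `>`.

```python
from typing import Iterable, Set


def gff3_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the distinct values of column 1 (seqid) from GFF3 lines."""
    ids: Set[str] = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        ids.add(line.split("\t", 1)[0])
    return ids


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect record names (first token after '>') from FASTA lines."""
    names: Set[str] = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def missing_from_fasta(gff_lines: Iterable[str], fasta_lines: Iterable[str]) -> Set[str]:
    """Return GFF3 seqids that have no matching FASTA record."""
    return gff3_seqids(gff_lines) - fasta_names(fasta_lines)
```

An empty result suggests the FASTA covers everything the GFF3 annotates; a non-empty one (for example scaffold or patch names) suggests you are looking at a file that contains only a subset of the sequences.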

Best wishes

Ben, Ensembl Helpdesk


Thank you! The part that confused me was that when I compared the coordinates in the GFF3 file against the DNA toplevel FASTA, all the exon sections were N.

5.9 years ago by O.rka ▴ 740

No problem, very happy to help. All alternative assembly and patch regions have their sequence padded with N's to ensure alignment programs can report the correct index regions.
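For anyone repeating the comparison, it is worth remembering that GFF3 start/end coordinates are 1-based and inclusive, so an off-by-one slice can point at the wrong bases. Here is a small sketch of pulling out a feature and testing whether it is all N; the helper names are hypothetical, and it assumes the sequence for that record has already been read into a single string.

```python
def feature_sequence(seq: str, start: int, end: int) -> str:
    """Return the bases for a GFF3 feature (1-based, inclusive coordinates)."""
    if start < 1 or end < start:
        raise ValueError("GFF3 coordinates must satisfy 1 <= start <= end")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[start - 1:end]


def is_all_n(bases: str) -> bool:
    """True if the string is non-empty and consists only of N/n."""
    return bool(bases) and set(bases.upper()) == {"N"}
```

If `is_all_n(feature_sequence(seq, start, end))` is true for an exon, check which record the feature sits on: for a patch or alternative-assembly record the real sequence only occupies part of the coordinate range, as described above.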

5.9 years ago by Ben Moore ★ 2.4k

Which version/ftp-link do you recommend for me to get the chromosome sequences? The one I used was: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz and it appeared to have only Ns for all of the chromosomes. Thank you.

5.9 years ago by O.rka ▴ 740

"it appeared to have only Ns for all of the chromosomes."

There are always N's in these chromosome records (especially at the beginning/ends of files). They indicate regions of DNA that we know are there but are not sequenceable by currently available technologies.
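Rather than judging by the first screenful of a file, you can measure how much of each record is actually N. The sketch below is one possible way to do that; `n_fraction` is a name I made up, and it assumes plain FASTA lines with the record name as the first token after `>`.

```python
from typing import Dict, Iterable, List


def n_fraction(lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of N/n bases for each FASTA record."""
    counts: Dict[str, List[int]] = {}  # record name -> [N count, total bases]
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += line.upper().count("N")
            counts[name][1] += len(line)
    return {k: (n / total if total else 0.0) for k, (n, total) in counts.items()}
```

To run it on a gzipped download, something like `n_fraction(gzip.open(path, "rt"))` should work after `import gzip`. A chromosome that starts with a long run of N but has a low overall fraction is behaving as described above.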

5.9 years ago by GenoMax 146k

Yes, the link you gave is the correct link to download the top-level sequences: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

5.9 years ago by Ben Moore ★ 2.4k