Which Ensembl chromosome sequence file is used for the gff3 files?
3.0 years ago
O.rka ▴ 410

I can't tell which of the following files was used to create the GFF3 files:

Or is it an entirely different one from the FTP directory? ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

The gff3 file I am using is the following: ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

ensembl gff3 genome chromosome human • 1.9k views
3.0 years ago
Ben_Ensembl ★ 1.8k

Hi O.rka,

The GFF3 file contains all annotated features on all sequences, including chromosomes, regions not assembled into chromosomes, and haplotype/patch regions.

Therefore, the corresponding FASTA files will be one of the toplevel files here: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

Toplevel sequences unmasked: Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft-masked sequences: Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz

Toplevel hard-masked sequences: Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

There are further details and descriptions of the file names in the README: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/README

Best wishes

Ben Ensembl Helpdesk

Thank you! The part that confused me was that when I compared the coordinates in the GFF3 file against the DNA toplevel FASTA, all the exon sections were N.

No problem, very happy to help. All alternative assembly and patch regions have their sequence padded with N's so that alignment programs can report the correct index regions.
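For anyone wanting to check what a GFF3 feature actually covers in the FASTA, here is a minimal Python sketch. It assumes the GFF3 seqid (column 1) matches the first token of the FASTA header, and it relies on GFF3 coordinates being 1-based and end-inclusive. The function names `read_fasta` and `feature_sequence` and the toy records are made up for illustration; they are not part of any Ensembl tooling.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {sequence_id: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def feature_sequence(records, gff3_line):
    """Return the stored (forward-strand) sequence covered by one GFF3 feature line."""
    cols = gff3_line.rstrip("\n").split("\t")
    seqid, start, end = cols[0], int(cols[3]), int(cols[4])
    # GFF3 is 1-based and end-inclusive; Python slices are 0-based and end-exclusive.
    return records[seqid][start - 1:end]


# Toy data (hypothetical), so the sketch runs without downloading anything.
fasta = [">1 dna:chromosome toy example", "NNNNACGTAC", "GTTTNN"]
gff3 = "1\ttoy\texon\t5\t12\t.\t+\t.\tID=exon1"

records = read_fasta(fasta)
seq = feature_sequence(records, gff3)
print(seq, "N fraction:", seq.count("N") / len(seq))
```

Loading every record into a dictionary is fine for a single chromosome file but memory-hungry for the whole toplevel file; an indexed reader (for example `samtools faidx`) is probably a better fit there.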

Which version/ftp-link do you recommend for me to get the chromosome sequences? The one I used was: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz and it appeared to have only Ns for all of the chromosomes. Thank you.

"it appeared to have only Ns for all of the chromosomes."

There are always N's in these chromosome records (especially at the beginning/ends of files). They indicate regions of DNA that we know are there but are not sequenceable by currently available technologies.
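Since the first lines of a chromosome record can be all N, looking at the top of the file is misleading. A rough way to see how much of each record is really N is to stream through the file and count. The sketch below is only an illustration: `n_fraction_per_record` and `summarise` are hypothetical helper names, and the toy input stands in for the real download.

```python
import gzip
import io


def n_fraction_per_record(handle):
    """Yield (sequence_id, length, fraction_of_N) for each FASTA record in a text handle."""
    name, length, n_count = None, 0, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length, (n_count / length if length else 0.0)
            name, length, n_count = line[1:].split()[0], 0, 0
        elif line:
            length += len(line)
            # Count both cases in case a file contains lowercase n.
            n_count += line.count("N") + line.count("n")
    if name is not None:
        yield name, length, (n_count / length if length else 0.0)


def summarise(path):
    """Open a gzipped FASTA file and return the per-record N summary as a list."""
    with gzip.open(path, "rt") as handle:
        return list(n_fraction_per_record(handle))


# Toy input (hypothetical records) so the sketch runs without the real file.
toy = io.StringIO(">1 toy\nNNNNACGT\nACGTNNNN\n>HG_PATCH toy\nNNNNNNNN\nNNACNNNN\n")
for name, length, frac in n_fraction_per_record(toy):
    print(name, length, frac)
```

On the real data you would call something like `summarise("Homo_sapiens.GRCh37.dna.toplevel.fa.gz")`, assuming the file has been downloaded to the working directory. A record that is mostly N is likely one of the padded patch or haplotype regions described above, while the primary chromosomes should show N only in part of their length.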

Yes, the link you gave is the correct link to download the top-level sequences: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

3.0 years ago
O.rka ▴ 410

Note for anyone trying to do this in the future. From the answers and very helpful conversation above, I got this combination to work:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz


However, there were a lot of extra bottom-drawer scaffolds, so I ended up doing the following:

wget 'ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.*.fa.gz'
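If you only download the per-chromosome FASTA files, the GFF3 will still contain features on the scaffolds and patch regions, so it may help to trim the annotation to match. Below is a minimal sketch of one way to do that in Python. It assumes the primary sequences are named 1 to 22, X, Y and MT with no "chr" prefix; `filter_gff3` is a hypothetical helper and the toy lines are made up.

```python
import io

# Assumption: primary human sequences are named 1-22, X, Y and MT (no "chr" prefix).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def filter_gff3(lines, keep=PRIMARY):
    """Yield GFF3 lines whose seqid (column 1) is in `keep`, plus comments and directives."""
    for line in lines:
        if line.startswith("#"):
            # '##sequence-region' directives name a seqid; drop those for skipped sequences.
            if line.startswith("##sequence-region"):
                parts = line.split()
                if len(parts) > 1 and parts[1] not in keep:
                    continue
            yield line
        elif line.strip():
            if line.split("\t", 1)[0] in keep:
                yield line


# Toy input (hypothetical coordinates) so the sketch runs without the real file.
toy = io.StringIO(
    "##gff-version 3\n"
    "##sequence-region   1 1 1000\n"
    "##sequence-region   GL000191.1 1 500\n"
    "1\ttoy\tgene\t100\t200\t.\t+\t.\tID=gene:A\n"
    "GL000191.1\ttoy\tgene\t10\t20\t.\t-\t.\tID=gene:B\n"
)
kept = list(filter_gff3(toy))
print("".join(kept), end="")
```

For the real annotation, the same function can be fed a handle from `gzip.open("Homo_sapiens.GRCh37.87.gff3.gz", "rt")` and the kept lines written to a new file. It is worth checking the seqids in your own files first, since this is only a sketch under the naming assumption above.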