Which Ensembl chromosome sequence file is used for the gff3 files?
3.0 years ago
O.rka ▴ 410

I can't tell which of the following files was used to create the GFF3 files,

or whether it was an entirely different one from the FTP site: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

The gff3 file I am using is the following: ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

ensembl gff3 genome chromosome human • 1.9k views
2
3.0 years ago
Ben_Ensembl ★ 1.8k

Hi O.rka,

The GFF3 file contains all annotated features on all sequences, including chromosomes, regions not assembled into chromosomes, and haplotype/patch regions.

Therefore, the corresponding FASTA files will be one of the toplevel files here: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

Toplevel sequences unmasked: Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft/hard masked sequences: Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

There are further details and descriptions of the file names in the README: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/README
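
As a quick sanity check on the pairing above, the sequence IDs in column 1 of the GFF3 can be compared against the FASTA header names. This is only a minimal sketch: the function name `gff3_seqids_missing_from_fasta` is made up for illustration, and it assumes both files are gzip-compressed and that the first word after `>` in each FASTA header is the sequence name.

```shell
# Print sequence IDs that appear in the GFF3 but not in the FASTA.
# Usage: gff3_seqids_missing_from_fasta annotation.gff3.gz genome.fa.gz
# (hypothetical helper name; both inputs are assumed to be gzip-compressed)
gff3_seqids_missing_from_fasta() {
  gff_ids=$(mktemp) && fa_ids=$(mktemp) || return 1
  # Column 1 of every non-comment, 9-column GFF3 line is the sequence ID.
  gzip -dc "$1" | awk -F '\t' '!/^#/ && NF >= 9 { print $1 }' | sort -u > "$gff_ids"
  # The first word after ">" in each FASTA header is taken as the sequence name.
  gzip -dc "$2" | awk '/^>/ { print substr($1, 2) }' | sort -u > "$fa_ids"
  # comm -23 keeps only lines unique to the first (GFF3) list.
  comm -23 "$gff_ids" "$fa_ids"
  rm -f "$gff_ids" "$fa_ids"
}
```

If the GFF3 and FASTA really correspond, this should print nothing; any ID it does print is annotated in the GFF3 but has no sequence in the FASTA.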

Best wishes

Ben Ensembl Helpdesk

0

Thank you! What confused me was that when I compared the coordinates in the GFF3 file against the toplevel DNA FASTA, all the exon regions were N.

1

No problem, very happy to help. All alternative assembly and patch regions have their sequence padded with N's to ensure alignment programs can report the correct index regions.
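
To see what the FASTA actually holds at a given feature, one option is to pull out the bases at the GFF3 coordinates (columns 1, 4 and 5, which are 1-based and inclusive). The helper below is an illustrative sketch, not an Ensembl tool: the name `fasta_region` is an assumption, and it streams the whole gzip-compressed file, so it is slow on a full genome compared with an indexed lookup.

```shell
# Print the bases of one sequence between start and end (1-based, inclusive).
# Usage: fasta_region genome.fa.gz SEQID START END
# (hypothetical helper name; input is assumed to be gzip-compressed FASTA)
fasta_region() {
  gzip -dc "$1" | awk -v id="$2" -v s="$3" -v e="$4" '
    /^>/ {
      if (found) exit                      # past the wanted record: stop
      found = (substr($1, 2) == id)        # first word after ">" is the name
      pos = 0                              # bases of this record seen so far
      next
    }
    found {
      len = length($0)
      lo = (s > pos + 1) ? s : pos + 1     # first wanted base on this line
      hi = (e < pos + len) ? e : pos + len # last wanted base on this line
      if (lo <= hi) printf "%s", substr($0, lo - pos, hi - lo + 1)
      pos += len
      if (pos >= e) exit
    }
    END { print "" }
  '
}
```

For example, `fasta_region Homo_sapiens.GRCh37.dna.toplevel.fa.gz 1 START END`, with START and END copied from an exon line in the GFF3, shows whether that stretch is real sequence or N padding.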

0

Which version/ftp-link do you recommend for me to get the chromosome sequences? The one I used was: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz and it appeared to have only Ns for all of the chromosomes. Thank you.

0

it appeared to have only Ns for all of the chromosomes.

There are always N's in these chromosome records (especially at the beginning/ends of files). They indicate regions of DNA that we know are there but are not sequenceable by currently available technologies.
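
Rather than judging by eye from the first screenful of a file, it may help to count the N's per record. The following is a rough sketch; `fasta_n_summary` is a made-up name, and it assumes a gzip-compressed FASTA.

```shell
# For each FASTA record print: name, total length, number of N/n bases.
# Usage: fasta_n_summary genome.fa.gz
# (hypothetical helper name; input is assumed to be gzip-compressed FASTA)
fasta_n_summary() {
  gzip -dc "$1" | awk '
    /^>/ {
      if (name != "") print name, total, n_count
      name = substr($1, 2); total = 0; n_count = 0
      next
    }
    {
      total += length($0)            # measure the line before gsub edits it
      n_count += gsub(/[Nn]/, "")    # gsub returns the number of N/n replaced
    }
    END { if (name != "") print name, total, n_count }
  '
}
```

A record whose third column equals its second is entirely N; a chromosome that merely starts with a long run of N's will show a much smaller count than its length.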

0

Yes, the link you gave is the correct link to download the top-level sequences: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

0
3.0 years ago
O.rka ▴ 410

Note for anyone trying to do this in the future. From the answers and very helpful conversation above, I got this combination to work:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz


but there were a lot of extra bottom drawer scaffolds so I ended up doing the following:

wget "ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.*.fa.gz"
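
If only the per-chromosome FASTA files are kept, the GFF3 will still carry features on the scaffolds and patches, so it may be worth trimming it to match. Below is a minimal sketch that keeps features on 1-22, X, Y and MT only; the function name `gff3_primary_only` and that choice of sequence names are my assumptions, so adjust the pattern to whatever your FASTA actually contains.

```shell
# Keep comment/directive lines plus features on chromosomes 1-22, X, Y and MT.
# Usage: gff3_primary_only Homo_sapiens.GRCh37.87.gff3.gz > primary.gff3
# (hypothetical helper name; the chromosome list is an assumption)
# Note: directive lines (e.g. ##sequence-region) for other sequences are
# passed through unchanged.
gff3_primary_only() {
  gzip -dc "$1" | awk -F '\t' '
    /^#/ { print; next }
    $1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y|MT)$/ { print }
  '
}
```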