Which Ensembl chromosome sequence file is used for the gff3 files?
2
2
Entering edit mode
5.9 years ago
Ben Moore ★ 2.4k

Hi O.rka,

The GFF3 file contains all annotated features on all sequences, including chromsomes, regions not assembled into chromosomes and haplotype/patch regions.

Therefore, the corresponding FASTA files will be one of the toplevel files here: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

Toplevel sequences unmasked: Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft/hard masked sequences: Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

There are further details and descriptions of the file names in the README: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/README

Best wishes

Ben Ensembl Helpdesk

ADD COMMENT
0
Entering edit mode

Thank you! The part that was confusing for me is that when I looked at the coordinates of GFF3 file in comparison to the DNA toplevel fasta and all the exon sections were N.

ADD REPLY
1
Entering edit mode

No problem- very happy to help. All alternative assembly and patch regions have their sequence padded with N's to ensure alignment programs can report the correct index regions.

ADD REPLY
0
Entering edit mode

Which version/ftp-link do you recommend for me to get the chromosome sequences? The one I used was: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz and it appeared to have only Ns for all of the chromosomes. Thank you.

ADD REPLY
0
Entering edit mode

it appeared to have only Ns for all of the chromosomes.

There are always N's in these chromosome records (especially at the beginning/ends of files). They indicate regions of DNA that we know are there but are not sequenceable by currently available technologies.

ADD REPLY
0
Entering edit mode

Yes, the link you gave is the correct link to download the top-level sequences: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ADD REPLY
0
Entering edit mode
5.9 years ago
O.rka ▴ 740

Note for anyone trying to do this in the future. From the answers and very helpful conversation above, I got this combination to work:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

but there were a lot of extra bottom drawer scaffolds so I ended up doing the following:

wget ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.*.fa.gz
ADD COMMENT

Login before adding your answer.

Traffic: 1716 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6