Question: Which Ensembl chromosome sequence file is used for the gff3 files?
O.rka wrote (10 months ago):

I can't tell which of the following files was used to create the GFF3 files:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

Or was it an entirely different file from the FTP directory? ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

The gff3 file I am using is the following: ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

Ben_Ensembl (EMBL-EBI) wrote (10 months ago):

Hi O.rka,

The GFF3 file contains all annotated features on all sequences, including chromosomes, regions not assembled into chromosomes, and haplotype/patch regions.

Therefore, the corresponding FASTA files will be one of the toplevel files here: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/

Toplevel sequences unmasked: Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft-masked sequences: Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz

Toplevel hard-masked sequences: Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

There are further details and descriptions of the file names in the README: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/README

Best wishes

Ben Ensembl Helpdesk
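
One way to check which FASTA file matches a GFF3 file is to compare the sequence names used in each. The following is a minimal sketch, assuming the FASTA record name is the first whitespace-delimited token of each header line and the GFF3 seqid is the first tab-delimited column. The helper names and the toy inputs are made up for illustration and are not taken from the Ensembl files.

```python
import io

def fasta_names(handle):
    """Return the set of sequence names (first token after '>') in a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def gff3_seqids(handle):
    """Return the set of seqids (column 1) used by feature lines in a GFF3 stream."""
    seqids = set()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # optional embedded sequence section; no feature lines follow it
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        seqids.add(line.split("\t", 1)[0])
    return seqids

# Toy inputs (hypothetical content, for illustration only).
toy_fasta = io.StringIO(">1 dna:chromosome\nACGT\n>GL000191.1 dna:supercontig\nNNAC\n")
toy_gff3 = io.StringIO(
    "##gff-version 3\n"
    "1\ttoy\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "HSCHR6_MHC_APD\ttoy\tgene\t1\t4\t.\t+\t.\tID=g2\n"
)

# Seqids annotated in the GFF3 but absent from the FASTA.
missing = gff3_seqids(toy_gff3) - fasta_names(toy_fasta)
print(sorted(missing))  # ['HSCHR6_MHC_APD']
```

For the real gzipped downloads, the same functions should work on handles opened with `gzip.open(path, "rt")`. An empty `missing` set would suggest the FASTA file covers every sequence the GFF3 refers to.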


Thank you! What confused me was that when I compared the coordinates in the GFF3 file against the DNA toplevel FASTA, all the exon regions were N.

Reply written 10 months ago by O.rka
No problem, very happy to help. All alternative assembly and patch regions have their sequence padded with N's to ensure alignment programs can report the correct index regions.

Reply written 10 months ago by Ben_Ensembl
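
To see whether exon coordinates really fall on N-padded sequence, one can compute the fraction of N bases under each exon. This is a rough sketch, assuming the sequences are already loaded into a dict keyed by sequence name and that the GFF3 coordinates are 1-based and inclusive, as the GFF3 format specifies. The function names and toy data are hypothetical.

```python
def n_fraction(sequence, start, end):
    """Fraction of N/n bases in the 1-based, inclusive interval [start, end]."""
    # GFF3 is 1-based inclusive; Python slices are 0-based half-open.
    region = sequence[start - 1:end]
    if not region:
        return 0.0
    return region.upper().count("N") / len(region)

def exon_n_fractions(gff3_lines, sequences):
    """Yield (seqid, start, end, N fraction) for each exon whose seqid is in `sequences`."""
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[2] != "exon" or cols[0] not in sequences:
            continue
        start, end = int(cols[3]), int(cols[4])
        yield cols[0], start, end, n_fraction(sequences[cols[0]], start, end)

# Toy data (made up): one ordinary sequence and one N-padded "patch" sequence.
toy_sequences = {"1": "NNNNACGTACGT", "patch": "NNNNNNNNACGT"}
toy_gff3 = [
    "1\ttoy\texon\t5\t8\t.\t+\t.\tID=e1\n",
    "patch\ttoy\texon\t1\t8\t.\t+\t.\tID=e2\n",
    "patch\ttoy\texon\t7\t10\t.\t+\t.\tID=e3\n",
]

results = list(exon_n_fractions(toy_gff3, toy_sequences))
for row in results:
    print(row)
```

If most exons on the main chromosomes report a fraction near 1.0, it may be worth checking for an off-by-one slice or a mismatch between the seqid and the FASTA record being read before concluding the sequence itself is missing.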

Which version/FTP link do you recommend for getting the chromosome sequences? The one I used was ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz, and it appeared to have only Ns for all of the chromosomes. Thank you.

Reply written 10 months ago by O.rka

"it appeared to have only Ns for all of the chromosomes."

There are always N's in these chromosome records (especially at the beginning/ends of files). They indicate regions of DNA that we know are there but are not sequenceable by currently available technologies.

Reply written 10 months ago by genomax
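
Looking only at the first lines of a chromosome record can give the impression that the whole record is N. A small sketch like the following, with hypothetical function names and a toy sequence, summarises how many N bases sit at each end of a sequence versus how many bases are actually resolved.

```python
def flanking_n_runs(sequence):
    """Return (leading, trailing) counts of consecutive N/n bases at the sequence ends."""
    seq = sequence.upper()
    leading = len(seq) - len(seq.lstrip("N"))
    trailing = len(seq) - len(seq.rstrip("N"))
    return leading, trailing

def n_summary(sequence):
    """Summarise total length, N content and the N runs at both ends of a sequence."""
    seq = sequence.upper()
    leading, trailing = flanking_n_runs(seq)
    n_total = seq.count("N")
    return {
        "length": len(seq),
        "n_total": n_total,
        "n_leading": leading,
        "n_trailing": trailing,
        "non_n": len(seq) - n_total,
    }

# Toy sequence (made up): N runs at both ends and one internal gap.
summary = n_summary("NNNACGTNNACGNN")
print(summary)
```

Note that for a sequence consisting entirely of N, both the leading and trailing counts equal the full length. A non-zero `non_n` value indicates the record does contain resolved sequence beyond the leading run.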

Yes, the link you gave is the correct link to download the top-level sequences: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Reply written 10 months ago by Ben_Ensembl
O.rka wrote (10 months ago):

Note for anyone trying to do this in the future. From the answers and very helpful conversation above, I got this combination to work:

ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz

However, the toplevel file contained a lot of extra "bottom drawer" scaffolds, so I ended up downloading only the per-chromosome files:

wget 'ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.*.fa.gz'
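
As an alternative to downloading the per-chromosome files, an already downloaded toplevel FASTA could be filtered down to the main chromosomes. The sketch below assumes the record name is the first token of each header line and that the main chromosomes are named 1 to 22, X, Y and MT; both are assumptions worth checking against the actual headers. The function name and the toy input are hypothetical.

```python
import io

# Assumed names of the main human chromosomes in the FASTA headers.
MAIN_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

def filter_fasta(in_handle, out_handle, keep):
    """Copy only records whose name (first token of the header) is in `keep`.

    Returns the list of record names that were written, in input order.
    """
    writing = False
    kept = []
    for line in in_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = name in keep
            if writing:
                kept.append(name)
        if writing:
            out_handle.write(line)
    return kept

# Toy toplevel-style input (made up): two chromosomes and one scaffold.
toy_in = io.StringIO(
    ">1 dna:chromosome\nACGT\nACGT\n"
    ">GL000191.1 dna:supercontig\nNNNN\n"
    ">X dna:chromosome\nTTTT\n"
)
toy_out = io.StringIO()
kept = filter_fasta(toy_in, toy_out, MAIN_CHROMOSOMES)
print(kept)
```

For the real files, the handles could come from `gzip.open(path, "rt")` and `gzip.open(path, "wt")`. Features in the GFF3 that sit on other seqids would then have no matching sequence, so the GFF3 may need the same filter applied to its first column.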