I can't tell which of the following files was used to create the GFF3 files:
ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
or an entirely different one from the ftp: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/
The gff3 file I am using is the following: ftp://ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz
Thank you! The part that confused me is that when I compared the coordinates in the GFF3 file against the DNA toplevel FASTA, all the exon regions were N.
No problem, very happy to help. All alternative assembly and patch regions have their sequence padded with N's to ensure alignment programs can report the correct index regions.
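To check whether a particular feature falls in an N-padded region, you could slice the sequence using the GFF3 coordinates. The sketch below is only an illustration and assumes the sequence is already loaded as a Python string. The function names and the toy data are made up for the example and are not part of any Ensembl tool. GFF3 start/end values are 1-based and inclusive, so the slice has to be shifted by one.

```python
def parse_gff3_line(line):
    """Return (seqid, feature_type, start, end, strand) from one GFF3 feature line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    return cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6]


def feature_sequence(chrom_seq, start, end):
    """Slice a sequence using GFF3 coordinates (1-based, inclusive)."""
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return chrom_seq[start - 1:end]


def n_fraction(seq):
    """Fraction of bases in seq that are N (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


# Toy data (hypothetical): N-padded flanks around 8 real bases.
toy_chrom = "NNNNNACGTACGTNNNNN"
toy_line = "1\thavana\texon\t6\t13\t.\t+\t.\tParent=transcript:TOY"

seqid, ftype, start, end, strand = parse_gff3_line(toy_line)
exon = feature_sequence(toy_chrom, start, end)
print(ftype, exon, n_fraction(exon))
```

If every exon you extract comes back with an N fraction of 1.0, it is worth checking that the record name in the FASTA header matches the seqid column of the GFF3 line, and that the slice is not off by one.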
Which version/ftp-link do you recommend for me to get the chromosome sequences? The one I used was: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz and it appeared to have only Ns for all of the chromosomes. Thank you.
There are always N's in these chromosome records (especially at the beginning/ends of files). They indicate regions of DNA that we know are there but are not sequenceable by currently available technologies.
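To see how much of each record is actually N, rather than judging from the first screenful of the file, you could count N's per record. This is a rough sketch using only the Python standard library. The function names are made up for illustration, and it assumes a plain FASTA layout (header lines starting with `>`, sequence on the following lines).

```python
import gzip


def n_content_per_record(lines):
    """Yield (record_name, length, n_count) for each FASTA record in an iterable of lines."""
    name, length, n_count = None, 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length, n_count
            header = line[1:].split()
            name = header[0] if header else ""
            length, n_count = 0, 0
        else:
            length += len(line)
            n_count += line.upper().count("N")
    if name is not None:
        yield name, length, n_count


def summarize_fasta_gz(path):
    """Print name, length, N count and N fraction for each record in a gzipped FASTA."""
    with gzip.open(path, "rt") as handle:
        for name, length, n_count in n_content_per_record(handle):
            frac = n_count / length if length else 0.0
            print(f"{name}\t{length}\t{n_count}\t{frac:.3f}")


# Small in-memory example (toy data, not real chromosome sequence).
toy_lines = [">1 dna:chromosome", "NNNNACGT", "ACGTNNNN", ">2 dna:chromosome", "ACGT"]
toy_summary = list(n_content_per_record(toy_lines))

# For a real file, assuming it has been downloaded to the working directory:
# summarize_fasta_gz("Homo_sapiens.GRCh37.dna.toplevel.fa.gz")
```

A chromosome that starts with a long run of N's will look like it is "all N" if you only view the top of the file, so a per-record count is a more reliable check.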
Yes, the link you gave is the correct link to download the top-level sequences: ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz