I got a bit lost while trying to access gene sequences in GRCh37. I downloaded CCDS coordinates (I tried datasets from Ensembl and NCBI: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ccdsGene.txt.gz along with the provided FASTA files of each chromosome from both databases), but the sequences I extract at the provided CCDS coordinates never match:

I "flatten" the FASTA file (e.g. for chromosome 1) so that it doesn't contain any newlines or FASTA headers, look up the sequence for a given gene using the coordinates from the CCDS file, and compare it to the sequence shown in an online viewer (for example Ensembl). The sequences never match. If I search my chromosome file for the sequence copied from Ensembl, I can find it, but at a notable offset.

Failing at such a simple task shows that I lack experience, so I kindly ask for two pieces of advice.

  • How to get/download the genome (GRCh37) with matching gene annotations so I can search (on my machine) for gene sequences?
  • I didn't find a good free edX/Coursera/... course or any other good tutorial. I have a CS background, so I'm fine with algorithms, Python etc., but is there a good online resource that gives an overview of the datasets and how to work with them?

Best wishes, and thanks for any replies.

Can you please give an example of a CCDS that doesn't seem to match up? We can use it to try to work out what's going wrong.

You can get them from RefSeq, UCSC, Ensembl etc. Here is the Gencode link for GRCh37 (hg19). There you can download the genome sequence and gene GTF files whose annotation matches that sequence. BTW, GRCh38 (hg38) is also available now. You can get the link to it on the same portal.
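Until you post an example, here is a minimal Python sketch of how the lookup could be done. It assumes the UCSC ccdsGene.txt layout (a leading bin column followed by genePred fields) and UCSC's coordinate convention of 0-based starts and exclusive ends; Ensembl's web pages show 1-based inclusive positions, which is one possible source of an off-by-one. The chromosome sequence, the CCDS_TOY.1 row and the function names are invented for illustration, not real data.

```python
import io

# Translation table used to reverse-complement minus-strand sequences.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Return {sequence_name: sequence} with headers and newlines removed."""
    sequences, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Keep only the first word of the header, e.g. "chr1".
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def parse_ccds_row(row):
    """Parse one tab-separated row; field 0 is assumed to be the bin column."""
    fields = row.rstrip("\n").split("\t")
    starts = [int(x) for x in fields[9].split(",") if x]
    ends = [int(x) for x in fields[10].split(",") if x]
    return {
        "name": fields[1],
        "chrom": fields[2],
        "strand": fields[3],
        "cds_start": int(fields[6]),
        "cds_end": int(fields[7]),
        "exons": list(zip(starts, ends)),
    }


def cds_sequence(record, genome):
    """Join the coding part of each exon (0-based start, exclusive end)."""
    chrom_seq = genome[record["chrom"]]
    parts = []
    for start, end in record["exons"]:
        s = max(start, record["cds_start"])
        e = min(end, record["cds_end"])
        if s < e:
            parts.append(chrom_seq[s:e])
    seq = "".join(parts)
    if record["strand"] == "-":
        # Minus-strand genes are stored in plus-strand coordinates.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy chromosome: two coding exons separated by a 6 bp intron.
toy_fasta = io.StringIO(">chr1 toy\nCCATGAAAGT\nTTAGCCCTAA\nGG\n")
toy_row = "585\tCCDS_TOY.1\tchr1\t+\t2\t20\t2\t20\t2\t2,14,\t8,20,\t0\tTOY\tcmpl\tcmpl\t0,0,"

genome = read_fasta(toy_fasta)
record = parse_ccds_row(toy_row)
print(cds_sequence(record, genome))  # ATGAAACCCTAA
```

Things worth checking against your own data: whether the chromosome names agree between files (UCSC uses "chr1", Ensembl uses "1"), whether the gene is on the minus strand (then the sequence must be reverse-complemented), and whether the CDS spans several exons, in which case a single start-to-end slice also contains the introns.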


