I got a bit lost while trying to access gene sequences in GRCh37. I downloaded CCDS data (I tried datasets from Ensembl and UCSC: ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ccdsGene.txt.gz along with the FASTA files provided for each chromosome by both databases), but the CCDS coordinates for the genes never match the sequences:
I "flatten" the FASTA file (e.g. for chromosome 1) so that it contains no newlines or FASTA headers, look up the sequence for a given gene using the coordinates from the CCDS file, and compare it to the sequence shown in an online viewer (for example Ensembl). The sequences never match. When I search my chromosome file for the sequence copied from Ensembl, I do find it, but at a notable offset from the CCDS coordinates.
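The lookup described above can be sketched roughly as follows. This is only a minimal sketch under assumptions I have not verified against the files: that the exon starts and ends in ccdsGene.txt are 0-based and half-open (as in UCSC genePred-style tables), that coordinates copied from a viewer such as Ensembl are 1-based and inclusive, and that genes on the minus strand must be reverse-complemented. The function names and the toy chromosome are made up for illustration. If any of these assumptions is handled differently in my real script, that might explain the offset.

```python
# Lookup table for complementing bases (upper and lower case, plus N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def flatten_fasta(lines):
    """Join the sequence lines of a single-record FASTA, skipping header lines."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def one_based_to_slice(start, end):
    """Convert 1-based inclusive coordinates to a 0-based half-open (start, end) pair."""
    return start - 1, end


def extract_cds(chrom_seq, strand, exon_starts, exon_ends):
    """Concatenate exons given as 0-based half-open intervals (assumed convention).

    For the minus strand, the concatenated genomic sequence is reverse-complemented.
    """
    cds = "".join(chrom_seq[s:e] for s, e in zip(exon_starts, exon_ends))
    return reverse_complement(cds) if strand == "-" else cds


# Toy chromosome, not real data.
fasta_lines = [">chrToy\n", "AAATGCC\n", "GGTTTAA\n", "CC\n"]
chrom = flatten_fasta(fasta_lines)

# Two exons: [2, 5) and [9, 14) in 0-based half-open coordinates.
plus_cds = extract_cds(chrom, "+", [2, 9], [5, 14])
minus_cds = extract_cds(chrom, "-", [2, 9], [5, 14])

# The first exon expressed in 1-based inclusive coordinates is 3..5.
s, e = one_based_to_slice(3, 5)
first_exon = chrom[s:e]

print(plus_cds, minus_cds, first_exon)
```

Note that the sketch joins the exons before comparing, since a viewer showing a coding sequence displays it without introns, so a single contiguous slice of the chromosome would not be expected to match a multi-exon gene.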
Failing at such a simple task shows that I lack experience, so I would kindly ask for two pieces of advice.
- How can I get/download the genome (GRCh37) with matching gene annotations, so that I can search for gene sequences on my own machine?
- I couldn't find a good free edX/Coursera/... course or any other good tutorial. I have a CS background, so I'm fine with algorithms, Python, etc. Is there a good online resource that gives an overview of the available datasets and how to work with them?
Best wishes, and thanks for any replies.