I am trying to get the sequences for a large list (about 17,000) of NCBI gene IDs.
What I found is that one can use http://www.ensembl.org/biomart/martview/7cf551cc4abf51e75bb4f1d84477681e and then download the sequences, but this is slow and only works for 500 gene IDs at a time.
Is there a database or file available somewhere that I can use to retrieve the nucleotide sequences for the corresponding genes?
I found https://www.ncbi.nlm.nih.gov/sites/batchentrez, which for genes can yield a mapping of the form:
gene_id -> genomic_nucleotide_accession.version:start_position_on_the_genomic_accession-end_position_on_the_genomic_accession
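To make the format concrete, the right-hand side of such a mapping could be split into its parts roughly like this. This is only a sketch: the names `Locus` and `parse_locus` and the example coordinates are made up for illustration, and it says nothing about whether the positions are 0-based or 1-based.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    """One 'accession.version:start-end' entry (hypothetical helper type)."""
    accession: str
    start: int
    end: int


# accession.version, a colon, then two integers separated by a hyphen
_LOCUS_RE = re.compile(r"^(?P<acc>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_locus(text: str) -> Locus:
    """Split 'accession.version:start-end' into its three parts."""
    match = _LOCUS_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in accession:start-end form: {text!r}")
    return Locus(match["acc"], int(match["start"]), int(match["end"]))


# Made-up coordinates, only to show the shape of the result.
locus = parse_locus("NC_000003.12:100-200")
print(locus.accession, locus.start, locus.end)
```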
Then I used the "nucleotide" batch search. But when I enter e.g.
An illegal character in a token. Possible wrong file format. Request processing canceled.
I also found that one can download the genome from ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/, but I am a bit lost here. The README states:
*_genomic.fna.gz file FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case (see below).
So I downloaded this file, grepped through it a bit, and can find, for example, NC_000003.12. But now I also need to extract the sequences at the given positions, which is super slow when done the naive way, for example in Python.
So my question is: How should I approach this task?