I encounter the following problem: I have a list of chloroplast gene IDs and a list of taxonomic IDs, and I want to retrieve for each organism all the nucleotide sequences of the gene list. This seems pretty simple: I "just" want to retrieve all sequences in fasta format. It is not a problem to write a script to iterate over the different request, but even for a simple pair gene ID - organism ID I was not able to find the correct way to do this... I tried the following, for example for the pair ycf2[Gene Name] AND txid4097[Organism] :
- first, I try the NCBI website, but by searching in Gene database (query: ycf2[Gene Name] AND txid4097[Organism]) I did not obtain the sequences, in protein database I obtained only amino acid sequence and in nucleotide database I obtained only the whole chloroplast sequence...
- using Bioperl, I got a "WebEnv parameter is required" error (I don't copy-paste the code, it is similar as explained in http://bioperl.org/howtos/Beginners_HOWTO.html)
- then I tried with e-utilities : esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format fasta. This works +/- but I only get the names and descriptions, not the nucleotide sequences
- I also tried with esearch to extract the ID of the corresponding protein, but I did not succeed to retrieve the sequence (esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format xml | grep "<gene-commentary_accession>NP" | head -1 | sed -r 's/<.+>(.+)<\/.+>$/\1/')
- finally, I tried Biopython, but got the same error as already reported (Biopython: Entrez.efetch causes UnboundLocalError)
If anybody can help, I would be very grateful !
Thank you in advance,
EDIT: my problem could be solved if someone could tell me how to emulate via a script or command line the "Send -> "Gene features" -> "file" of a Genbank record such as this one https://www.ncbi.nlm.nih.gov/nuccore /NC_001879.2?report=genbank ??? Thanks.