Hi all, I would like to download the longest transcript of every protein-coding gene in the vervet genome (assembly ID 132581 in NCBI) in FASTA format.
I used the command:
elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na >> chlorocebus_vervet.genes
where -id is the assembly ID for the species.
- When I run just elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq, the output is (I had to slightly amend the formatting, but the numbers have not been altered):
Maybe I'm misunderstanding, but is this telling me that 7,156 protein-coding sequences are being downloaded? When I manually search for "132581" in the NCBI Assembly database, the result is: https://www.ncbi.nlm.nih.gov/assembly/GCF_000409795.2/. When I then go to https://www.ncbi.nlm.nih.gov/genome/?term=txid60711[orgn], click "Gene", and select the "protein coding" category, I can see 20,633 records, and those are the ones I would like the FASTA sequences for.
- When I run the full command (elink -db assembly -target nuccore -id 132581 -name assembly_nuccore_refseq |efetch -db nuccore -format fasta_cds_na >> chlorocebus_vervet.genes), two things happen: (a) Not all 7,156 records download; the download stops somewhere between 150 and 400 records. (b) There are large, uneven gaps (i.e. runs of empty lines) at seemingly random places in the FASTA output. I tried to copy and paste an example here, but the gaps are automatically stripped.
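Since the gaps cannot be pasted here, both symptoms can at least be put into numbers. The following is only a minimal sketch using plain grep, assuming the output file is chlorocebus_vervet.genes as in the command above; the function names count_fasta_records and count_blank_lines are hypothetical helpers, not part of EDirect.

```shell
# Hypothetical helpers (not part of EDirect) to summarize a FASTA file.

count_fasta_records() {
    # Every FASTA record begins with a '>' header line, so count those.
    # '|| true' keeps the exit status at 0 when grep finds no matches.
    grep -c '^>' "$1" || true
}

count_blank_lines() {
    # Count lines that are empty or contain only whitespace.
    grep -c '^[[:space:]]*$' "$1" || true
}

# Report on the output file from the command above, if it exists.
if [ -f chlorocebus_vervet.genes ]; then
    echo "records:     $(count_fasta_records chlorocebus_vervet.genes)"
    echo "blank lines: $(count_blank_lines chlorocebus_vervet.genes)"
fi
```

Note that the command appends with >>, so the file should be deleted between runs; otherwise the counts include records from earlier attempts.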
Could someone please tell me what I am doing wrong, and how I can retrieve the longest transcript for each of the 20,633 protein-coding genes for this species?