Hi, I am really struggling to figure out how to obtain genome assembly IDs from genbank for a batch of protein accessions.
I have locally downloaded a set of several thousand bacterial genomes, which I am concatenating and using to score with a set of custom HMMs using HMMer. I have been using the _translated_cds.faa files from NCBI's FTP, when available, since they have more information in the header lines (compared to the _protein.faa files). However, neither of these have any information which links a protein hit back to the genome assembly from which it was obtained!
How can I map my hits back to the original assemblies they came from? This seems so crazy that the translated_CDS headers don't have any information about the actual genome/assembly/organism which they came from!!!
My best bet so far has been to look at each individual .faa file (before I'd concatenated them), since the assembly ID is present in the filename.
(For example: GCA_010091945.1_ASM1009194v1_translated_cds.faa >>> this supplies me with GCA_010091945.1, which I can then use to link to a table with all the metadata for the assemblies I'm searching he assembly) (built from the giant table "assembly_summary_genbank.txt" which I can find somewhere in NCBI's FTP site.)
Then I grab the prefix of the locus tag for the first header and map it to the assembly ID from the filename: (eg, I grab "GS601_" from this header: >lcl|WVIE01000010.1_prot_NDJ17650.1_1 [locus_tag=GS601_10170] [protein=SPASM domain-containing protein] [protein_id=NDJ17650.1] [location=complement(79..1437)] [gbkey=CDS]))
(It would be better to use grep -H for all of the protein IDs of all the hits, but I can't get grep to work in a for loop, for some reason).
Does anyone have a simpler way of doing this?
Thanks genomax! I hadn't messed around with Entrez-direct yet, but just installed it and was able to reproduce your example. I will read the docs now, and report back soon before accepting this answer...
How can I get it to give me the genbank instead of refseq assembly IDs? And also it would be better to query with the protein accessions (eg, NDJ17650.1), since occasionally the locus tag isn't available (or doesn't exist for that assembly).
I also got grep to work in a loop, now, so that is also an appealing strategy now... (following this post on stack exchange, I found out I must have been having problems with the linebreaks in my ID list (originating from DOS?))
Use
ADD COMMENT/ADD REPLY
when responding to existing posts to keep threads logically organized.Following should get the GenBank ID.