how to map NCBI protein IDs to assembly IDs
1
0
Entering edit mode
2.2 years ago

Hi, I am really struggling to figure out how to obtain genome assembly IDs from genbank for a batch of protein accessions.

I have locally downloaded a set of several thousand bacterial genomes, which I am concatenating and using to score with a set of custom HMMs using HMMer. I have been using the _translated_cds.faa files from NCBI's FTP, when available, since they have more information in the header lines (compared to the _protein.faa files). However, neither of these have any information which links a protein hit back to the genome assembly from which it was obtained!

How can I map my hits back to the original assemblies they came from? This seems so crazy that the translated_CDS headers don't have any information about the actual genome/assembly/organism which they came from!!!

My best bet so far has been to look at each individual .faa file (before I'd concatenated them), since the assembly ID is present in the filename.
(For example: GCA_010091945.1_ASM1009194v1_translated_cds.faa >>> this supplies me with GCA_010091945.1, which I can then use to link to a table with all the metadata for the assemblies I'm searching he assembly) (built from the giant table "assembly_summary_genbank.txt" which I can find somewhere in NCBI's FTP site.)

Then I grab the prefix of the locus tag for the first header and map it to the assembly ID from the filename: (eg, I grab "GS601_" from this header: >lcl|WVIE01000010.1_prot_NDJ17650.1_1 [locus_tag=GS601_10170] [protein=SPASM domain-containing protein] [protein_id=NDJ17650.1] [location=complement(79..1437)] [gbkey=CDS]))

(It would be better to use grep -H for all of the protein IDs of all the hits, but I can't get grep to work in a for loop, for some reason).

Does anyone have a simpler way of doing this?

genome sequence ncbi RNA-Seq • 431 views
0
Entering edit mode

Thanks genomax! I hadn't messed around with Entrez-direct yet, but just installed it and was able to reproduce your example. I will read the docs now, and report back soon before accepting this answer...

How can I get it to give me the genbank instead of refseq assembly IDs? And also it would be better to query with the protein accessions (eg, NDJ17650.1), since occasionally the locus tag isn't available (or doesn't exist for that assembly).

I also got grep to work in a loop, now, so that is also an appealing strategy now... (following this post on stack exchange, I found out I must have been having problems with the linebreaks in my ID list (originating from DOS?))

0
Entering edit mode

Use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Following should get the GenBank ID.

$esearch -db nuccore -query "GS601_10170" | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Genbank GCA_010091945.1  ADD REPLY 0 Entering edit mode 2.2 years ago GenoMax 115k You could use Entrez-direct and the locus tag like this: $ esearch -db nuccore -query "GS601_10170" | elink -target assembly | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCF_010091945.1


That protein ID is from IPG database and harder to map back to a unique genome.