Question: Determine NCBI Nucleotide source of .fasta amino acid file
5 weeks ago by LRStar (United States)
LRStar wrote:

I have a .fasta file with amino acid sequences. The beginning of the file is as follows:

>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
MKKWFLAAAVVACVLMTGCPPRLKKPPPPPNPPPNLKNTPCPKKRPRPSP*KNPSPM*KVARLSGKCILI
LTNTMCVQTCKAQSMKP*KKSKNTV*KYSWRATPMSLVQANIILP*ATNAALV*KMF*LSRASVRTVLKW
*VLEKPNPFARKKLQSATVKTAVLTSKLWT

I am trying to find the source of this file. I believe I obtained it from NCBI Nucleotide (https://www.ncbi.nlm.nih.gov/nuccore/) while searching for the complete genome of a Helicobacter species. Once I found the species, I believe I clicked on "Send to", "Coding sequences", and then "FASTA protein". Then I downloaded that as a .fasta file.

Now, I am trying to determine the exact origin of this .fasta file I have. I am attempting to give the NCBI Nucleotide link to colleagues. Is it possible for me to 'reverse engineer' this type of file and determine where I downloaded it from?

nucleotide ncbi fasta
modified 4 weeks ago by genomax • written 5 weeks ago by LRStar

You could also search NCBI with NC_019674, which will lead you to the corresponding genome page. Protein and nucleotide FASTA sequences are available in the top box. Note: these are representative sequences for multiple genomes and are labeled with WP identifiers.

written 4 weeks ago by genomax
5 weeks ago by gb
gb wrote:

this? https://www.ncbi.nlm.nih.gov/nuccore/NC_019674.1/ or this? https://www.ncbi.nlm.nih.gov/nuccore/NC_019674.1?location=1804546:1804601,1:456

written 5 weeks ago by gb

Thanks @gb. I believe it is the first one. What was your process for determining that? I have a few other files like this, and I suspect my navigation skills on NCBI Nucleotide are not up to par, because I often cannot reverse engineer where my files came from. (Yes, I plan to take better notes when I create files in the future as well :)

written 5 weeks ago by LRStar

To be honest, I never thought my comment would help you. But I can try to explain. The header, or description, of a sequence is the line starting with ">". In your case:

>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]

The headers of sequences in the NCBI databases are built in a specific way; the ID part is mostly split with "|" characters. The first item in your case is:

>lcl

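That first item tells you what kind of ID the next item is (https://en.wikipedia.org/wiki/FASTA_format#NCBI_identifiers); "lcl" marks a local identifier. The next item is the identifier itself. In your case it starts with the accession of the record the proteins were derived from: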

NC_019674.1

You can use this "code" to look up the sequence on the NCBI website. In your case the record contains a lot of information, and not all of it is shown on the page by default; you can see more if you click on "Customize view". You can look up such IDs at https://www.ncbi.nlm.nih.gov/, for example, using the search field at the top of the page.
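
As a rough sketch of the breakdown above, here is how the header could be split in plain shell. It assumes the header looks like the ones in this thread, where the accession is the part of the local ID before "_prot_" (or "_cds_"); the variable names are only for illustration.

```shell
# Split one FASTA header from a "Coding sequences" download into its parts.
header='>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]'

# Everything up to the first space is the sequence ID; drop the leading ">".
seq_id=${header%% *}
seq_id=${seq_id#>}

# The part before "|" says what kind of ID follows ("lcl" = local).
id_type=${seq_id%%|*}
local_id=${seq_id#*|}

# Assumption: the source accession is whatever precedes "_prot_" or "_cds_".
accession=${local_id%%_prot_*}
accession=${accession%%_cds_*}

echo "id type:   $id_type"
echo "local id:  $local_id"
echo "accession: $accession"
echo "link:      https://www.ncbi.nlm.nih.gov/nuccore/$accession"
```

The last line simply appends the accession to the nuccore address, which is the kind of link you can pass on to colleagues.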

OK, to add to this comment: this is a very simple and basic explanation. Maybe someone else wants to explain it better or in more detail. For example, there are many more ways to look up those IDs, with scripts, R packages, etc. Even the ID, usually called an accession, is built up in a specific way (https://www.ncbi.nlm.nih.gov/Sequin/acc.html), and accessions are also connected to taxonomy, BioProjects and many more things.
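
Since you have several files like this, one possible script-based approach is to scan all headers of a file and list the distinct source accessions with a nuccore link for each. This is only a sketch: `list_sources` is a made-up helper name, and it assumes every header follows the "lcl|ACCESSION_prot_..." or "lcl|ACCESSION_cds_..." pattern seen in this thread.

```shell
# Print "ACCESSION LINK" once for every distinct source accession in a FASTA file.
list_sources() {
    awk '/^>/ {
        id = substr($1, 2)               # first field without the leading ">"
        sub(/^[^|]*\|/, "", id)          # drop the ID type, e.g. "lcl|"
        sub(/_(prot|cds)_.*$/, "", id)   # assumption: accession precedes _prot_/_cds_
        print id
    }' "$1" | sort -u | while read -r acc; do
        echo "$acc https://www.ncbi.nlm.nih.gov/nuccore/$acc"
    done
}

# Small demo on two headers copied from this thread (sequences shortened).
demo=$(mktemp)
cat > "$demo" <<'EOF'
>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
MKKWFLAAAVVACVLMTGCPPRLKKPPPPPNPPPNLKNTPCPKKRPRPSP
>lcl|NC_019674.1_prot_WP_015105870.1_2 [locus_tag=BN341_RS00010] [protein_id=WP_015105870.1] [location=457..1398] [gbkey=CDS]
MRFLGLLVGGLLCAEPSAFELQSGATKQELSTLKSSNKNLGDILTALKGQ
EOF
list_sources "$demo"
rm -f "$demo"
```

A file that was downloaded from a single record should give exactly one line; more than one line would suggest the file was put together from several records.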

modified 4 weeks ago • written 4 weeks ago by gb

4 weeks ago by genomax (United States)
genomax wrote:

Looks like you may have retrieved that file by using Entrez Direct like so:

$ esearch -db nuccore -query "NC_019674.1" | efetch -format fasta_cds_aa 
>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
MKKWFLAAAVVACVLMTGCPPRLKKPPPPPNPPPNLKNTPCPKKRPRPSP*KNPSPM*KVARLSGKCILI
LTNTMCVQTCKAQSMKP*KKSKNTV*KYSWRATPMSLVQANIILP*ATNAALV*KMF*LSRASVRTVLKW
*VLEKPNPFARKKLQSATVKTAVLTSKLWT
>lcl|NC_019674.1_prot_WP_015105870.1_2 [locus_tag=BN341_RS00010] [protein=TPR repeat containing exported protein; Putative periplasmic protein contains a protein prenylyltransferase domain] [protein_id=WP_015105870.1] [location=457..1398] [gbkey=CDS]
MRFLGLLVGGLLCAEPSAFELQSGATKQELSTLKSSNKNLGDILTALKGQTNGLLQGQEGLRSLVEGQGI
RLKKATDALNAHSDELKALKSTQDAQADLIKQQADLIHTLKTQIQTNQDALANFEKKNQETQQLLENMRA
...................

So if you need to get the nucleotide sequence then you should do the following (sequence truncated to save space):

$ esearch -db nuccore -query "NC_019674.1" | efetch -format fasta_cds_na 
>lcl|NC_019674.1_cds_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
ATGAAAAAGTGGTTTTTAGCCGCCGCAGTTGTGGCGTGTGTGTTGATGACAGGGTGCCCCCCCAGGCTAA
AGAAGCCACCCCCGCCCCCAAACCCGCCCCCAAACCTGAAGAACACACCGTGCCCAAAGAAGAGGCCCAG
GCCAAGCCCGTAGAAAAACCCAAGCCCCATGTAGAAAGTGGCACGATTGTCGGGCAAGTGTATTTTGATT
......
>lcl|NC_019674.1_cds_WP_015105870.1_2 [locus_tag=BN341_RS00010] [protein=TPR repeat containing exported protein; Putative periplasmic protein contains a protein prenylyltransferase domain] [protein_id=WP_015105870.1] [location=457..1398] [gbkey=CDS]
GTGCGGTTTTTAGGCTTGCTTGTGGGGGGGCTCTTGTGCGCTGAGCCCTCCGCTTTTGAACTGCAAAGTG
GGGCGACCAAGCAAGAGTTAAGTACCCTAAAAAGCAGCAATAAAAACCTAGGTGACATCTTAACCGCGCT
.................
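
If you want to check whether a freshly fetched file really matches the one you already have, something along these lines might help. It is a sketch only: `flatten_fasta` and `same_records` are hypothetical helper names, the file names in the usage comment are placeholders, and records on NCBI can be re-annotated over time, so a mismatch does not prove that the file came from somewhere else.

```shell
# Print each record as "header<TAB>sequence" with the sequence joined onto
# one line, so that differences in line wrapping do not matter.
flatten_fasta() {
    awk '
        /^>/ {
            if (hdr != "") print hdr "\t" seq
            sub(/[ \t\r]+$/, "")          # trim trailing whitespace on the header
            hdr = $0; seq = ""
            next
        }
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { if (hdr != "") print hdr "\t" seq }
    ' "$1"
}

# Succeed (exit status 0) if both files hold the same records, in any order.
same_records() {
    [ "$(flatten_fasta "$1" | sort)" = "$(flatten_fasta "$2" | sort)" ]
}

# Usage (placeholder file names; the first line needs Entrez Direct and network access):
#   esearch -db nuccore -query "NC_019674.1" | efetch -format fasta_cds_aa > refetched.fasta
#   same_records my_file.fasta refetched.fasta && echo "same records"
```

Sorting the flattened records makes the comparison independent of record order, which seemed the safer choice when the two files come from different download routes.
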
modified 4 weeks ago • written 4 weeks ago by genomax