How to extract translated sequence of a partcular gene by parsing GenBank file
3
0
Entering edit mode
2.9 years ago
Gen • 0

Hey, I have a large GB file that contains thousands of records. I used the NCBI E-utility tool to retrieve the GB formatted files. Now, I need the translated sequence of a particular gene (pol) along with the accession number and Definition in a single fasta file. Here, I am giving an example that can help in easy understanding:

INPUT:

LOCUS       LC570899                 893 bp    RNA     linear   VRL 08-AUG-2020
DEFINITION  Human immunodeficiency virus 1 msaha01609 pol gene for pol protein,
            partial cds.
ACCESSION   LC570899
VERSION     LC570899.1
KEYWORDS    .
SOURCE      Human immunodeficiency virus 1 (HIV-1)
  ORGANISM  Human immunodeficiency virus 1
            Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;
            Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.
REFERENCE   1
  AUTHORS   Bhatta,M., Nandi,S. and Saha,M.K.
  TITLE     HIV Drug Related
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 893)
  AUTHORS   Bhatta,M., Nandi,S. and Saha,M.K.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-JUL-2020) Contact:Mihir Bhatta ICMR-National
            Institute of Cholera and Enteric Disease, Virology; P-33, CIT Road,
            Scheme-XM, Beliaghata, Kolkata, Kolkata, West Bengal 700010, India
FEATURES             Location/Qualifiers
     source          1..893
                     /organism="Human immunodeficiency virus 1"
                     /mol_type="genomic RNA"
                     /isolate="msaha01609"
                     /host="Homo sapiens"
                     /db_xref="taxon:11676"
                     /country="India"
                     /collection_date="2017-07-19"
                     /collected_by="Srijita Nandi"
                     /identified_by="Mihir Bhatta"
     gene            <1..>893
                     /gene="pol"
     CDS             <1..>893
                     /gene="pol"
                     /note="protease and reverse transcriptase"
                     /codon_start=3
                     /product="pol protein"
                     /protein_id="BCK50781.1"
                     /translation="QRPLVPIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
                     GGFIKVRQYDQIPIEICGXKAIGTVLVGPTPVNIIGRNLLTQLGCTLNFPISPIETVP
                     VKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGKISKIGPENPYNTPIFAIKKKD
                     STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
                     KYTAFTIPSLNNETPGIRYQYNVLPQGWKGSPXIFQASMTKILEPFRAQNPEIVIYQY
                     MDDLYVGSDLEIGQHRAKIEE"
ORIGIN      
        1 ggcagcgacc ccttgtccca ataaaagtag ggggtcagac aaaagaggct ctcttagaca
       61 caggagcaga tgatacagta ttagaagaaa taaatttgcc aggaaaatgg aaaccaaaaa
      121 tgataggagg aattggaggt ttyatcaaag tgagacaata tgatcaaata cctatagaaa
      181 tttgtggama aaaggctata ggtacagtat tagtgggacc tacacctgtc aacataattg
      241 gaagaaatct gttgactcag cttggatgca cactaaattt tccaattagt cccattgaaa
      301 ctgtaccagt aaaattaaaa ccaggaatgg atggcccaaa ggttaaacaa tggccattga
      361 cagaagagaa aataaaagca ttaacagcaa tttgtgagga aatggagaag gaaggaaaaa
      421 tttcaaaaat tgggcctgaa aatccatata acactccaat atttgccata aaaaagaagg
      481 acagtactaa gtggagaaaa ttagtagatt tcagggaact caataaaaga actcaagatt
      541 tttgggaagt ccaattagga ataccacacc cagcagggtt aaaaaagaaa aaatcagtga
      601 cagtactgga tgtgggggat gcatattttt cagttccttt atatgaagay ttcaggaaat
      661 atactgcatt caccatacct agtttaaaca atgaaacacc agggattaga tatcaatata
      721 atgtgcttcc acagggatgg aaaggatcac cakcaatatt ccaggcyagc atgacaaaaa
      781 tcttagagcc ctttagggca caaaatccag aaatagtcat ctatcaatat atggatgact
      841 tgtatgtagg atctgactta gaaatagggc aacatagagc aaaaatagaa gaa
//

LOCUS       EU158868                1703 bp    DNA     linear   VRL 26-JUL-2016
DEFINITION  HIV-1 isolate G8-AFMC-5 from India gag protein (gag) and pol
            protein (pol) genes, partial cds.
ACCESSION   EU158868
VERSION     EU158868.1
KEYWORDS    .
SOURCE      Human immunodeficiency virus 1 (HIV-1)
  ORGANISM  Human immunodeficiency virus 1
            Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;
            Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.
REFERENCE   1  (bases 1 to 1703)
  AUTHORS   Lall,M., Gupta,R.M., Sen,S., Kapila,K., Tripathy,S.P. and
            Paranjape,R.S.
  TITLE     Profile of primary resistance in HIV-1-infected treatment-naive
            individuals from Western India
  JOURNAL   AIDS Res. Hum. Retroviruses 24 (7), 987-990 (2008)
   PUBMED   18593351
REFERENCE   2  (bases 1 to 1703)
  AUTHORS   Lall,M., Gupta,R.M., Sen,S., Kapila,K., Tripathy,S.P. and
            Paranjpe,R.S.
  TITLE     Direct Submission
  JOURNAL   Submitted (17-SEP-2007) Dept. of Microbiology, Armed Forces Medical
            College, Sholapur Road, Pune, Maharashtra 411040, India
FEATURES             Location/Qualifiers
     source          1..1703
                     /organism="Human immunodeficiency virus 1"
                     /proviral
                     /mol_type="genomic DNA"
                     /isolate="G8-AFMC-5"
                     /db_xref="taxon:11676"
                     /country="India"
                     /note="antiretroviral naive individual;
                     subtype: C"
     gene            <1..241
                     /gene="gag"
     CDS             <1..241
                     /gene="gag"
                     /codon_start=2
                     /product="gag protein"
                     /protein_id="ACB20420.1"
                     /translation="REGPIMRDCTERQANFLGKIWPSLKGRPGNFLQSRPEPTAPPAE
                     SFRFEEPTPAPKQEPKDREPLTALRSLFGSDPLSQ"
     gene            <46..>1703
                     /gene="pol"
     CDS             <46..>1703
                     /gene="pol"
                     /codon_start=1
                     /product="pol protein"
                     /protein_id="ACB20421.1"
                     /translation="FFRENLAFPQGEAREFPPEQTRANSPTSRELQVRGANPSSEAGA
                     ERQGALNCPQITLWQRPLVSIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
                     GGFIKVRQYDQIPIEICGKXAIGTVLVGPTPVNIIGRNMLTQLGCTLNFPISPIETVP
                     VKLKPGMDGPKVKQWPLTEEKIKALTEICDEMEKEGKITKIGPENPYNTPIFAIKKKD
                     STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
                     KYTAFTIPSTNNETPGIRYQYNVLPQGWKGSPAIFQASMTKILEPFREQNPEIVIYQY
                     MDDLYVGSDLEIGQHRAKIEELREHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTV
                     QPIQLPEKDSWTVNDIQKLVGKLNWASQIYPGIKIRQLCKLLRGAKALTEIVPLTKEA
                     ELELAENREILKEPVHGAYYDPSKDLIAEIQKQGRDQWTYQIYQEPFKNLKTGKYAKM
                     RTAHTNDVKQLTEAVQKIAMESIVIWGKTPKFRLPIQKETWETW"
ORIGIN      
        1 aagggaagga cccataatga gagactgtac tgaaagacag gctaattttt tagggaaaat
       61 ttggccttcc ctcaagggga ggccagggaa tttcctccag agcagaccag agccaacagc
      121 cccaccagca gagagcttca ggttcgagga gccaacccca gctccgaagc aggagccgaa
      181 agacagggag cccttaactg ccctcagatc actctttggc agcgacccct tgtctcaata
      241 aaagtagggg gccaaacaaa agaggctctc ttagacacag gagcagatga tacagtatta
      301 gaagaaataa atttgccagg gaaatggaaa ccaaaaatga taggaggaat tggaggtttt
      361 atcaaagtaa gacaatatga tcaaatacct atagaaattt gtggaaaaar ggctataggt
      421 acagtattag taggacccac acctgtcaac ataattggaa gaaatatgtt gactcagctt
      481 ggatgcacac taaattttcc aatcagtcct attgaaactg taccagtaaa attaaagcca
      541 ggaatggatg gcccaaaggt taaacaatgg ccattgacag aagagaaaat aaaagcatta
      601 acagaaatct gtgatgaaat ggagaaggaa ggaaaaatta caaaaattgg gcctgaaaat
      661 ccatataaca ctccaatatt ygccataaaa aagaaggaca gtactaagtg gagaaaatta
      721 gtagatttca gggaactcaa taaaagaact caagattttt gggaagtcca attaggaata
      781 ccacacccag cagggttaaa raagaaaaaa tcagtgacag tactagatgt gggggatgca
      841 tatttttcag tacctttata tgaagacttc aggaagtata ctgcattcac catacctagt
      901 acaaacaatg aaacaccagg gattaggtat caatataatg tgcttccaca gggatggaaa
      961 ggatcaccag caatattcca ggctagcatg acaaaaatct tagagccctt tagggaacaa
     1021 aatccagaaa tagtcatcta tcaatatatg gatgacttgt atgtaggatc tgacttagaa
     1081 atagggcaac atagagctaa aatagaggag ttaagagaac atctgttaaa gtggggattt
     1141 accacaccag ataagaagca tcagaaagaa cccccatttc tttggatggg gtatgaactc
     1201 catcctgaca aatggacagt acagcctata cagctgccag aaaaggatag ctggactgtc
     1261 aatgatatac agaagttagt gggaaaatta aactgggcaa gtcagattta cccaggaatt
     1321 aaaataaggc aactttgtaa actccttagg ggggccaaag cactaacaga aatagtacca
     1381 ctaactaaag aagcagaatt agaattggca gaaaacaggg aaattctaaa agaaccagta
     1441 catggagcat attatgaccc atcaaaagac ttaatagctg aaatccagaa acaggggcgg
     1501 gaccagtgga catatcaaat ttaccaggaa ccattcaaaa atctgaaaac agggaagtat
     1561 gcaaaaatga ggactgccca cactaatgat gtaaaacagt taacagaggc tgtgcagaaa
     1621 atagccatgg aaagcatagt aatatgggga aagactccta aatttagatt acccatccaa
     1681 aaggaaacat gggagacatg gtg
//

The Output I want:

 LC570899 Human immunodeficiency virus 1 msaha01609 pol gene for pol protein, partial cds
QRPLVPIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI GGFIKVRQYDQIPIEICGXKAIGTVLVGPTPVNIIGRNLLTQLGCTLNFPISPIETVP
VKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGKISKIGPENPYNTPIFAIKKKD
STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
KYTAFTIPSLNNETPGIRYQYNVLPQGWKGSPXIFQASMTKILEPFRAQNPEIVIYQY
MDDLYVGSDLEIGQHRAKIEE

EU158868 HIV-1 isolate G8-AFMC-5 from India gag protein (gag) and pol protein (pol) genes, partial cds.
FFRENLAFPQGEAREFPPEQTRANSPTSRELQVRGANPSSEAGA
ERQGALNCPQITLWQRPLVSIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
GGFIKVRQYDQIPIEICGKXAIGTVLVGPTPVNIIGRNMLTQLGCTLNFPISPIETVP
VKLKPGMDGPKVKQWPLTEEKIKALTEICDEMEKEGKITKIGPENPYNTPIFAIKKKD
STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
KYTAFTIPSTNNETPGIRYQYNVLPQGWKGSPAIFQASMTKILEPFREQNPEIVIYQY
MDDLYVGSDLEIGQHRAKIEELREHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTV
QPIQLPEKDSWTVNDIQKLVGKLNWASQIYPGIKIRQLCKLLRGAKALTEIVPLTKEA
ELELAENREILKEPVHGAYYDPSKDLIAEIQKQGRDQWTYQIYQEPFKNLKTGKYAKM
RTAHTNDVKQLTEAVQKIAMESIVIWGKTPKFRLPIQKETWETW

P.S. As you can be seen in example no.2, there are two genes gag and pol. I want a translated sequence of pol gene only.

gene GenBank genome Biopython sequence • 2.4k views
ADD COMMENT
0
Entering edit mode

It may be possible to change the original query to just get the pol protein sequence. It would avoid any parsing. Which specific viruses did you look at?

ADD REPLY
0
Entering edit mode

HIV-1 virus

ADD REPLY
0
Entering edit mode

I need the solution urgently. If you can help me out, then, please help...

ADD REPLY
0
Entering edit mode

You're going to have to share a couple of additional details with us if what you want is a "solution":

  • Is this the only GenBank record you're interested in?
  • If there are additional records, do you have a list of identifiers?
  • It sounds like this is a sub-problem you're trying to solve. Would it be possible for you to describe the actual problem you have (if this isn't it)?

Also, reposting the same question isn't doing you any favors.

ADD REPLY
0
Entering edit mode

Hey,

The above two records were just examples. In fact, I have a Genbank file which contains 9000 records which I retrieved using an NCBI E-utility tool. And, Yes, I do have a list of identifiers but I need to retrieve the protein sequences from the GB file only. This is the actual problem, I need to extract the protein sequence of the pol gene from this particular GB file in a fast file.

If you have any further questions let me know.....

ADD REPLY
1
Entering edit mode
2.9 years ago
GenoMax 141k

Here is one solution based on EntrezDirect:

$ esearch -db protein -query "Human immunodeficiency virus 1 AND pol [gene]" | efetch -format fasta > protein.fa

Output should look something like this (sequences truncated for space)

$ esearch -db protein -query "Human immunodeficiency virus 1 AND pol [gene]" | efetch -format fasta | grep -A 3 "^>" 
>QRN67799.1 pol protein, partial [Human immunodeficiency virus 1]
WQRPLVTIRVGGQIKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYDQILIEICGKKAIGTV
LVGPTPVNIIGRNMLTQLGCTLNFPISSIETVPVKLKPGMDGPKVKQWPLTEEKIKALXEICEEMEKEGK
ITKIGPENPYNTPVFVIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKXNKSMTVLDVGDAYF
--
>QRN67798.1 pol protein, partial [Human immunodeficiency virus 1]
WQRPLVSIKVGGQIKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYDQIPIEICGKKAIGTV
LVGPTPVNIIGRNMLTQLGCTLNFPISPIETVPVKLKPGMDGPKVRQWPLTEEKIKALTEICEEMEKEGQ
ISKIGPENPYNTPVFAIKRKDGTKWRKLMDFRELNKRTQDFSEVQLGIPHPAGLKKNKQMTVLDVGDAFF
--
>QRN67797.1 pol protein, partial [Human immunodeficiency virus 1]
WQRPLVSIKVGGQIKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYEEIPIEICGKKAIGTV
LVGPTPVNIIGRNILTQLGCTLNFPISTIETVPVKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGK
ITKIGPENPYNTPVFXIKRKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGIRKNKSVTVLDVGDAFF
--
>QRN67796.1 pol protein, partial [Human immunodeficiency virus 1]
WQRPLVSIKVGGQIKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYEEIPIEICGKKAIGTV
LVGPTPVNIIGRNMLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGK
ITKIGPENPYNTPVFXIKRKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGIKKNKSVTVLDVGDAYF
--
ADD COMMENT
1
Entering edit mode
2.9 years ago

check if following code works out (be mindful of copying errors from web):

import sys
from Bio import SeqIO
gbinput= sys.argv[1]
for i in SeqIO.parse(gbinput, "gb"):
    for f in i.features:
        if (f.type == "CDS" and f.qualifiers['gene'] == ['pol']):
            print(">{0}_{1}\n{2}".format(
                i.description, *(f.qualifiers['protein_id']), *(f.qualifiers['translation'])), end="\n")

output (test.gb = op gb):

python gbk2seq.py test.gb    

>Human immunodeficiency virus 1 msaha01609 pol gene for pol protein, partial cds_BCK50781.1
QRPLVPIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYDQIPIEICGXKAIGTVLVGPTPVNIIGRNLLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGKISKIGPENPYNTPIFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFRKYTAFTIPSLNNETPGIRYQYNVLPQGWKGSPXIFQASMTKILEPFRAQNPEIVIYQYMDDLYVGSDLEIGQHRAKIEE
>HIV-1 isolate G8-AFMC-5 from India gag protein (gag) and pol protein (pol) genes, partial cds_ACB20421.1
FFRENLAFPQGEAREFPPEQTRANSPTSRELQVRGANPSSEAGAERQGALNCPQITLWQRPLVSIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYDQIPIEICGKXAIGTVLVGPTPVNIIGRNMLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTEICDEMEKEGKITKIGPENPYNTPIFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFRKYTAFTIPSTNNETPGIRYQYNVLPQGWKGSPAIFQASMTKILEPFREQNPEIVIYQYMDDLYVGSDLEIGQHRAKIEELREHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIQLPEKDSWTVNDIQKLVGKLNWASQIYPGIKIRQLCKLLRGAKALTEIVPLTKEAELELAENREILKEPVHGAYYDPSKDLIAEIQKQGRDQWTYQIYQEPFKNLKTGKYAKMRTAHTNDVKQLTEAVQKIAMESIVIWGKTPKFRLPIQKETWETW

This prints the output to screen. Change the code to print the result to a file. I appended protein ID for better understanding later.

ADD COMMENT
0
Entering edit mode
2.9 years ago
Dunois ★ 2.5k

So tanusharma547 , I have a solution for you.

Here's an R script that will download the protein sequences corresponding to your GenBank identifiers using the rentrez package.

#Loading packages.
library(rentrez)



#This section is just to get a bunch of GenBank nucleotide UIDs.
#The search term below is just an example.

searchterm <- "pol[GENE] AND Human immunodeficiency virus 1[ORGN] AND CDS[FKEY] AND Malaysia[WORD]"

cat("Getting the UIDs for the user-supplied search term.\n")
entrez_db_searchable("nucleotide")
idslist <- entrez_search(db = "nucleotide", 
                      term = searchterm, 
                      use_history = FALSE)
nrecs <- idslist$count
idslist <- entrez_search(db = "nucleotide", 
                      term = searchterm, 
                      use_history = FALSE, retmax = nrecs)

uids <- idslist$ids
rm(idslist)



#Getting the protein sequences associated with the UIDs.
#The while loop below will take some time; but it prints out its progress.
protseqs <- c()
nids <- length(uids)
rstart <- 1
while(rstart < nids){

  rend <- rstart + 100

  if(rend >= nids){
    rend <- nids
  }

  cat("Getting the protein sequences for the NCBI UIDs ", rstart, " to ", rend, "\n")


  curids <- entrez_link(dbfrom = "nucleotide", db = "protein", id = uids[rstart:rend])

  curseqs <- entrez_fetch(db = "protein", id = curids$links$nuccore_protein, rettype = "fasta")

  protseqs <- c(protseqs, curseqs)
  rstart <- rend


}


#Writing out the sequences to a FASTA file named "prots.fasta".
cat("Writing out the protein sequences to file.\n")
write.table(unlist(strsplit(protseqs, "\n")), 
            file = paste0(getwd(), "/prots.fasta"), 
            sep = "\n", 
            row.names = FALSE, 
            col.names = FALSE, 
            quote = FALSE)

cat("All done.\n")

There are two things you'll need to do here.

The first is install the R package rentrez (install.packages("rentrez")). You'll also have to install R and RStudio if you don't have them already (you don't really need RStudio because you can just execute this script from the command line using Rscript <scriptname.R>).

The second thing you need to do is replace the value assigned to the searchterm variable in the script with the query you used to obtain your set of GenBank records. So basically change the line searchterm <- "pol[GENE] AND Human immunodeficiency virus 1[ORGN] AND CDS[FKEY] AND Malaysia[WORD]" to searchterm <- <whatever_was_your_query>.

Then execute the script either via Rscript from the command line or via RStudio. You should get a file named prots.fasta at the location getwd() is pointing to. (I am not sure what OS you're running, but this'd default to /home/username on most Linux distributions.

I hope this helps. If you're still having trouble, and this is really, really urgent, I can download and send you the protein sequences if you can provide me with your GenBank search query or the identifiers.

ADD COMMENT

Login before adding your answer.

Traffic: 1981 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6