Question

How to extract certain CDS from a GenBank file in linux terminal

0

Entering edit mode

3.9 years ago

arriyaz.nstu ▴ 30

Hi, I have a large GenBank file that contains multiple records. I got this file after performing BLAST against my query sequence. This .gb file contains all the results (full record in GenBank format for each accession) from the blast search. I want to extract a certain CDS from each record in that .gb file. As additional complexity, this CDS is a join of two regions in the genome. Is there any automated process by using code in the Linux terminal, so that I can extract my target CDS from all records and save them in a separate file? Thanks in advance.

gene sequence genome • 5.3k views

ADD COMMENT • link updated 2.4 years ago by AP ▴ 80 • written 3.9 years ago by arriyaz.nstu ▴ 30

2

Entering edit mode

Please post example gb file and expected output.

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Entering edit mode

Here I don't find any option to upload the GenBank file. But it is very long file, so do I just copy-paste it here?

ADD REPLY • link 3.9 years ago by arriyaz.nstu ▴ 30

1

Entering edit mode

Post one or two records as input and expected output for those one or two records.

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Entering edit mode

Okay, according to your suggestion I cut off some portion of these genbank records. Here, there are two genbank record in a single file:

LOCUS       MK733608              136382 bp    DNA     linear   VRL 13-MAY-2020
DEFINITION  Human gammaherpesvirus 8 strain UNC_KICS010, complete genome.
ACCESSION   MK733608
VERSION     MK733608.1.
FEATURES             Location/Qualifiers
     source          1..136382
                     /organism="Human gammaherpesvirus 8"
                     /mol_type="genomic DNA"
                     /strain="UNC_KICS010"
     gene            1082..17050
                     /gene="ORF4"
                     /note="HHV8GK18_gp02"
     polyA_site      1113..1113
                     /gene="K1"
     CDS             1142..2794
                     /gene="ORF4"
                     /note="complement control protein; membrane protein;
                     contains four SCR domains"
                     /codon_start=1
                     /product="KCP"
                     /protein_id="QFU18707.1"
                     /translation="MAFLRQTLWILWTFTMVIGQDNEKCSQKTLIGYRLKMSRDGDIA
                     VGETVELRCRSGYTTYARNITATCLQGGTWSEPTATCNKKSCPNPGEIQNGKVIFHGG
                     QDALKYGANISYVCNEGYFLVGREYVRYCMIGASGQMAWSSSPPFCEKEKCHRPKIEN
                     GDFKPDKDYYEYNDAVHFECNEGYTLVGPHSIACAVNNTWTSNMPTCELAGCKFPSVT"
ORIGIN
        1 TACTAATTTT CAAAGGCGGG GTTCTGCCAG GCATAGTCTT TTTTTCTGGC GGCCCTTGTG
       61 TATACCTGTC TTTCAGACCT TGTTGGACAT CCTGTACAAT CAAGATGTTC CTGTATGTTG
      121 TCTGCAGTCT GGCGGTTTGC TTTCGAGGAC TATTAAGCCT TTCTCTGTTA TCGTCTCCAA
      181 CTTTGTGCCC TGGAGTGATT TCAACGCCTT ACAATTTGAC CTGTCCGTCT AATACATCCT
      241 TGCCAACATC CTGGTATTGC AACGATACTC GGCTTTTACG AGTGACGCAG GGAACATTGA
//
LOCUS       MK733606              136317 bp    DNA     linear   VRL 13-MAY-2020
DEFINITION  Human gammaherpesvirus 8 strain UNC_KICS009, complete genome.
ACCESSION   MK733606
VERSION     MK733606.1
FEATURES             Location/Qualifiers
     source          1..136317
                     /organism="Human gammaherpesvirus 8"
                     /mol_type="genomic DNA"
                     /strain="UNC_KICS009"
                     /gene="ORF4"
                     /note="HHV8GK18_gp02"
     polyA_site      1113..1113
                     /gene="K1"
     CDS             1142..2794
                     /gene="ORF4"
                     /note="complement control protein; membrane protein;
                     contains four SCR domains"
                     /codon_start=1
                     /product="KCP"
                     /protein_id="QFU18535.1"
                     /translation="MAFLRQTLWILWTFTMVIGQDNEKCSQKTLIGYRLKMSRDGDIA
                     VGETVELRCRSGYTTYARNITATCLQGGTWSEPTATCNKKSCPNPGEIQNGKVIFHGG
                     QDALKYGANISYVCNEGYFLVGREYVRYCMIGASGQMAWSSSPPFCEKEKCHRPKIEN
                     GDFKPDKDYYEYNDAVHFECNEGYTLVGPHSIACAVNNTWTSNMPTCELAGCKFPSVT
                     HGYPIQGFSLTYKHKQSVTFACNDGFVLRGSPTITCNVTEWDPPLPKCVLEDIDDPNN"
ORIGIN
        1 TACTAATTTT CAAAGGCGGG GTTCTGCCAG GCATAGTCTT TTTTTCTGGC GGCCCTTGTG
       61 TATACCTGTC TTTCAGACCT TGTTGGACAT CCTGTACAAT CAAGATGTTC CTGTATGTTG
      121 TCTGCAGTCT GGCGGTTTGC TTTCGAGGAC TATTAAGCCT TTCTCTGTTA TCGTCTCCAA
      181 CTTTGTGCCC TGGAGTGATT TCAACGCCTT ACAATTTGAC CTGTCTGTCT AATGCATCCT
      241 TGCCAATATC CTGGTATTGC AACAATACTC GGCTTTTGCG ACTGACGAAG AACACACTCC
      301 TTGCTGTCGC CATTGCCTGC AATTTTACTT GTGTGGAACA ATCTGGGCAT CGACAGAGC
//

And from this file I want to extract a cds suppose ORF4 (as fasta format) from every record. The output would be as follow (for character limitation I deleted some sequence from the fasta files):

>MK733608.1:1142-2794 Human gammaherpesvirus 8 strain UNC_KICS010, complete genome
ATGGCCTTTTTAAGACAAACACTGTGGATTTTATGGACATTTACCATGGTCATTGGCCAGGACAATGAAA
AGTGTTCCCAAAAAACCTTAATTGGATATAGACTTAAAATGTCTCGTGACGGTGACATTGCAGTTGGAGA
AACAGTGGAATTACGTTGTAGATCTGGATACACTACTTATGCCCGCAATATAACAGCAACATGTTTACAA
GGTGGGACGTGGTCTGAACCAACGGCAACATGTAACAAAAAGTCCTGTCCAAACCCAGGTGAAATACAAA
ATGGAAAGGTTATATTTCATGGTGGACAAGATGCCTTAAAATATGGGGCAAACATTTCATATGTTTGTAA
>MK733606.1:1142-2794 Human gammaherpesvirus 8 strain UNC_KICS009, complete genome
ATGGCCTTTTTAAGACAAACACTGTGGATTTTATGGACATTTACCATGGTCATTGGCCAGGACAATGAAA
AGTGTTCCCAAAAAACCTTAATTGGATATAGACTTAAAATGTCTCGTGACGGTGACATTGCAGTTGGAGA
AACAGTGGAATTACGTTGTAGATCTGGATACACTACTTATGCCCGCAATATAACAGCAACATGTTTACAA
GGTGGGACGTGGTCTGAACCAACGGCAACATGTAACAAAAAGTCCTGTCCAAACCCAGGTGAAATACAAA

ADD REPLY • link 3.9 years ago by arriyaz.nstu ▴ 30

3

Entering edit mode

try this code:

from Bio import SeqIO

    for rec in SeqIO.parse("test.gb", "genbank"):
        for feature in rec.features:
            if feature.type == "CDS":
                seq = feature.qualifiers["translation"]
                id = ">{}:{} {}".format'('rec.id, feature.location, rec.description)
                print("{}\n{}".format(id, *seq))

test.gb = input genbank
Please mind tabs in python (3)
Current code prints the sequences to the screen. Write it to a file once you are okay with the code.
Replace '(' with (

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Entering edit mode

Thank you for your code. I run the code according to your instruction. But, it generating all amino acid sequences for every CDS. But my target is to get DNA sequence of a specific CDS (lets say, ORF4).

ADD REPLY • link 3.9 years ago by arriyaz.nstu ▴ 30

1

Entering edit mode

For nucleotide sequence, please try this:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

_Sequences = []

for rec in SeqIO.parse("test.gb", "genbank"):
    id = "{}:{}{}".format'('rec.id, rec.features[2].location, rec.description)
    sequence = SeqRecord(rec.seq, id=id, description="")
    _Sequences.append(sequence)

SeqIO.write(_Sequences, "test.fa", 'fasta')

Replace '(' with (

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Entering edit mode

Hey, I am trying this code but here I getting the translated sequence of all the genes. But I want a translated sequence of a particular gene only from the GB file. Can you help me out with this?

ADD REPLY • link 2.9 years ago by Gen • 0

0

Entering edit mode

Hey Gen did you figure out your problem I am having the same issue

ADD REPLY • link 2.4 years ago by AP ▴ 80

score 0 · Answer 1 · 2020-07-23

0

Entering edit mode

3.8 years ago

brian.fristensky ▴ 460

See the BIRCH tutorial CREATING DATASETS OF RELATED SEQUENCES at http://home.cc.umanitoba.ca/~psgendb/birchhomedir/BIRCHDEV/public_html/tutorials/bioLegato/dataset/dataset.html

ADD COMMENT • link 3.8 years ago by brian.fristensky ▴ 460