Question: Extract anticodon and other information from multiple genbank files
0
gravatar for jg
3 months ago by
jg0
jg0 wrote:

I have around 1000 gbff (genbank) files and I want to extract from each:

  1. Accession number e.g. LOCUS NZ_CP011636
  2. tax ID e.g. taxon:571
  3. all the tRNA anticodons (zero or multiple per file) e.g. /anticodon=pos:complement(1141190..1141192),aa:Met,seq:cat) (I actually just want the aa:Met,seq:cat part)

(note that the latter sometimes spills across two lines).

and compile output into a table containing rows for each file.

Example of file: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP011636

grep -e "LOCUS" -e "taxon:" -e "anticodon=" Klebsiella_oxytoca.gbff > output.txt

This sort of works but grep doesn't work when #3 spills onto two lines and also I need to store all the output from multiple files. Thanks for any help!

genbank forum • 203 views
ADD COMMENTlink modified 3 months ago by vkkodali930 • written 3 months ago by jg0
4
gravatar for vkkodali
3 months ago by
vkkodali930
United States
vkkodali930 wrote:

This is probably best done using BioPython for a more 'readable' code but if you wish, you can do this using Entrez Direct on the command line using this command

## <filename.txt> should have nucleotide accessions; one per line
epost -db nuccore -input <filename.txt> \
    | efetch -db nuccore -format gb -mode xml -style withparts \
    | xtract -pattern GBSeq -tab '\n' -element GBSeq_accession-version \
        -group GBFeature -if GBFeature_key -equals 'source' \
            -block GBQualifier -if GBQualifier_name -equals 'db_xref' \
            -tab '\n' -element GBQualifier_value \
        -group GBFeature -if GBFeature_key -equals 'tRNA' \
            -block GBQualifier -if GBQualifier_name -equals 'anticodon' \
            -tab '\n' -element GBQualifier_value 

## sample output
NZ_CP011636.1
taxon:571
(pos:complement(60036..60038),aa:Glu,seq:ttc)
(pos:complement(118496..118498),aa:Thr,seq:ggt)
ADD COMMENTlink modified 3 months ago • written 3 months ago by vkkodali930

This is really helpful thanks! Just one issue: this genome has 86 tRNAs but this only spits out 8 lines. Any ideas why? Also, is there a way to give it a list of accession numbers to process and then save the output? Thanks again!

ADD REPLYlink modified 3 months ago • written 3 months ago by jg0
1

To keep the output size manageable I piped it tohead. Remove that and you will see all 86 features

ADD REPLYlink written 3 months ago by vkkodali930

Thanks! Is there a way to give it a list of accession numbers to process and then save all the output? Thanks again!

ADD REPLYlink written 3 months ago by jg0
1

I have updated the command above to clean it up a little bit, remove the final head command and add an epost step to read nucleotide accessions from a file. Replace <filename.txt> with your file. Good luck!

ADD REPLYlink written 3 months ago by vkkodali930

Awesome! Sorry to ask another question but -group command is not found and it doesn't appear to be in the /edirect directory. Any ideas? Thanks!

ADD REPLYlink written 3 months ago by jg0
1

This sounds like a formatting issue. See what happens if you paste the entire command as a single line without the backslashes.

ADD REPLYlink written 3 months ago by vkkodali930
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1613 users visited in the last hour