Question: Extract anticodon and other information from multiple genbank files
0
gravatar for jg
26 days ago by
jg0
jg0 wrote:

I have around 1000 gbff (genbank) files and I want to extract from each:

  1. Accession number e.g. LOCUS NZ_CP011636
  2. tax ID e.g. taxon:571
  3. all the tRNA anticodons (zero or multiple per file) e.g. /anticodon=pos:complement(1141190..1141192),aa:Met,seq:cat) (I actually just want the aa:Met,seq:cat part)

(note that the latter sometimes spills across two lines).

and compile output into a table containing rows for each file.

Example of file: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP011636

grep -e "LOCUS" -e "taxon:" -e "anticodon=" Klebsiella_oxytoca.gbff > output.txt

This sort of works but grep doesn't work when #3 spills onto two lines and also I need to store all the output from multiple files. Thanks for any help!

genbank forum • 148 views
ADD COMMENTlink modified 26 days ago by vkkodali560 • written 26 days ago by jg0
4
gravatar for vkkodali
26 days ago by
vkkodali560
United States
vkkodali560 wrote:

This is probably best done using BioPython for a more 'readable' code but if you wish, you can do this using Entrez Direct on the command line using this command

## <filename.txt> should have nucleotide accessions; one per line
epost -db nuccore -input <filename.txt> \
    | efetch -db nuccore -format gb -mode xml -style withparts \
    | xtract -pattern GBSeq -tab '\n' -element GBSeq_accession-version \
        -group GBFeature -if GBFeature_key -equals 'source' \
            -block GBQualifier -if GBQualifier_name -equals 'db_xref' \
            -tab '\n' -element GBQualifier_value \
        -group GBFeature -if GBFeature_key -equals 'tRNA' \
            -block GBQualifier -if GBQualifier_name -equals 'anticodon' \
            -tab '\n' -element GBQualifier_value 

## sample output
NZ_CP011636.1
taxon:571
(pos:complement(60036..60038),aa:Glu,seq:ttc)
(pos:complement(118496..118498),aa:Thr,seq:ggt)
ADD COMMENTlink modified 25 days ago • written 26 days ago by vkkodali560

This is really helpful thanks! Just one issue: this genome has 86 tRNAs but this only spits out 8 lines. Any ideas why? Also, is there a way to give it a list of accession numbers to process and then save the output? Thanks again!

ADD REPLYlink modified 26 days ago • written 26 days ago by jg0
1

To keep the output size manageable I piped it tohead. Remove that and you will see all 86 features

ADD REPLYlink written 26 days ago by vkkodali560

Thanks! Is there a way to give it a list of accession numbers to process and then save all the output? Thanks again!

ADD REPLYlink written 26 days ago by jg0
1

I have updated the command above to clean it up a little bit, remove the final head command and add an epost step to read nucleotide accessions from a file. Replace <filename.txt> with your file. Good luck!

ADD REPLYlink written 25 days ago by vkkodali560

Awesome! Sorry to ask another question but -group command is not found and it doesn't appear to be in the /edirect directory. Any ideas? Thanks!

ADD REPLYlink written 24 days ago by jg0
1

This sounds like a formatting issue. See what happens if you paste the entire command as a single line without the backslashes.

ADD REPLYlink written 24 days ago by vkkodali560
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1287 users visited in the last hour