Question

Extract anticodon and other information from multiple genbank files

0

Entering edit mode

5.4 years ago

jg • 0

I have around 1000 gbff (genbank) files and I want to extract from each:

Accession number e.g. LOCUS NZ_CP011636
tax ID e.g. taxon:571
all the tRNA anticodons (zero or multiple per file) e.g. /anticodon=pos:complement(1141190..1141192),aa:Met,seq:cat) (I actually just want the aa:Met,seq:cat part)

(note that the latter sometimes spills across two lines).

and compile output into a table containing rows for each file.

Example of file: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP011636

grep -e "LOCUS" -e "taxon:" -e "anticodon=" Klebsiella_oxytoca.gbff > output.txt

This sort of works but grep doesn't work when #3 spills onto two lines and also I need to store all the output from multiple files. Thanks for any help!

Genbank • 1.3k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 5.4 years ago by jg • 0

score 4 · Accepted Answer · 2018-11-15

4

Entering edit mode

5.4 years ago

vkkodali_ncbi ★ 3.7k

This is probably best done using BioPython for a more 'readable' code but if you wish, you can do this using Entrez Direct on the command line using this command

## <filename.txt> should have nucleotide accessions; one per line
epost -db nuccore -input <filename.txt> \
    | efetch -db nuccore -format gb -mode xml -style withparts \
    | xtract -pattern GBSeq -tab '\n' -element GBSeq_accession-version \
        -group GBFeature -if GBFeature_key -equals 'source' \
            -block GBQualifier -if GBQualifier_name -equals 'db_xref' \
            -tab '\n' -element GBQualifier_value \
        -group GBFeature -if GBFeature_key -equals 'tRNA' \
            -block GBQualifier -if GBQualifier_name -equals 'anticodon' \
            -tab '\n' -element GBQualifier_value 

## sample output
NZ_CP011636.1
taxon:571
(pos:complement(60036..60038),aa:Glu,seq:ttc)
(pos:complement(118496..118498),aa:Thr,seq:ggt)

ADD COMMENT • link 5.4 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

This is really helpful thanks! Just one issue: this genome has 86 tRNAs but this only spits out 8 lines. Any ideas why? Also, is there a way to give it a list of accession numbers to process and then save the output? Thanks again!

ADD REPLY • link 5.4 years ago by jg • 0

1

Entering edit mode

To keep the output size manageable I piped it tohead. Remove that and you will see all 86 features

ADD REPLY • link 5.4 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

Thanks! Is there a way to give it a list of accession numbers to process and then save all the output? Thanks again!

ADD REPLY • link 5.4 years ago by jg • 0

1

Entering edit mode

I have updated the command above to clean it up a little bit, remove the final head command and add an epost step to read nucleotide accessions from a file. Replace <filename.txt> with your file. Good luck!

ADD REPLY • link 5.4 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

Awesome! Sorry to ask another question but -group command is not found and it doesn't appear to be in the /edirect directory. Any ideas? Thanks!

ADD REPLY • link 5.4 years ago by jg • 0

1

Entering edit mode

This sounds like a formatting issue. See what happens if you paste the entire command as a single line without the backslashes.

ADD REPLY • link 5.4 years ago by vkkodali_ncbi ★ 3.7k