Question: Perl script to download gene sequences
0
gravatar for dllopezr
10 months ago by
dllopezr40
dllopezr40 wrote:

Hi everyone!

This entrez command give me an outpout with the genes sequences of a gen/enzymes for various organism:

esearch -db gene -query "glutaminase-asparaginase [Gene/Protein Name] AND (bacteria [orgn] OR fungi [orgn] OR archaea [orgn]) AND alive [prop]" | efetch -format docsum | xtract -pattern GenomicInfoType -element ChrAccVer -element ChrStart -element ChrStop |xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'

The output is similiar to:

>NC_030957.1:c4121890-4120582 Colletotrichum higginsianum
TGAGAGCTTCTTACTTGTCGACGCTGTTGTTGCCAGCTCTGGTAGCCCATGGTTTCGCCTCCCCAGTCGG
>NC_016603.1:c898826-897759 Acinetobacter pittii
TGTTGACTAAAACTGTTAAATCTTTAGGTTTAGCGATGGGCTTATTAG
>NC_002947.4:c2800289-2799201 Pseudomonas putida
TGAATGCCGCACTGAAAACCTTCGCCCCAAGCGCACTCGCCCTGCTGCTGATCCTGCCATCCAGCGCCTC

But I need to do this for several genes that i have in a first column of a table, like that:

GeneNameA     OtherColumn    OtherColumn

GeneNameB     OtherColumn    Other Colmn

I am searching for a Perl script that read the first column and pass each GeneName to this space of the entrez command : "X" [Gene/Protein Name], and create a multifasta files that contains the sequences for each Gene separetely.

My programming skills are poor yet and I am stuck in this part. I´ll will be grateful with your help!

ncbi sequence perl gene • 506 views
ADD COMMENTlink modified 10 months ago by WouterDeCoster38k • written 10 months ago by dllopezr40
1

I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 10 months ago by WouterDeCoster38k
1
gravatar for lieven.sterck
10 months ago by
lieven.sterck4.5k
VIB, Ghent, Belgium
lieven.sterck4.5k wrote:

no need to go for PERL here, some simple bash will do:

IFS=$'\n'
for i in $( cut -f1 table ); do 
 esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> fasta_file
done
unset IFS

table is your list and fasta_file the output file you want to create (one big multi fasta file with all requested sequences)

EDIT: added IFS declaration to avoid splitting on (unwanted) spaces in the search terms

ADD COMMENTlink modified 10 months ago • written 10 months ago by lieven.sterck4.5k

Thank you so much!

The code work but I forgot to say that my table in excel look like this:

Inositol pyrophosphate synthase  OtherColumn  OtherColumn

There are spaces in the firts column, and the code only read the first world "Inositol"

I translated the excel table to csv and tab separate file but don't work

Any Idea of how to fix it?

ADD REPLYlink modified 10 months ago • written 10 months ago by dllopezr40

Save your file as comma delimited text and then replace cat table | cut -f1 with awk -F ',' '{print $1}' your_file and see if that does the trick.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax65k

or save as a tab delineated text file and then the original command should also work, but the one of genomax is probably more secure

ADD REPLYlink written 10 months ago by lieven.sterck4.5k

Thanks genomax and lieven.sterck. But didn't work, the output is the same.

ADD REPLYlink modified 10 months ago • written 10 months ago by dllopezr40

strange, it should not be.

Can you post the first lines or so of this csv/tab file?

ADD REPLYlink written 10 months ago by lieven.sterck4.5k

Hi, I only test the script with the first row for now, In this case my $i must be Inositol pyrophosphate synthase, but the script do the query only with inositol, and the fasta created is inositol.fasta. I want the query to be over over the whole name and the file created with the name "Inositol_pyrophosphate_synthase.fasta" or similar. This is the file:

Inositol pyrophosphate synthase Inmovilización  Sintesis de polifosfatos    NA  16

this is the script:

#!/bin/sh
for i in $(  cat prueba.txt | cut -f1 ); do 
 esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [orgn] OR fungi [orgn] OR archaea [orgn]) AND alive [prop]" | efetch -format docsum | xtract -pattern GenomicInfoType -element ChrAccVer -element ChrStart -element ChrStop |xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' >> "$i".fasta
done

I am triyng with tab separate files and comma separate values files but the output is the same ever. I also try with genomax variation but not work too.

ADD REPLYlink written 10 months ago by dllopezr40

are "Inositol" and "pyrophosphate" and "synthase" present in a single column in your excel list? if not neither option given will work I'm afraid

Post the result of head prueba.txt , so we can inspect it

ADD REPLYlink modified 10 months ago • written 10 months ago by lieven.sterck4.5k

Hi lieven.sterck

Yes in excel "Inositol" and "pyrophosphate" and "synthase" are present in the same column, and I translated this to csv or tab file. The header of prueba.txt is:

Inositol pyrophosphate synthase Inmovilización  Sintesis de polifosfatos    NA  16

Thank you for your help

ADD REPLYlink written 10 months ago by dllopezr40

OK, figured it out. The loop keep splitting on space regardless of the tab or csv file. Here's how to solve it, you need to disable the space splitting of the for loop or alternatively use a while loop in stead.

for loop: set IFS specifically on newlines:

IFS=$'\n'
    for i in $( cat table | cut -f1 ); do 
     j=${i// /_} 
     esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> ${j}.fasta
    done
unset IFS

or using a while loop:

while read i ; do esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> fasta_file ; done <(cat table | cut -f1 )

I also added a line in the for loop code to create nice fasta file names (same can be done in the while loop as well)

I'll amend my original answer as such too

ADD REPLYlink modified 10 months ago • written 10 months ago by lieven.sterck4.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 753 users visited in the last hour