Question

Script error when extracting sequences of a list of orthologous groups

0

Entering edit mode

7.1 years ago

biomonte ▴ 220

Hi everyone!

I have a list of orthologous groups that are single-copy genes among different species, e.g:

OG1.5_8000: sp1|Gene.10006::gi sp2|Gene.21810::GDRL01072110.1::g.21810::m.21810 sp3|Gene.49740::GECV01105569.1::g.49740::m.49740 sp4|XP_018429109.1 sp5|Gene.23019::GEGJ sp6|TRINITY_DN146402_c0_g3::TRINITY_DN146402_c0_g3_i2::g.23453::m.23453

I have a FASTA file (database) that contains protein sequences for each of those genes, e.g;

>sp1|Gene.10006::gi
HIEEEERNMAPERILITGCNRGIGYELVRQLIERRAPPNQIFATCRDPSSPQNKKLIDLA
ARYPKVVVIKLETTDANSIKEAVKVVEKHLNGAGLNLLINNAGIMPDSNLESADSADFLN
VYNTNVVGPFLVAKELLPFLKKAAAENPRKPLSCSKSAIINISTLLGSIEKTPETFSAYP

I'm using the following bash script to extract the sequences for all orthologous groups:

#!/bin/bash
# This is a bash script that extracts the sequences for all orthologous groups (OG). 
# It takes the a OG ids list as input and saves all sequences belonging to that group 
# from all organism in a file named with OG group in fasta format.
# Note that after the script is executed, there will be 'n' number of files (where 
# n=total number of OG's in the input list

# Arun Seetharam <aseetharam@purdue.edu>

scriptName="${0##*/}"
outdir=$(pwd)
function printUsage() {
    cat <<EOF
Synopsis
    $scriptName [-h | --help] [-o dir_name] input_ids_list database
Description
    Extracts sequences for all ortholog groups supplied as list. For each ID in the list
    a file containing FASTA sequences will be generated, whcih belong to that OG.
    Note: this script requires standalone blast+ software.
    input_ids_list 
        Input list should contain orthologous group IDs one per line
        These IDs should be generated by "orthomclMclToGroups" command
    database
        Absoulute path for the database should be specified. The database is 
        generally named as 'goodProteins.fasta'.
    -o directory_name
        directory name to save the output files. By default all files will be saved in 
        the current directory.
    -h, --help 
        Brings up this help page

Author
    Arun Seetharam, Bioinformatics Core, Purdue University.
    aseetharam@purdue.edu
EOF
}

if [ $# -lt 1 ] ; then
    printUsage
    exit 1
fi

while getopts ':o:' option; do
  case "$option" in
       o) outdir=$OPTARG
       shift
       ;;
       h) printUsage
       exit
       ;;
       help) printUsage
       exit
       ;;
  esac
done
#module use /apps/group/bioinformatics/modules # uncomment these 2 lines if running this on clusters
#module load blast
mkdir -p $outdir
shift $(( $# - 2 ))
file=${1}
pathdbname=${2}
sed -i 's/://g' ${file}
while IFS=$' ' read -r -a myArray
do
for i in "${myArray[@]:0}";
do
blastdbcmd -entry "$i" -db ${pathdbname} >> ${outdir}/${myArray[0]}.fa;
done
done <${file}

However when I run it, somehow, it changes the format of the initial list of orthologous groups to:

OG1.5_8000 sp1|Gene.10006gi sp2|Gene.21810GDRL01072110.1g.21810m.21810 sp3|Gene.49740GECV01105569.1g.49740m.49740 sp4|XP_018429109.1 sp5|Gene.23019GEGJ sp6|TRINITY_DN146402_c0_g3TRINITY_DN146402_c0_g3_i2g.23453m.23453

(colons ":" are ignored).

This causes that BLAST is not able to find the entry in BLAST database and so the script fails. However, I haven't wrote this piece of code and I have no idea on how to modify it to make it work. Could anyone please help me?

Thank you very much in advance!

bash orthomcl software error blast • 1.7k views

ADD COMMENT • link updated 6.0 years ago by ahmedmagds • 0 • written 7.1 years ago by biomonte ▴ 220

0

Entering edit mode

Hey Santiago, I was looking online for a a script to use it on a similar task you mentioned in this thread. I am trying using your script but it does not work for me. Actually the task I have is I have a list of OGs in a matrix. For example: [ OG1|Family1|D17P2#134peg1993 SCC393#128peg1816 Y4#137peg1426|chars 1-216 ] [ OG2|Family1|S23A#141peg1162 SCC4092#140peg898|chars 217-460 ] I have like 3000 OG in the matrix. I will extract them to be in a file and was thinking to use your script to go through the different OG and get me the CDS of these accession from a database folder having all taxa CDS (either in one file or separate files). The accession number are exactly the same. For example:

D17P2#134peg1993 AGCTGTGTAATG D17P2#134peg200 AGCTGTCC I do not need to do blast in my case as there is a direct correspondence. And then I want the script to save each OG (CDSs of diff taxa) in a separate file with same name OG1..etc so I can use it in a program like pal2nal. I am not familiar with coding so thats why I am having difficulty in modifying your script for this task. Help is really appreciated! Thanks! Ahmed

ADD REPLY • link 6.0 years ago by ahmedmagds • 0

score 0 · Answer 1 · 2017-03-17

0

Entering edit mode

7.1 years ago

biomonte ▴ 220

Alright... Sorry for the dummy question. I (almost) fixed myself by changing the sed search mode to:

sed -i 's/://' ${file}

It still gives an error about trying to find the term "OG1.5_8000" which obviously does not exist in my database. But it works :-)

ADD COMMENT • link 7.1 years ago by biomonte ▴ 220