Question: Question: Converting Blast Results/Output to FASTA format
0
gravatar for apulunuj
5 months ago by
apulunuj0
apulunuj0 wrote:

Hello,

I am a graduate student with very little background in bioinformatics, and programming. Currently, I am trying to convert my blast output from a job script I ran on a cluster. The output:

Building a new DB, current time: 09/11/2017 21:50:46
New DB name:   /scratch/napulu/input_protdb/ref_prot.fasta
New DB title:  ref_prot.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /scratch/napulu/input_protdb/ref_prot.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.00116897 seconds.

AcanGEN_gi|440795652|gb|ELR16769.1|     XP_002649088.2  VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK     77      455     379

AcanGEN_gi|440798889|gb|ELR19950.1|     XP_020438963.1  LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQGLKAISKKDKYDKIMEKKISTSVEVLCRNASFEFVTYLNYCRSLRFEDRPDYTYLRRLLKDLFIREGFTYDFLFDWTCVYASEKDKKKMLENKNRFDQTADQEGRDQR      113     322     210

Initially, I made a script. That split the line based on tab space. Then I used an if statement to see lines containing amino acids one letter code. Then I split those lines with ','. I wish to extract the query_id, and alignment to make the fasta file. The query_id is supposed to be new_line[0].

Does anyone know a better way of doing this? thank you.

alignment sequence python • 301 views
ADD COMMENTlink modified 5 months ago by kashifalikhan00740 • written 5 months ago by apulunuj0
1

Not exactly sure what you want to extract in terms of the ID but why not so something like: grep Acan blast_result | awk -F ' ' 'BEGIN {OFS="\n"}{print ">"$2,$3}' this will produce

>XP_002649088.2
VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK
>XP_020438963.1
LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQ
ADD REPLYlink written 5 months ago by GenoMax42k
1

show the blast command you used.

ADD REPLYlink written 5 months ago by st.ph.n2.1k

Here is my blast command.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLYlink written 5 months ago by apulunuj0

how did you captured the output ?

ADD REPLYlink written 5 months ago by Pierre Lindenbaum104k

Ok so it looks like you piped the output of both commands into the same file, or copied/paste right from the terminal. Good practice for a log, bad practice for parsing. If the blast result was a separate file, you would have a much easier time. Use -out blast.out.txt in your blast command to name the output file.

ADD REPLYlink modified 5 months ago • written 5 months ago by st.ph.n2.1k

please, show us the script, show us the desired output.

ADD REPLYlink written 5 months ago by Pierre Lindenbaum104k
#! usr/bin/env python   

import sys

infile =open('com_ortho_para.3248444.sched.txt')

for line in infile:

        line = line.split('     ')
        if (',') in line:
                new_line = line.split(',')
                query_id = new_line[0]
                subject_id = new_line[1]
                alignment = new_line[2]
                output.write('>' + query_id + '\n' + alignment) 
                print output

I am trying to making a fasta file in the format.

 '>' query_id
'alignment sequence'

Here is the blast output https://github.com/napulu123/Research-WGD/blob/master/com_ortho_para.3248444.sched.txt

ADD REPLYlink modified 5 months ago by GenoMax42k • written 5 months ago by apulunuj0

First, split by '\t', not ' '. Second, splitting by a ',' is probably a bad idea.

ADD REPLYlink written 5 months ago by st.ph.n2.1k

your output looks really badly folded ! what was your blast command ?

ADD REPLYlink written 5 months ago by Pierre Lindenbaum104k

Here it is.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLYlink modified 5 months ago • written 5 months ago by apulunuj0

The output was captured via a job script from which I ran the blast command. Here is the original script.

#!/bin/bash

#PBS -N transcriptome_blast
#PBS -q tiny12core
#PBS -j oe
#PBS -m abe
#PBS -o com_ortho_para.$PBS_JOBID.txt
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:00:10:00

cd $PBS_O_WORKDIR

module purge

module load blast

makeblastdb -in ref_prot.fasta -dbtype prot

blastp  -query transcriptome.fasta -db ref_prot.fasta  -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'
ADD REPLYlink modified 5 months ago by GenoMax42k • written 5 months ago by apulunuj0

warning:

=>> PBS: job killed: walltime 623 exceeded limit 600
PBS Job Statistics:
PBS Job ID: 3248444.sched
ADD REPLYlink written 5 months ago by Pierre Lindenbaum104k
1
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum104k wrote:

Here is my blast command. blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

use the option -out fileout.tsv to capture the output in a file. and then use @genomax's awk script.

EDIT: or pipe blastp into genomax's awk and redirect the output

blastp (...) | awk (...) > out.txt
ADD COMMENTlink modified 5 months ago • written 5 months ago by Pierre Lindenbaum104k
0
gravatar for kashifalikhan007
5 months ago by
Cologne
kashifalikhan00740 wrote:
grep "^Acan" ur_blast.output | awk '{print">query: "$1"\tHit: "$2"\tscore: "$4":"$5":"$6"\n"$3}' > ur_result.output
ADD COMMENTlink written 5 months ago by kashifalikhan00740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 965 users visited in the last hour