Question: Question: Converting Blast Results/Output to FASTA format
0
gravatar for apulunuj
8 days ago by
apulunuj0
apulunuj0 wrote:

Hello,

I am a graduate student with very little background in bioinformatics, and programming. Currently, I am trying to convert my blast output from a job script I ran on a cluster. The output:

Building a new DB, current time: 09/11/2017 21:50:46
New DB name:   /scratch/napulu/input_protdb/ref_prot.fasta
New DB title:  ref_prot.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /scratch/napulu/input_protdb/ref_prot.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.00116897 seconds.

AcanGEN_gi|440795652|gb|ELR16769.1|     XP_002649088.2  VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK     77      455     379

AcanGEN_gi|440798889|gb|ELR19950.1|     XP_020438963.1  LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQGLKAISKKDKYDKIMEKKISTSVEVLCRNASFEFVTYLNYCRSLRFEDRPDYTYLRRLLKDLFIREGFTYDFLFDWTCVYASEKDKKKMLENKNRFDQTADQEGRDQR      113     322     210

Initially, I made a script. That split the line based on tab space. Then I used an if statement to see lines containing amino acids one letter code. Then I split those lines with ','. I wish to extract the query_id, and alignment to make the fasta file. The query_id is supposed to be new_line[0].

Does anyone know a better way of doing this? thank you.

alignment sequence python • 131 views
ADD COMMENTlink modified 7 days ago by kashifalikhan00740 • written 8 days ago by apulunuj0
1

Not exactly sure what you want to extract in terms of the ID but why not so something like: grep Acan blast_result | awk -F ' ' 'BEGIN {OFS="\n"}{print ">"$2,$3}' this will produce

>XP_002649088.2
VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK
>XP_020438963.1
LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQ
ADD REPLYlink written 8 days ago by genomax33k
1

show the blast command you used.

ADD REPLYlink written 8 days ago by st.ph.n1.6k

Here is my blast command.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLYlink written 7 days ago by apulunuj0

how did you captured the output ?

ADD REPLYlink written 7 days ago by Pierre Lindenbaum98k

Ok so it looks like you piped the output of both commands into the same file, or copied/paste right from the terminal. Good practice for a log, bad practice for parsing. If the blast result was a separate file, you would have a much easier time. Use -out blast.out.txt in your blast command to name the output file.

ADD REPLYlink modified 7 days ago • written 7 days ago by st.ph.n1.6k

please, show us the script, show us the desired output.

ADD REPLYlink written 8 days ago by Pierre Lindenbaum98k
#! usr/bin/env python   

import sys

infile =open('com_ortho_para.3248444.sched.txt')

for line in infile:

        line = line.split('     ')
        if (',') in line:
                new_line = line.split(',')
                query_id = new_line[0]
                subject_id = new_line[1]
                alignment = new_line[2]
                output.write('>' + query_id + '\n' + alignment) 
                print output

I am trying to making a fasta file in the format.

 '>' query_id
'alignment sequence'

Here is the blast output https://github.com/napulu123/Research-WGD/blob/master/com_ortho_para.3248444.sched.txt

ADD REPLYlink modified 8 days ago by genomax33k • written 8 days ago by apulunuj0

First, split by '\t', not ' '. Second, splitting by a ',' is probably a bad idea.

ADD REPLYlink written 7 days ago by st.ph.n1.6k

your output looks really badly folded ! what was your blast command ?

ADD REPLYlink written 7 days ago by Pierre Lindenbaum98k

Here it is.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLYlink modified 7 days ago • written 7 days ago by apulunuj0

The output was captured via a job script from which I ran the blast command. Here is the original script.

#!/bin/bash

#PBS -N transcriptome_blast
#PBS -q tiny12core
#PBS -j oe
#PBS -m abe
#PBS -o com_ortho_para.$PBS_JOBID.txt
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:00:10:00

cd $PBS_O_WORKDIR

module purge

module load blast

makeblastdb -in ref_prot.fasta -dbtype prot

blastp  -query transcriptome.fasta -db ref_prot.fasta  -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'
ADD REPLYlink modified 7 days ago by genomax33k • written 7 days ago by apulunuj0

warning:

=>> PBS: job killed: walltime 623 exceeded limit 600
PBS Job Statistics:
PBS Job ID: 3248444.sched
ADD REPLYlink written 7 days ago by Pierre Lindenbaum98k
1
gravatar for Pierre Lindenbaum
7 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:

Here is my blast command. blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

use the option -out fileout.tsv to capture the output in a file. and then use @genomax's awk script.

EDIT: or pipe blastp into genomax's awk and redirect the output

blastp (...) | awk (...) > out.txt
ADD COMMENTlink modified 7 days ago • written 7 days ago by Pierre Lindenbaum98k
0
gravatar for kashifalikhan007
7 days ago by
Cologne
kashifalikhan00740 wrote:
grep "^Acan" ur_blast.output | awk '{print">query: "$1"\tHit: "$2"\tscore: "$4":"$5":"$6"\n"$3}' > ur_result.output
ADD COMMENTlink written 7 days ago by kashifalikhan00740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1389 users visited in the last hour