Question: Question: Converting Blast Results/Output to FASTA format
2
gravatar for apulunuj
3.2 years ago by
apulunuj30
apulunuj30 wrote:

Hello,

I am a graduate student with very little background in bioinformatics, and programming. Currently, I am trying to convert my blast output from a job script I ran on a cluster. The output:

Building a new DB, current time: 09/11/2017 21:50:46
New DB name:   /scratch/napulu/input_protdb/ref_prot.fasta
New DB title:  ref_prot.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /scratch/napulu/input_protdb/ref_prot.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.00116897 seconds.

AcanGEN_gi|440795652|gb|ELR16769.1|     XP_002649088.2  VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK     77      455     379

AcanGEN_gi|440798889|gb|ELR19950.1|     XP_020438963.1  LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQGLKAISKKDKYDKIMEKKISTSVEVLCRNASFEFVTYLNYCRSLRFEDRPDYTYLRRLLKDLFIREGFTYDFLFDWTCVYASEKDKKKMLENKNRFDQTADQEGRDQR      113     322     210

Initially, I made a script. That split the line based on tab space. Then I used an if statement to see lines containing amino acids one letter code. Then I split those lines with ','. I wish to extract the query_id, and alignment to make the fasta file. The query_id is supposed to be new_line[0].

Does anyone know a better way of doing this? thank you.

alignment sequence python • 1.7k views
ADD COMMENTlink modified 3.2 years ago by kashiff007130 • written 3.2 years ago by apulunuj30
1

Not exactly sure what you want to extract in terms of the ID but why not so something like: grep Acan blast_result | awk -F ' ' 'BEGIN {OFS="\n"}{print ">"$2,$3}' this will produce

>XP_002649088.2
VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK
>XP_020438963.1
LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQ
ADD REPLYlink written 3.2 years ago by genomax92k
1

show the blast command you used.

ADD REPLYlink written 3.2 years ago by st.ph.n2.5k

Here is my blast command.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLYlink written 3.2 years ago by apulunuj30

how did you captured the output ?

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum131k

Ok so it looks like you piped the output of both commands into the same file, or copied/paste right from the terminal. Good practice for a log, bad practice for parsing. If the blast result was a separate file, you would have a much easier time. Use -out blast.out.txt in your blast command to name the output file.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by st.ph.n2.5k

please, show us the script, show us the desired output.

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum131k
#! usr/bin/env python   

import sys

infile =open('com_ortho_para.3248444.sched.txt')

for line in infile:

        line = line.split('     ')
        if (',') in line:
                new_line = line.split(',')
                query_id = new_line[0]
                subject_id = new_line[1]
                alignment = new_line[2]
                output.write('>' + query_id + '\n' + alignment) 
                print output

I am trying to making a fasta file in the format.

 '>' query_id
'alignment sequence'

Here is the blast output https://github.com/napulu123/Research-WGD/blob/master/com_ortho_para.3248444.sched.txt

ADD REPLYlink modified 3.2 years ago by genomax92k • written 3.2 years ago by apulunuj30

First, split by '\t', not ' '. Second, splitting by a ',' is probably a bad idea.

ADD REPLYlink written 3.2 years ago by st.ph.n2.5k

your output looks really badly folded ! what was your blast command ?

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum131k

Here it is.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by apulunuj30

The output was captured via a job script from which I ran the blast command. Here is the original script.

#!/bin/bash

#PBS -N transcriptome_blast
#PBS -q tiny12core
#PBS -j oe
#PBS -m abe
#PBS -o com_ortho_para.$PBS_JOBID.txt
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:00:10:00

cd $PBS_O_WORKDIR

module purge

module load blast

makeblastdb -in ref_prot.fasta -dbtype prot

blastp  -query transcriptome.fasta -db ref_prot.fasta  -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'
ADD REPLYlink modified 3.2 years ago by genomax92k • written 3.2 years ago by apulunuj30

warning:

=>> PBS: job killed: walltime 623 exceeded limit 600
PBS Job Statistics:
PBS Job ID: 3248444.sched
ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum131k
1
gravatar for Pierre Lindenbaum
3.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Here is my blast command. blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

use the option -out fileout.tsv to capture the output in a file. and then use @genomax's awk script.

EDIT: or pipe blastp into genomax's awk and redirect the output

blastp (...) | awk (...) > out.txt
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Pierre Lindenbaum131k
0
gravatar for kashiff007
3.2 years ago by
kashiff007130
Cologne
kashiff007130 wrote:
grep "^Acan" ur_blast.output | awk '{print">query: "$1"\tHit: "$2"\tscore: "$4":"$5":"$6"\n"$3}' > ur_result.output
ADD COMMENTlink written 3.2 years ago by kashiff007130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1019 users visited in the last hour