Question: Blast+ results file parsing to fasta file
1
gravatar for ahmedakhokhar
4.2 years ago by
ahmedakhokhar110
Belgium
ahmedakhokhar110 wrote:

Dear all, 

I'm new to this forum and new to computational analysis, I'm using standalone NCBI Blast+ (blastp) for the first time, and I have the results file in the following format: 

Query= Y

Length=6

Subject= X 

Length=739


 Score = 15.4 bits (28),  Expect = 0.044, Method: Composition-based stats.
 Identities = 5/6 (83%), Positives = 6/6 (100%), Gaps = 0/6 (0%)

Query  1    DDDIPF  6
            D+DIPF
Sbjct  244  DNDIPF  250

 

But I want to do a multiple alignments of all the hits, for that, I need to extract the sequences in the following fasts format: 

>Subject= X

Sbjct  244  DNDIPF

Is there any tool which is helpful to do direct multiple alignments from blast results files or a tool/tutorial to extract the sequences in fasta format to process further. Thanks. 

blast sequence python ncbi • 5.2k views
ADD COMMENTlink modified 4.2 years ago by Bara'a230 • written 4.2 years ago by ahmedakhokhar110
5
gravatar for Siva
4.2 years ago by
Siva1.6k
United States
Siva1.6k wrote:

Run 'blastp' with the following outfmt option

-outfmt '6 qseqid sseqid sseq'

qseqid means Query Seq-id

sseqid means Subject Seq-id

sseq means Aligned part of subject sequence

Then use awk to write the hit sequence in FASTA format

awk 'BEGIN { OFS = "\n" } { print ">"$2, $3 }' <YOUR_BLAST_RESULT_FILE>

Are you sure you want to align the HSPs not the full sequences of BLAST hits? What is your goal?

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Siva1.6k

Dear Siva, my final goal is to take the "subject" sequences from the blastp report (results) and do multiple alignment to the see their evolution patterns. 

I'm trying to do blast+ against my own database and wanted to take these sequences only. 

ADD REPLYlink written 4.2 years ago by ahmedakhokhar110
2

If you only take the aligned portion of the subject sequences from the BLAST result, you might end up aligning different segments of the same sequence against one another. To avoid that you also need to take in to account the aligning regions of the query sequence.

The simpler way would be to align the full length sequences of the BLAST hits. You can get the list of all Subject sequence IDs by using any of the following options for -outfmt

sallgi means All subject GIs

sallseqid means All subject Seq-id(s), separated by a ';'

sallacc means All subject accessions

Once you have the list of sequence IDs or GIs, you can use 'blastdbcmd' tool to extract the full length sequences from your BLASTdb, provided it was created with the option '-parse_seqids'. Then, you can continue with multiple alignment and other evolutionary analyses.

If you don't know already, you will find answers for most of your BLAST related questions here:

BLAST Command Line Applications User Manual

If you still want to go with getting multiple alignment from BLAST results, there is a tool called Blammer from HHsuite that converts BLAST/PSI-BLAST result in to multiple alignments. Though I have never used it.

ADD REPLYlink written 4.2 years ago by Siva1.6k

Thanks Siva, this worked exactly the way I anticipated but 

{ print ">"$2, $3 }

failed to write the data into the file, ( $2, $3 } ) remain empty, any ideas ? 

ADD REPLYlink written 4.2 years ago by ahmedakhokhar110
1

As Ram suggested, please post the full BLAST and awk commands you used. If you have multiple HSPs from one sequence (different regions of the same sequence), using the awk command I posted above will result in multiple sequences with same FASTA headers. Multiple sequence alignment programs expect unique sequence identifiers.
 

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Siva1.6k

Dear Siva & Ram,

I have used the following commends: 

# to get blastp result fle:

blastp -query file.txt -subject file.fasta -outfmt '6 qseqid sseqid sseq evalue' -out file_align.txt

# to write the data.file in fasta :

awk 'BEGIN { OFS = "\n"} {print ">" $1 $2, $3}' file_align.txt > file2.txt

Thanks for the help, Still, I have one more question, I want to restrict my blastp results to evalue<0.001, and I tried to do in with -outfmt ''-outfmt '6 qseqid sseqid sseq evalue<0.001", but this filter doesn't seem to work, any idea ?? 

 

ADD REPLYlink modified 4.2 years ago by RamRS20k • written 4.2 years ago by ahmedakhokhar110
1

For the awk part, try specifying the separator explicitly using awk -F "\t" 'BEGIN ... 

In the second part, where you're trying to filter, you're adding the filter to the formatting option, which is incorrect. You might wanna add the evalue filter option with -evalue 0.00099 to the blastp command

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by RamRS20k

Thank you RamRS, This is what I say looking for. Now my sequences are ready for MSA. 

ADD REPLYlink written 4.2 years ago by ahmedakhokhar110

Could you maybe give you the exact full command you used? That might help!

ADD REPLYlink written 4.2 years ago by RamRS20k
1
gravatar for Jean-Karim Heriche
4.2 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche18k wrote:

You could parse blastp's tabular output using your favorite scripting language and output data in fasta format or use XSLT to convert blastp's xml output to fasta (to get started, have a look at Pierre Lindenbaum's stylesheet here).

ADD COMMENTlink written 4.2 years ago by Jean-Karim Heriche18k
0
gravatar for Bara'a
4.2 years ago by
Bara'a230
Amman - Jordan
Bara'a230 wrote:

@ahmedakhokhar : I think the first portion of your question has already been answered by colleagues , but I can help you though if you still have problems retrieving the hit sequences.

Regarding the second portion of your question, I can recommend you Clustal Omega if you are dealing with extremely large sequence files, which outperforms ClustalW2 in efficiency and memory requirements. It's a quite easy to use Global Multiple Alignment tool with useful features such as the calculation of pairwise distance matrix that represents the evolutionary distance between each pair of sequences. You can refer to the official site for more information : http://www.clustal.org/

Please let me know if you need a hand in your script, I will be happy to help.

ADD COMMENTlink modified 4.2 years ago by RamRS20k • written 4.2 years ago by Bara'a230
Sorry for the untidy answer , I don't know what's wrong with my browser :(
ADD REPLYlink written 4.2 years ago by Bara'a230
1

No worries, I cleaned it up for you :)

ADD REPLYlink written 4.2 years ago by RamRS20k
Well thank you so much for that @RamRS :D
ADD REPLYlink written 4.2 years ago by Bara'a230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 715 users visited in the last hour