Question

Sorting BLAST output files together?

9.2 years ago

zgayk ▴ 90

Hi,

I have five BLASTn tabular output files, each produced by querying the same large gene list (same query) against a different subject genome database. The goal was to identify putative orthologous sequences between the query gene list and the five subject genomes.

I was able to identify most, if not all, of the same genes in each genome.

Now I would like to concatenate the five files and sort them by gene identifier, so that the names and sequences for the same gene from all five genomes end up in the same row, one column per genome. The goal is to then extract the sequences for all five species for every gene and align them. I am working with a very large number of genes here.

Is there a way to do this using cat and sort, or would Python work better? I am a bit clueless as to how to do this in Python.

Thanks in advance, Zach

blast • 2.7k views

ADD COMMENT • link 9.2 years ago by zgayk ▴ 90

"Is there a way to do this using cat and sort"

Yes. What have you tried?

ADD REPLY • link 9.2 years ago by Pierre Lindenbaum 166k

What I'm not sure about is this: if a gene is missing from one of the BLAST files but present in the other four, wouldn't the genes fail to line up across all five species?

I have not actually used cat and sort, but have been reading that this might work. Would you have any ideas of a possible script?
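
The missing-gene concern can be handled by keying every hit on the query gene ID and filling gaps with a placeholder, rather than relying on row order. Below is a minimal sketch, assuming tab-separated BLAST rows with the query ID in column 1 and the subject ID in column 2, and assuming the first hit listed for each query is the one to keep. The function names, the species labels, and the "NA" placeholder are hypothetical.

```python
def best_hit_per_gene(lines):
    """Map query gene ID -> fields of the first hit seen for that gene."""
    hits = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split("\t")
        hits.setdefault(fields[0], fields)  # keep only the first hit per query
    return hits


def align_genes(per_species, missing="NA"):
    """Build one row per gene with one column per species.

    per_species: dict of species label -> result of best_hit_per_gene().
    Genes absent from a species get the `missing` placeholder.
    """
    species = list(per_species)
    all_genes = sorted(set().union(*(h.keys() for h in per_species.values())))
    table = []
    for gene in all_genes:
        row = [gene]
        for sp in species:
            hit = per_species[sp].get(gene)
            row.append(hit[1] if hit else missing)  # column 2 = subject ID
        table.append(row)
    return species, table
```

Each input file would be passed through `best_hit_per_gene` (for example via `open(path)`), and the resulting dictionaries handed to `align_genes`. A gene found in only four genomes still gets its own row, with the placeholder in the fifth column.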

ADD REPLY • link 9.2 years ago by zgayk ▴ 90

This is all I have done so far. It grouped all the sequences from the same species together, so I need to figure out how to modify sort to order by gene first, and then by species. I am not sure how to deal with genes that are missing in one genome but present in the others.

cat outputExpandedPA.blast.txt outputExpandedGS.blast.txt outputExpandedGG.blast.txt outputExpandedFG.blast.txt outputExpandedCl.blast.txt > Combined.txt | sort
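
One thing worth checking in the command above: the `>` redirect sends everything `cat` produces into Combined.txt, so `sort` receives no input and the file keeps the original per-file order, which would explain the per-species grouping. A minimal Python sketch of sorting by gene first and then by species follows, assuming tab-separated rows with the query gene ID in the first column. The species labels and the function name `tag_and_sort` are made up for illustration.

```python
def tag_and_sort(files_by_species):
    """Tag each hit with its species and sort by (gene, species).

    files_by_species: dict of species label -> iterable of tabular lines.
    Returns rows shaped like [gene, species, remaining BLAST columns...].
    """
    rows = []
    for species, lines in files_by_species.items():
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):  # skip blanks and comments
                continue
            fields = line.split("\t")
            rows.append([fields[0], species] + fields[1:])
    # Python's sort is stable, so hits for the same gene and species
    # stay in the order BLAST reported them.
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


# Hypothetical usage with two of the files named above:
# with open("outputExpandedPA.blast.txt") as pa, open("outputExpandedGS.blast.txt") as gs:
#     for row in tag_and_sort({"PA": pa, "GS": gs}):
#         print("\t".join(row))
```

Tagging each row with its source species before sorting is what makes the second sort key possible, since the concatenated file on its own no longer records which genome a hit came from.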

ADD REPLY • link 9.2 years ago by zgayk ▴ 90

Is the output in one of the tabular blast output formats? If not, doing a simple cat/sort will not work.

ADD REPLY • link 9.2 years ago by GenoMax 152k

I outputted the blast results in output format 7, the one that gives the actual sequences of both subject and match.

The sort worked, but I'm just not sure how to modify it to line up the sequences for each gene so I can extract the sequences for each species and then align them.
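
For the extraction step, one option is to group the subject sequences by gene and emit one FASTA block per gene, ready for an aligner. This is only a sketch under assumptions: format 7 includes sequences only when columns such as `sseq` are requested explicitly, so the position of that column (`sseq_col`, zero-based) has to be supplied by you. The function name and the `species|gene` header style are hypothetical.

```python
from collections import defaultdict


def fasta_by_gene(files_by_species, sseq_col):
    """Return dict of gene ID -> FASTA text with one record per species.

    files_by_species: dict of species label -> iterable of tabular lines.
    sseq_col: zero-based index of the aligned subject sequence column
              (an assumption about how the BLAST output was configured).
    Only the first hit per gene and species is kept.
    """
    groups = defaultdict(list)
    for species, lines in files_by_species.items():
        seen = set()
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):  # format 7 comment lines
                continue
            fields = line.split("\t")
            gene = fields[0]
            if gene in seen:
                continue
            seen.add(gene)
            seq = fields[sseq_col].replace("-", "")  # drop alignment gaps
            groups[gene].append((species, seq))
    return {
        gene: "".join(f">{sp}|{gene}\n{seq}\n" for sp, seq in records)
        for gene, records in groups.items()
    }
```

Each value in the returned dictionary could be written to its own file (for example `<gene>.fasta`) and passed to an alignment program. Genes missing from a genome simply yield fewer records in that gene's block.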

ADD REPLY • link 9.2 years ago by zgayk ▴ 90

Have you tried any Bio-* parsers?

- http://biopython.org/DIST/docs/tutorial/Tutorial.html
- http://search.cpan.org/dist/BioPerl/Bio/SearchIO/blast.pm

ADD REPLY • link 9.2 years ago by Khader Shameer 18k

I am familiar with Biopython, but have not used it for this task. Could you recommend a particular Biopython function for this?

Thanks, Zach

ADD REPLY • link 9.2 years ago by zgayk ▴ 90