Question

Processing corresponding fasta records from separate multi-fasta files

0

3.9 years ago

Digsby ▴ 10

I am trying to align corresponding fasta records from separate multi-fasta files which all contain the same number of fasta records. Each multi-fasta file contains ordered orthologous nt sequences. The format is as follows:

Multi-fasta for strain 1:

    >Strain1_ortholog1
    ATGC
    >Strain1_ortholog2
    GACT

Multi-fasta for strain 2:

    >Strain2_ortholog1
    ATGC
    >Strain2_ortholog2
    GATT

I have 21 strains, where each strain's multi-fasta file contains 592 ordered orthologs, and I would like my output to be strain-specific aligned multi-fasta files (i.e. the fasta sequences should contain gaps where appropriate). I am wondering if there is a good script/tool I can use to accomplish this. Thanks for any input you can provide!

alignment • 740 views

ADD COMMENT • link 3.9 years ago by Digsby ▴ 10

1

"I would like my output to be strain-specific aligned multi-fasta files"

If you have multi-fasta files that are strain specific they can directly go into a MSA program.

If instead you want all ortholog_1 sequences to go in one file, here is a possible workflow. Split each strain's file into one file per ortholog (faSplit from Jim Kent's utilities is one option). Then cat all the ortholog_1 files together, do the same for every other ortholog, and run the MSA on each of the combined files.
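The split-and-cat step can also be sketched in plain Python instead of faSplit plus cat. This is only a rough sketch: it assumes the i-th record in every strain file is the same ortholog (as the question states), and the names `read_fasta`, `split_by_position` and the `ortholog_<i>.fasta` output naming are made up for illustration.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples in file order."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_position(strain_files, out_dir):
    """Write one unaligned FASTA per ortholog, pairing records by position."""
    per_strain = [read_fasta(p) for p in strain_files]
    counts = {len(records) for records in per_strain}
    if len(counts) != 1:
        raise ValueError(f"record counts differ between files: {sorted(counts)}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # zip(*per_strain) yields the i-th record of every strain together
    for i, group in enumerate(zip(*per_strain), start=1):
        out_path = out_dir / f"ortholog_{i}.fasta"
        with open(out_path, "w") as out:
            for header, seq in group:
                out.write(f">{header}\n{seq}\n")
        written.append(out_path)
    return written
```

Each file written this way holds one ortholog from every strain and can be passed to the MSA program of your choice.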

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

"If you have multi-fasta files that are strain specific they can directly go into a MSA program."

Wouldn't most MSA programs align all the records in a single file, rather than aligning corresponding sequences between files?

ADD REPLY • link 3.9 years ago by Digsby ▴ 10

0

It was not clear from your original question, since it included the part I quoted in my last comment. Based on the example posted, you already appear to have strain-specific files.

"aligning corresponding sequences between files?"

If you mean that all ortholog_1 from different strains need to go in one alignment then follow the second workflow I proposed above. This assumes that Strain1_ortholog_1 directly corresponds to Strain2_ortholog_1 and so on.

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Assuming the OP's requirement is an ortholog-specific MSA:

1. List the headers from a single file (assuming that all files have the same number of orthologs and share the same ortholog names) and extract the ortholog part.
2. Using this list, query all the fasta files, serially or in parallel, for each ortholog (from step 1) and write the matching sequences from all strains to one fasta file per ortholog.
3. Subject each ortholog file to MSA, serially or in parallel, with the same parameters.
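The first two steps above could look roughly like this in Python. It is a sketch under one big assumption: every header has the form `<strain>_<ortholog>` with the strain name before the first underscore, as in the example in the question. The helper names (`parse_fasta`, `ortholog_id`, `group_by_ortholog`) are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ortholog_id(header):
    # Assumption: headers look like "<strain>_<ortholog>"; split on the first "_"
    return header.split("_", 1)[1]


def group_by_ortholog(fasta_texts):
    """Map ortholog ID -> list of (header, sequence), one entry per strain."""
    texts = list(fasta_texts)
    # Step 1: take the ortholog IDs from the first file only
    wanted = [ortholog_id(h) for h, _ in parse_fasta(texts[0])]
    groups = {oid: [] for oid in wanted}
    # Step 2: query every file for each of those ortholog IDs
    for text in texts:
        for header, seq in parse_fasta(text):
            oid = ortholog_id(header)
            if oid in groups:
                groups[oid].append((header, seq))
    return groups


strain1 = ">Strain1_ortholog1\nATGC\n>Strain1_ortholog2\nGACT\n"
strain2 = ">Strain2_ortholog1\nATGC\n>Strain2_ortholog2\nGATT\n"
groups = group_by_ortholog([strain1, strain2])
print(groups["ortholog2"])
# [('Strain1_ortholog2', 'GACT'), ('Strain2_ortholog2', 'GATT')]
```

Matching on the header rather than on record position means the files do not have to be in the same order, but it will fail on headers without an underscore, so adjust `ortholog_id` to your real naming scheme.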

If the OP's requirement is a strain-specific MSA:

1. Decide on the MSA program and its parameters.
2. Write a script that runs the MSA program with the parameters from step 1 on each strain-specific fasta file, serially or in parallel, and writes the result to a strain-specific output file. A bash loop or GNU parallel can do this.
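As an alternative to a bash loop, the same "one MSA run per file, same parameters" idea can be sketched in Python. MAFFT with `--auto` is used here purely as an example of an aligner that writes its alignment to stdout; the thread does not name a program, so swap in whatever you settled on in step 1. The function names and the `.aln.fasta` output suffix are made up for illustration.

```python
import subprocess
from pathlib import Path


def msa_jobs(fasta_files, out_dir, aligner_args=("mafft", "--auto")):
    """Return (command, output_path) pairs, one per input FASTA file."""
    out_dir = Path(out_dir)
    jobs = []
    for fasta in map(Path, fasta_files):
        out_path = out_dir / f"{fasta.stem}.aln.fasta"
        # Same aligner and parameters for every file; only the input changes
        jobs.append(([*aligner_args, str(fasta)], out_path))
    return jobs


def run_jobs(jobs):
    """Run each job serially; the aligner is expected to write to stdout."""
    for command, out_path in jobs:
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w") as out:
            subprocess.run(command, stdout=out, check=True)
```

Something like `run_jobs(msa_jobs(["strain1.fasta", "strain2.fasta"], "aligned"))` would then produce one aligned file per input, provided the aligner is installed and on your PATH. The same two functions work on per-ortholog files if that is the MSA you are after.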

ADD REPLY • link 3.9 years ago by cpad0112 21k