How to align huge number of sequences to a reference sequence
2
0
Entering edit mode
2.4 years ago

I am trying to align large number of sequences (~ 5 million) for mutation analysis. I have a reference sequence against which I need to align. Instead of performing multiple sequence alignment, which looks nigh impossible, can I perform pairwise alignments of each sequence with the reference sequence?

reference sequence fasta alignment align • 4.5k views
ADD COMMENT
0
Entering edit mode

what kind of sequences do you have? (long? short?) Why do you want/think you need to use MSA ? and lastly, what have you tried/looked for already?

(aka, provide more info to get more meaningful replies)

ADD REPLY
0
Entering edit mode

I have protein sequences in fasta format. They correspond to a viral protein whose length is around 1000 amino acids. To clarify, I do not need MSA here. I have the wild type sequence and want to align all sequences wrt it. I had tried MUSCLE previously but that did not work here

ADD REPLY
0
Entering edit mode

none of the 'classical' aligners will be able to do this (align protein to DNA). They all work with same type of sequences.

what I think you are looking for is protein-mappers (rather in the gene prediction area than aligner area), perhaps things like genewise, genomethreader, ... might be of use.

ADD REPLY
0
Entering edit mode

Actually by wild type sequence I meant protein sequence and not the DNA/RNA. Sorry for the confusion. I am indeed mapping protein to protein here.

ADD REPLY
1
Entering edit mode

ok, makes more sense indeed.

Did you try the new clustal omega ? it should be able to take a huge number of input sequences to perform alignment with.

As an alternative, you can try to do progressive alignment, as in: take the first 100 sequences, align those, get the result and use that to align the next 100 to it. Tools such as t-coffee, muscle, emma, ... can also take a profile (==set of already aligned sequences) as input, as well as a fasta file.

ADD REPLY
2
Entering edit mode
2.4 years ago
5heikki 11k
for f in $(find /place/with/5mseqs -maxdepth 1 -type f -name "*.fasta"); do
    alignProgram -input "$f" -reference /place/with/reference/genome.fasta -output ref_vs_$(basename "$f")
done
ADD COMMENT
0
Entering edit mode

The aligner program I am using (MUSCLE) does not have a parameter for reference sequence. Is it possible to design a script that runs in this manner:

  1. Pick the first sequence from my 50 million ones.
  2. Create a temporary fasta file of 2 sequences, first the ref then the picked sequence.
  3. Align these two which creates an output (temp.out). This way the sequence is forced to align only to the ref.
  4. Extract the aligned sequence (this will be the second sequence in temp.out) and paste it to the final output (output_main)

In this way run the loop for all sequences, each time aligning with the ref sequence and adding the aligned sequence to the main file (output_main).

Apologies if this seems unclear or too contrived.

ADD REPLY
1
Entering edit mode

Example sequences:

>REF
TAAATGGGC

>SEQ1
TAAAATGGGC

>SEQ2
TAAATGGGGC

>SEQ3
TAAAATGGGGC

Alignments:

REF  TAAA-TGGGC
SEQ1 TAAAATGGGC

REF  TAAATGGG-C
SEQ2 TAAATGGGGC

REF  TAAA-TGGG-C
SEQ3 TAAAATGGGGC

Final output:

SEQ1 TAAAATGGGC
SEQ2 TAAATGGGGC
SEQ3 TAAAATGGGGC

I'm pretty sure that this is not what you actually want..

ADD REPLY
0
Entering edit mode

Actually this is exactly what I need. Since my ref is the WT protein, there is hardly any chance there will be any gaps in the ref sequence when aligned. The remaining sequences are equal or shorter in length to the ref sequence, so after alignment, the query sequences may incorporate 1 or more gaps in the alignment. In your example, the ref is shorter than seq 1-3. However the final output is what I'm looking for.

I'm not very proficient in Linux. Could you help me devise a code for this?

The final output should be in fasta as well:

Seq1 TAAAATGGGC Seq2 TAAATGGGGC Seq3 TAAAATGGGGC

Actually MUSCLE by default outputs the alignments in fasta format. All I have to do is pick the second alignment (query) and paste it to the final output.

ADD REPLY
1
Entering edit mode
cat ref.fna
>REF
TAAATGGGC

cat all.fna
>SEQ1
TAAAATGGGC
>SEQ2
TAAATGGGGC
>SEQ3
TAAAATGGGGC

export IFS=$'\n'; while read l1; read l2; do muscle -in <(cat ref.fna <(echo -e "$l1\n$l2")) -out alignment.fna; tail -n2 alignment.fna >> finalAlignment.fna; done<all.fna

cat finalAlignment.fna
>SEQ1
TAAAATGGGC
>SEQ2
TAAATGGGGC
>SEQ3
TAAAATGGGGC

Notice that we assume here that there are no linebreaks in any sequences. If there are, this will not work. If muscle creates linebreaks, this will not work. So first you need to linearize your input sequences and then if muscle creates linebreaks you also need to linearize alignment.fna before the tail command

If muscle doesn't create linebreaks you can skip creating alignment.fna, i.e.:

export IFS=$'\n'; while read l1; read l2; do muscle -in <(cat ref.fna <(echo -e "$l1\n$l2")) -out /dev/stdout | tail -n2 >> finalAlignment.fna; done<all.fna

Oh also here I assume that muscle doesn't change the order of input sequences. This was the case with the example data. I don't know if it's the case with your data. Edit. Actually, muscle can change input order. It used to have an option -stable which prevented this, but it doesn't currently have that. So you are probably better off creating alignment.fna and then parsing your non-reference sequence from that with e.g. awk to the finalAlignment.fna file..

ADD REPLY
0
Entering edit mode

technically this will work indeed, but I fear a bit it might not make biological sense in the end. Final result will greatly depend on the order in which you align them.

Perhaps first calculating a distance matrix and then align them using the distance info might lead to better final result, but still ...

ADD REPLY
0
Entering edit mode

Thank you so much

ADD REPLY
0
Entering edit mode

For some reason I'm getting an error as:

Unexpected '>' in FASTA sequence data tail: cannot open 'alignment.fna' for reading: No such file or directory

I ran the exact same files with same filenames as shown in here.

Update: This has been solved. The lack of a line at the end of the sequence was causing the problem. All I had to do was press "Enter" and save.

ADD REPLY
1
Entering edit mode
2.4 years ago
jv ★ 1.8k

I was actually doing something very similar myself recently and settled on pairwise blastp alignments with the reference sequence as the subject sequence. All 5 million of your query sequences can be in a single fasta file as input making it straight forward to run. One of the advantages of blastp over say MUSCLE or clustal is the use of the blosum matrix for aiding in alignment - only blastp gave the expected result for some harder to align areas of my proteins.

You also have a lot of control over the format of the output from blastp which can aid in the downstream analysis of the results.

ADD COMMENT
0
Entering edit mode

I would absolutely not advice this!!

Blast is NOT an alignment program/tool, it's a search tool! Moreover it is at best a local aligner while what you are looking for here is a global aligner (very different implementation than local)

And again, as said elsewhere here, doing progressive pairwise alignment will not always give you the correct final result.

Also the blosum argument is not valid as also the other aligners use such matrices to calculate their alignment.

ADD REPLY
0
Entering edit mode

I see your point, but in my particular use case blast was the best option and did provide mostly full-length pairwise alignments but in my case the proteins I was aligning were all closely related. You can adjust parameters to tailor the alignment results to meet your needs, but you are right to point out that Blast is a local aligner first and foremost.

ADD REPLY

Login before adding your answer.

Traffic: 2889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6