Question: Multi-Sequence Alignment For Many Groups Of Genes
6.5 years ago, biolab (1.2k) wrote:

Dear all, how can I make multiple sequence alignments for many groups of genes? Here is an example.

>gene1_human
ATTTGCGTGACTGACTGC
>Gene2_human
GCGCGCATGATCCGATGACTG
>gene3_human
TGATACGATGCTGACTGACTGAC
......

>gene1_fly
ATTTGCGTGACTCTGC
>Gene2_fly
GCGCATGATCCGATGACTG
>gene3_fly
TGATACGATGCTGACTGTGAC
......

>gene1_worm
ATTTGCGTGACTCTGaC
>Gene2_worm
GCGCATGATCCGATGccACTG
>gene3_worm
TgtgGATACGATGCTGACTGTGAC
...

>gene1_mouse
ATTTGCGTGACTCTGaC
>Gene2_mouse
GCGCATGATCCGATGccACTG
>gene3_mouse
TgtgGATACGATGCTGACTGTGAC
......

I need to compare each gene separately across these species. The output should look like the example below: all aligned sequences have the same length, with gaps marked as -. Does anyone know how to perform this analysis? Please give me some suggestions. Thank you very much!

Human   ATTTGCGTGACTGACTG-C
Mouse   ATTTGCGTGACT--CTGAC
Worm    ATTTGCGTGACT--CTGAC
Fly     ATTTGCGTGACT--CTG-C
modified 6.5 years ago by Pavel Senin (1.9k) • written 6.5 years ago by biolab (1.2k)

6.5 years ago, Pavel Senin (1.9k), Los Alamos, NM, wrote:

Will clustalw work for you?

CLUSTAL 2.1 multiple sequence alignment


gene1_worm       ----ATTTGCGTGACT--CTGAC----
gene1_mouse      ----ATTTGCGTGACT--CTGAC----
gene1_fly        ----ATTTGCGTGACT--CTGC-----
gene1_human      ----ATTTGCGTGACTGACTGC-----
gene3_fly        ----TGATACGATGCTGACTG--TGAC
gene3_worm       -TGTGGATACGATGCTGACTG--TGAC
gene3_mouse      -TGTGGATACGATGCTGACTG--TGAC
gene3_human      ----TGATACGATGCTGACTGACTGAC
Gene2_human      GCGCGCATGATCCGATGACTG------
Gene2_fly        --GCGCATGATCCGATGACTG------
Gene2_worm       --GCGCATGATCCGATGCCACTG----
Gene2_mouse      --GCGCATGATCCGATGCCACTG----
                       :*..   ..*  *:

Edit: yes, MUSCLE is another option, especially if the sequences vary in length.

Gene2_human      ---GCGCGCA----TGATCCGATG--ACTG
Gene2_fly        -----GCGCA----TGATCCGATG--ACTG
Gene2_worm       -----GCGCA----TGATCCGATGCCACTG
Gene2_mouse      -----GCGCA----TGATCCGATGCCACTG
gene3_human      ---TGATACGATGCTGACTGACTG--AC--
gene3_worm       TGTGGATACGATGCTGACTG--TG--AC--
gene3_mouse      TGTGGATACGATGCTGACTG--TG--AC--
gene3_fly        ---TGATACGATGCTGACTG--TG--AC--
gene1_human      ---ATTTGCG----TGACTGACTG---C--
gene1_fly        ---ATTTGCG----TGACTC--TG---C--
gene1_worm       ---ATTTGCG----TGACTC--TG--AC--
gene1_mouse      ---ATTTGCG----TGACTC--TG--AC--
                         *     ***     **   *
modified 6.5 years ago • written 6.5 years ago by Pavel Senin (1.9k)

Thanks! I am just worried that, because I have many genes, some of them may not be well aligned when all sequences are aligned together. For example, the last C in the fly and human gene1 sequences is not aligned very well. Ideally I could run a multiple sequence alignment for each gene in batch. What are your thoughts? I don't know the CLUSTAL algorithm.

modified 6.5 years ago • written 6.5 years ago by biolab (1.2k)
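
One way to get per-gene alignments is to split the combined FASTA into one group per gene before running any aligner. The following is only a minimal sketch, assuming every header follows the `gene_species` pattern from the question and that `gene1` and `Gene1` mean the same gene; the helper names (`parse_fasta`, `group_by_gene`, `to_fasta`) are made up for illustration.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_gene(records):
    """Group records by the gene part of a 'gene_species' header.

    Assumption: the species is whatever follows the last underscore,
    and gene names are compared case-insensitively.
    """
    groups = defaultdict(list)
    for header, seq in records:
        gene = header.rsplit("_", 1)[0].lower()
        groups[gene].append((header, seq))
    return dict(groups)


def to_fasta(records):
    """Format (header, sequence) pairs back into FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Four of the sequences from the question, used as sample input.
EXAMPLE = """>gene1_human
ATTTGCGTGACTGACTGC
>Gene2_human
GCGCGCATGATCCGATGACTG
>gene1_fly
ATTTGCGTGACTCTGC
>Gene2_fly
GCGCATGATCCGATGACTG
"""

groups = group_by_gene(parse_fasta(EXAMPLE))
for gene, recs in sorted(groups.items()):
    print(gene, [h for h, _ in recs])
```

Each group can then be written to its own file with `to_fasta` (for example `gene1.fasta`), giving one small input per gene for the aligner.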

Clustering, as a process, is somewhat different from multiple alignment. You can pre-process your large dataset (how many sequences?) with CD-HIT, which will cluster (i.e. partition) it and can handle large data. Then you can apply clustalw to each cluster to obtain multiple alignments.

written 6.5 years ago by Pavel Senin (1.9k)

Hi Pavel, I have ~3000 sequences. Which tool provides the cd-hit command? Could you please say a little bit more about the partitioning, or about what tools can be used to pre-process so many sequences? One further question: can I run CLUSTALX(W) from the command line? That way I could run it 3000 times. Thank you very much!

written 6.5 years ago by biolab (1.2k)

You could try to run clustalw on your set of 3K sequences as it is, I guess. CD-HIT is the software used to partition (cluster) a large dataset into groups of similar sequences. By configuring a similarity threshold and other parameters, you can adjust the way it forms those groups. Assuming that sequences within a group (a cluster) are similar, clustalw will perform multiple sequence alignment for them significantly faster. Yes, you can install clustalw on your computer, and you don't need to run it 3000 times, just once.

written 6.5 years ago by Pavel Senin (1.9k)
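
To make the CD-HIT step concrete: for nucleotide input the usual entry point is `cd-hit-est`, e.g. `cd-hit-est -i all.fasta -o clustered -c 0.9`, where `-c` is the identity threshold (the file names and the 0.9 value here are just placeholders). Alongside the representative sequences it writes a `clustered.clstr` file listing the members of each cluster. Below is a rough sketch of reading that file, assuming the usual `.clstr` layout; the sample text is hand-written for illustration and is not real CD-HIT output.

```python
import re

# Matches the sequence name between '>' and the '...' that CD-HIT appends.
_MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Return {cluster label: [sequence names]} from .clstr-style text."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None:
            match = _MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


# Hypothetical sample in the .clstr layout; lengths and identities are made up.
SAMPLE = """>Cluster 0
0\t19nt, >gene1_human... *
1\t17nt, >gene1_worm... at +/94.12%
>Cluster 1
0\t21nt, >Gene2_human... *
"""

for label, members in parse_clstr(SAMPLE).items():
    print(label, members)
```

Each list of names can then be used to pull the matching sequences out of the original FASTA and align them cluster by cluster.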

Hi Pavel, I have one more question. I have just installed Clustalw. However, after typing clustalw, a file input window popped up, then another window. It's a step-by-step mode. How can I run it in one go from the command line? I can prepare 3000 gene sequence files, but I don't know how to run clustalw in batch. When you have free time, could you please write me a command for running clustalw in batch (default parameters are OK)? Thank you in advance!

modified 6.5 years ago • written 6.5 years ago by biolab (1.2k)
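
For batch use, ClustalW can be driven non-interactively by passing options on the command line, e.g. `clustalw2 -INFILE=gene1.fasta -ALIGN -OUTFILE=gene1.aln`. A small sketch of looping over one FASTA file per gene follows. It assumes the executable is called `clustalw2` (on some installs it is `clustalw`), that the inputs end in `.fasta`, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def clustalw_command(infile, outfile, exe="clustalw2"):
    """Build a non-interactive ClustalW call with default parameters."""
    return [exe, f"-INFILE={infile}", "-ALIGN", f"-OUTFILE={outfile}"]


def align_all(indir, outdir, exe="clustalw2", dry_run=False):
    """Align every *.fasta file in indir, writing <name>.aln into outdir.

    With dry_run=True the commands are only collected, not executed.
    """
    indir, outdir = Path(indir), Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    commands = []
    for fasta in sorted(indir.glob("*.fasta")):
        cmd = clustalw_command(fasta, outdir / (fasta.stem + ".aln"), exe)
        commands.append(cmd)
        if not dry_run:
            # check=True stops the batch if ClustalW exits with an error.
            subprocess.run(cmd, check=True)
    return commands


print(clustalw_command("gene1.fasta", "gene1.aln"))
```

Calling `align_all("genes", "alignments")` would then process the whole directory in one go. With MUSCLE 3.x the equivalent call should be `muscle -in gene1.fasta -out gene1.aln`; newer MUSCLE releases use different option names, so check `muscle -h` for your version.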

IMO muscle is way better than clustal..

written 6.5 years ago by 5heikki (8.9k)

Sure! Let's add that example too.

written 6.5 years ago by Pavel Senin (1.9k)

Of course, these algorithms are designed for homologs. Somehow I'm getting the idea that you're trying to align unrelated sequences, which wouldn't make sense in almost any context. Moreover, if these are protein-coding genes, you should be aligning amino acids instead of nucleotides.

modified 6.5 years ago • written 6.5 years ago by 5heikki (8.9k)

Hi 5heikki, I need to align nucleotide sequences rather than protein sequences. I am using TargetScan to find miRNA targets in various species. The input file should be a multiple sequence alignment. Thanks!

written 6.5 years ago by biolab (1.2k)