Question

Creating a consensus based on 'x' number of fasta files

1

Entering edit mode

9.6 years ago

matthew.m.hernandez ▴ 30

I am looking for a way to create a consensus sequence based on the occurrence of individual sequences for a given tag. For example, I have the following FASTA sequences in one file for a given tag (i.e. CTAGGCAC). The number in the header refers to their occurrence.

>A_2795
TCAGAAAGAACCTC

>B_10
TCAGAAAGCACCTC

>C_3
TCTGAAAGCACTTC

>Consensus
TCAGAAAGAACCTC

Manually this is not a problem. However, it would be great to be able to make use of perl or bash (including bio packages) to create a consensus based on occurrence of each sequence. I appreciate anyone's thoughts/help.

Thanks!

alignment sequence perl bash • 2.7k views

ADD COMMENT • link updated 9.6 years ago by Pierre Lindenbaum 166k • written 9.6 years ago by matthew.m.hernandez ▴ 30

score 1 · Answer 1 · 2016-04-05

for fun, using Bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

echo -e ">A_2795\nTCAGAAAGAACCTC\n>B_10\nTCAGAAAGCACCTC\n>C_3\nTCTGAAAGCACTTC" | java -jar ~/src/jvarkit-git/dist/bioalcidae.jar -F fasta -e 'var consensus=[];while(iter.hasNext()) { var seq=iter.next();out.printlnseq.name+"\t"+seq.sequence);for(var i=0;i< seq.length();++i) {while(consensus.length <= i) consensus.push({}); var b = seq.charAt(i);if(b in consensus[i]) {consensus[i][b]++;} else {consensus[i][b]=1;} } } out.print("Cons.\t"); for(var i=0;i< consensus.length;i++) {var best=0,base="N"; for(var b in consensus[i]) {if(consensus[i][b]>best) { best= consensus[i][b];base=b;}} out.print(base);} out.println();' 


A_2795      TCAGAAAGAACCTC
B_10         TCAGAAAGCACCTC
C_3           TCTGAAAGCACTTC
Cons.        TCAGAAAGCACCTC