Trim sequences based on alignment in python
1
1
Entering edit mode
2.9 years ago
Shred ▴ 460

I'm trying to edit an MSA (Multiple Sequence Alignment) file generated by ClustalW, to trim sequences before the consensus one, using BioPython. xxx refers to other bases not relevant here

Here's the example I/O :

INPUT

ITS_primer_fw               --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxx
PRL2010                     ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxx
BBF32_3                     ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxx
BBFCG32                     CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxx


EXPECTED OUTPUT

ITS_primer_fw               CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxx
PRL2010                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxx
BBFCG32                     CGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxx


The documented code for AlignIO describes just a way to extract sequences by treating the alignment as an array. In this example

align = AlignIO.read(input_file, "clustal")
sub_alignment = align[:,20:]


I was able to extract a subalignment made by all the sequences (:) starting from the 20th nucleotide. I'm looking for a way to replace the 20 in the example with the position of the first nucleotide of the consensus sequence.

Any answers including some cline software to trim easly as requested are well accepted. Will be great if python coded or for UNIX.

biopython python alignment • 2.1k views
0
Entering edit mode

By consensus, do you mean you want the first position that is fully occupied in all columns? A true consensus sequence would have characters from the start of the alignment, so defining it from the "first character of the concensus" would just give you the data you already have.

0
Entering edit mode

Yes, first position occupied in all the columns.

0
Entering edit mode

Can you provide you actual alignment please, your example is good for explaining, but not so much for testing.

0
Entering edit mode

Got just the alignment done against the first two sequences.

CLUSTAL 2.1 multiple sequence alignment

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      CCCACACACGGGGAACCCGCAAGACCAAACGAACACCGGACGACAGATCA
PRL2010                     ----CGCACGACCCATCCG--GGAACGGCGGGAAACCGCAAGCCC-ACGG

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      TCAACACTCTGC-GATCACA-AAACGATCAAACAAACAACCAGACCCCCA
PRL2010                     CAAAAACTCCGCTGGCAACACAAATCATCACACAAATGATCACACACAAA

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      A--AACGGGGACCCGGTCGAAATACATCAGTACAAGCAGACCACAACCAG
PRL2010                     ACGATCAAACACCAGACTCCGAAGAGTCCGGCGAAATCAACAGCAGA-AG

ITS_primer_fw               --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTT
PRL2010                     ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTT
********** *******

ITS_primer_fw               CTC-----------------------------------------------
RBL67ITS_full_sequence      CTCAAACCACCACGCCACACCACACACCCCCACCCGGCG-ACAACCGGAC
PRL2010                     CTCAAACCACCACGCACCCGCACGCGAGCCCACCCACCCCACCAGGGGAC
***

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      AGGAACCCATGGCGGGCACCCGAACGCCAACCCCCCAACACTAGGAGGGC
PRL2010                     GGGCGACCGGCACGGGGCACGGAACACCAGCCACGCCAAACGGCGCGGG-

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      CGTGGCGGCCAGGGAACCCAACAGCATGCCCATACC---------CGCCC
PRL2010                     -GTGGCGGTCCGGGAACCCAAAAGCATGCCCGCACCACTCCTGGACGCAC

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      C-CACAACAGGGACGC---AACG--------ATCCTTCCACGCCAGCACG
PRL2010                     CGCACGACCGGCACGCCCGAAGGCCGACTCCACTCTTCCACACCAGCAAC

ITS_primer_fw               --------------------------------------------------
RBL67ITS_full_sequence      CCCGACACCCCAGGAAGA---AAACCCCGGGAAC-CATGCCAGGCACTCA
PRL2010                     CCCCACACACCCGGCACATCCGGGGCAAGGGAGCACACACCGGCCACCAG

ITS_primer_fw               ----------------------------
RBL67ITS_full_sequence      ATAACAT---------------------
PRL2010                     ACAGTGGCCTGAAATCTCCGTAGAAAGG

3
Entering edit mode
2.9 years ago
Joe 19k

This should work. Simply finds the index of the first column that doesn't contain a gap character.

import sys
from Bio import AlignIO

# Usage: python scriptname.py alignment format

for col in range(aln.get_alignment_length()):
if not "-" in aln[:, col]:
position = col
print("First full column is {}".format(col))
break

print("New alignment:")
print(aln[:, position:])

0
Entering edit mode

Thanks a lot for the help. Works properly. So col is referred to column into the array, which is the equivalent in term of rows? I find AlignIO documentation really lackful for help (or maybe is just me).

Your last print statement just print out all the letters inside the position found, not the entire row. Needs to be edited into

print(aln[:,position::])

1
Entering edit mode

The syntax aln[x, y] treats the alignment as a grid of sorts. In python : is shorthand for 'all elements of this dimension'. So, aln[:, 0] would mean print the first column of all the sequences in the alignment. Conversely, aln[0, :] would mean print all the characters of the first sequence (remembering that python uses 0-based indexing). aln[:, :] would print the full alignment.

If you wanted to iterate each row, you would do for row in len(aln). AlignIO treats the length of the alignment as the number of sequences (it's a little counterintuitive sometimes). This is why you need the method get_alignment_length() if you want to know the actual length of the sequences.

As for the print statement, I don't believe its wrong? Can you explain what you mean?

I used this example input file:

 10 50
KM894618.1 ---------- ---------- ---------- --TCTTTGCA TTTATTACGG
KU508975.1 AAATTCTTCG ATATTGGCTG AAAGATCCCT CTTCTTTGCA TTTATTACGA
KC747175.1 AAACTCTCCG ATACTGGTTG AAAGATGCTT CTTCTTTGCA TTTATTACGA
KF632783.1 AAGTTCTGCA AGGCTGGATA CAAGATGTTC CGTCTTTACA TTTATTGCGG
GU135030.1 AAACTCTCCG ATACTGGTTG AAAGATGCTT CTTCTTTGCA TTTATTACGA
HM989726.1 AGGCTCTTCG CTATTGGATA AAAGATGCTT CCTCTTTGCA TTTATTAAGA
KF163819.1 ---------- ---------- ---GATGTTC CGTCTTTGCM TTTATTGCGA
MG225316.1 ---------- ---------- ---------T CCTCTTTGCA TTTATTAAGA
JN895697.1 ---------- ---------- -------CTT CCTCTTTGCA TTTATTAAGA
MF350103.1 ---------- ------GGTA AAAGACGCCT CCTCTTTGTA TTTATTAAGA


And the script result is:

First full column is 32
New alignment:
SingleLetterAlphabet() alignment with 10 rows and 18 columns
TCTTTGCATTTATTACGG KM894618.1
TCTTTGCATTTATTACGA KU508975.1
TCTTTGCATTTATTACGA KC747175.1
TCTTTACATTTATTGCGG KF632783.1
TCTTTGCATTTATTACGA GU135030.1
TCTTTGCATTTATTAAGA HM989726.1
TCTTTGCMTTTATTGCGA KF163819.1
TCTTTGCATTTATTAAGA MG225316.1
TCTTTGCATTTATTAAGA JN895697.1
TCTTTGTATTTATTAAGA MF350103.1


Which appears to be right by my estimation. The python syntax list[num:] uses range notation. If no final number is given, num: means from num to the end of the object.

1
Entering edit mode

Ops, I'm sorry. Maybe I forget to add the : initially when I wrote the code into Pycharm, that's why it doesn't work. I've used this syntax because I thought it's necessary to use a range starting from position to : (which is literally a bad way to intend the whole row). Thanks for the explanation.