Removal of primers from matched sequences
1
0
Entering edit mode
3.8 years ago
adrian18_07 ▴ 10

I have a list of two matched sequences:

[[('---------------C----C---GT----GTR-GGK---AC-TGM-GGA-GGW--CATTGTCGAA-CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGACTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAARAACGTAGCGAAATGCGATACTTGGTGTGAATTGCARAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGGATGGCCTAAATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAGTCGG--GAACACC-CGCTGAATTTAAGCATATCAATAAGCGGAGG', 'GGKAARKWAAAAAGTCGTAACAAGGTTTCCGT-AGG-TGAACCTG-CGGAAGG-ATCATTGTCGAAACATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGACTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAAGAACGTAGCGAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGGATGGCCTAAATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAGT---YAGAAC-CCACG-----TT----------------------', 1312.2000000000025, 0, 743)], 
[('------------C------------CC-TGWAGGK---AC-TGCGGA-GGW--CATTGTCGAA-CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGAC-TGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAAR-AACGTAGCGAAATGCGATACTTGGTGTGAATTGCAR-AATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGR-ATGGCCTAAW-TTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAG-TCGG--GAA-CACCCGCTGA-ATTTAAGCATATCAATAAGCGGARGAA', 'KAAGTATAAAGTCGTAACAAGGTTTCCGT--AGG-TGAACCTGCGGAAGG-ATCATTGTCGAAACATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGA-YTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAA-GAACGTAGCGAAATGCGATACTTGGTGTGAATTGCA-GAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCG-GATGGCCTAA-ATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCA-GTCA-KT---YAGAAMC-CCC----AMAT-----C----C--T-----------', 1304.0000000000023, 0, 747)]]

I would like to remove the primers from these sequences. Starting from the middle of each sequence, I want to search to the left and to the right for the first "-" character, and then delete that character and everything beyond it (towards the ends of the sequence). For example, for the first match I would like to get:

[('CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGACTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAARAACGTAGCGAAATGCGATACTTGGTGTGAATTGCARAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGGATGGCCTAAATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAGTCGG', 'ATCATTGTCGAAACATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGACTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAAGAACGTAGCGAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGGATGGCCTAAATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAGT', 1312.2000000000025, 0, 743)]

And the same for the others.

Thanks for any answer.

biopython • 648 views

I am reasonably certain bbduk.sh from BBMap suite can do this. A guide is available here.

3.8 years ago
Joe 21k

Your data structure is heavily nested and a little confusing, but I think the code below does what you describe. It reproduces your expected result for the first example. The second result is very short, though, because that alignment has "-" characters quite near the centre of the string, so you may need to rethink your strategy.

data = []  # paste the nested list from the question here
for entry in data:
    new_strings = []
    for string in entry[0][:2]:
        # Cut both halves at the same index so no character is dropped or duplicated.
        mid = round(len(string) / 2)
        first_half, second_half = string[:mid], string[mid:]
        # find() returns -1 when there is no "-", so fall back to the full length.
        end = second_half.find("-")
        if end == -1:
            end = len(second_half)
        new_strings.append(first_half[first_half.rfind("-") + 1:] + second_half[:end])
    print([(new_strings[0], new_strings[1]) + entry[0][2:]])

Gives me:

[('CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGACTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAARAACGTAGCGAAATGCGATACTTGGTGTGAATTGCARAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGGATGGCCTAAATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAGTCGG', 'ATCATTGTCGAAACATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACGCGGGGTGGCCCCGGCTGCCTCGCGCGGAGGTGCTGCGGCTGAGTGCGCAAACTAGCTGCGCGCACGCTGTCCGTGCCACCTCCACTAACAGAACCCCGGCGCGGACTGCGCCAAGGAATAAAAAACGAATGAGAGCGAGCGCGCCCCCCTCGCCCCGGAGACGGTGCGCGATGGTGTGTGCCTCGCTGTCCATTGATAAACTAAACGACTCTCGGCAACGGATATCTCGGCTCTCGCATCGATGAAGAACGTAGCGAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAATCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACGCAACGTCGCCGCCAACCCCACCCCTAGGGGCGGGAAGTTGGGGGCGGACTCTGGCCTCCCGTGCGCCTCGGCGCGCGGATGGCCTAAATTTCAGCTCCTGGCGAGGATCGCCACGACAAGCGGTGGTTTTTTGAACTAAGGACCTCGGGTGTTGTCGTGCGGCCTCCCGGAGGGAACGGACCCTGTGCGCTCGCGCACCATCCTATCGAGACCCCAGGTCAGT', 1312.2000000000025, 0, 743)]
[('AACGTAGCGAAATGCGATACTTGGTGTGAATTGCAR', 'GAACGTAGCGAAATGCGATACTTGGTGTGAATTGCA', 1304.0000000000023, 0, 747)]
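
The short second result comes from the "nearest gap to the midpoint" rule itself. As a rough sketch only, the same rule can be wrapped in a small function and compared with one possible alternative, keeping the longest gap-free stretch. The names `central_run` and `longest_run` and the toy sequence are made up for illustration, and the midpoint here uses integer division rather than `round`, so it can differ by one position on odd-length strings. Whether the longest stretch is biologically the right thing to keep is an assumption you would need to check against your primers.

```python
def central_run(seq, gap="-"):
    """Return the gap-free stretch of seq around its midpoint."""
    mid = len(seq) // 2
    # rfind returns -1 when there is no gap to the left, which makes start 0.
    start = seq.rfind(gap, 0, mid) + 1
    # First gap at or after the midpoint; -1 means there is none.
    end = seq.find(gap, mid)
    if end == -1:
        end = len(seq)
    return seq[start:end]


def longest_run(seq, gap="-"):
    """Return the longest gap-free stretch of seq (the first one on ties)."""
    return max(seq.split(gap), key=len)


# Hypothetical toy sequence with gaps close to the centre.
toy = "--ACGTACGTAC-GG-TTAACCGG--"
print(central_run(toy))  # GG
print(longest_run(toy))  # ACGTACGTAC
```

On the toy sequence the midpoint rule keeps only the two bases between the central gaps, which is the same effect seen in the second alignment above.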