Regex to select DNA sequences
20 months ago
loisveillat ▴ 10

Hello,

I'm trying to remove duplicates from a text file. Here is an extract:

 >KJ636215.1_Tripyla_glomerans 
ATGTCTAAGCACAGCCCTTGAATGGTAAAGCCGCGAATGGCTCATTACAACAGCCACAGTTTATTGGGTC TCCTTTTACTTGGATAACTGAGCTAATTGTTGAGCTAATACACGCACCAAAGCTTCGACCTCACGGAAGG AGCGCATTTATTAGAACAAAACCAATCGGACTTCGGTCCGTCCATTGGTGAATCTAAATAACTCGGCCGA TCGCATGGTCTCGCACCGGCGACGCACCTTTCAAATGTCTGCCTTATCACCTTTCGATGGTAGTTTATAC
>KJ636220.1_Chromadorina_bioculata 
ATGTCTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTACAACAGCCATAGTTTATTGGATC TAATATCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACACGCACTCAAGCCCCGACTTCGGAAAG GGCGCATTTATTAGAACAAGACCAATTGGCTTCGGCCATCTATTGGTGAATCTGAATAACTACGCAGATC GCACAGGCTTGTCCTGGCGACATATCCTTCAAGTGTCTGCCTTATCAACTGTCGATGGTAGTTTATTGGA
 >KJ636220.1_Chromadorina_bioculata 
ATGTCTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTACAACAGCCATAGTTTATTGGATC TAATATCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACACGCACTCAAGCCCCGACTTCGGAAAG GGCGCATTTATTAGAACAAGACCAATTGGCTTCGGCCATCTATTGGTGAATCTGAATAACTACGCAGATC GCACAGGCTTGTCCTGGCGACATATCCTTCAAGTGTCTGCCTTATCAACTGTCGATGGTAGTTTATTGGA CTACCATGGTTGTAACGGGTAACGGAGAATTAGGGTTCGACTCCGGAGAGGGAGCCTGAGATACGGCTAC
 >FJ040471.1_Chromadorina_sp 
ATGTGTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTATTCAGCCTCAATTTATTAGATCT AATCAGTTACTTGGATAACTGTTCAAAAGGAAGAGCTAAGACATGCCTCGAAAGTGTAGCGCAAGCTATA CTGCACTTCTTAGAAAAAACCGATTGGCTTCGGCCATCCATTGGTGAATCTTCTGAAATTCGCAGATCGC

To do this, I am writing a Python script. I am trying to build a regex that captures the accession number in a first group and the corresponding DNA sequence in a second group. Unfortunately, I can't get this regex to work.

Here is the beginning of my code:

import re
from collections import defaultdict
with open ("Mixed-Sequences.txt","r") as f1:
    for lignes in f1:
        lignes=lignes.rstrip("\n")
        match=re.search("^(>..........)_\S+\n([ATCG]+\n)+",lignes) 
        if match:
            print(match)

Can you help me, please?

Thanks

python regex dna fasta programming • 548 views

You had already used the code button to format your code, but it helps to format your input example with it as well. I have done this for you. If the format does not match what you actually have (i.e. the file is not plain FASTA), please edit the original post and change it as needed.

20 months ago
liorglic ▴ 460

The reason your code doesn't work is that you are iterating over the file one line at a time while trying to match a pattern that spans multiple lines.
There are several ways around this, but I'd say the most elegant one is to parse the FASTA file with a ready-made library such as Bio.SeqIO.
If all you need is to remove duplicate sequences, then you can do something like:

from Bio import SeqIO
import sys
in_fasta = sys.argv[1]
out_fasta = sys.argv[2]

records = {}
for rec in SeqIO.parse(in_fasta, "fasta"):
    if rec.id not in records:
        # first time this ID is seen: keep the record
        records[rec.id] = rec
    elif rec.seq != records[rec.id].seq:
        # same ID but a different sequence:
        # you need to decide what you want to do in this case
        pass

SeqIO.write(records.values(), out_fasta, "fasta")
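
If you would rather stay with a regex, the multi-line problem described above can be avoided by matching against the whole file content instead of one line at a time. The following is only a minimal sketch using the standard library, not a replacement for a real FASTA parser. The names `SAMPLE`, `RECORD` and `dedupe` are made up for illustration, the pattern tolerates leading spaces before `>` as an assumption about the input, and keeping the longest sequence when one ID appears with different sequences is just one possible policy:

```python
import re

# Hypothetical sample shaped like the extract in the question:
# the second ID appears twice, the second time with a longer sequence.
SAMPLE = """>KJ636215.1_Tripyla_glomerans
ATGTCTAAGC
ACAGCCCTTG
>KJ636220.1_Chromadorina_bioculata
ATGTCTAAGA
 >KJ636220.1_Chromadorina_bioculata
ATGTCTAAGA
ATAAACCGAA
"""

# Group 1: the header ID (text after ">" up to the first whitespace).
# Group 2: every following non-empty line that is not itself a header.
RECORD = re.compile(
    r"^[ \t]*>(\S+)[^\n]*\n((?:(?![ \t]*>)[^\n]+(?:\n|\Z))*)",
    re.MULTILINE,
)

def dedupe(text):
    """Return {header_id: sequence}, keeping the longest sequence per ID."""
    records = {}
    for match in RECORD.finditer(text):
        header_id = match.group(1)
        # Remove line breaks and stray spaces inside the sequence block.
        seq = re.sub(r"\s+", "", match.group(2))
        # Assumed policy: on an ID collision, keep the longer sequence.
        if len(seq) > len(records.get(header_id, "")):
            records[header_id] = seq
    return records

unique = dedupe(SAMPLE)
for header_id, seq in unique.items():
    print(f">{header_id}\n{seq}")
```

With a real file you would pass the result of `f1.read()` to `dedupe` rather than looping over lines. Reading everything into memory is fine for small files, but for large ones a streaming parser such as Bio.SeqIO is the safer choice.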
20 months ago
Asaf 8.6k

I didn't fully understand what you're trying to do, but here are some suggestions: use Biopython to read the FASTA file. A great template to start with would be bionitio: https://github.com/bionitio-team/bionitio-python

20 months ago

Sequence manipulation tasks are very common. Before embarking on writing new code, I would recommend investigating the following (similarly named) toolkits, which may already contain the functionality you need:

seqkit by Wei Shen https://bioinf.shenwei.me/seqkit/usage/#rmdup

and

seqtk by Heng Li: https://github.com/lh3/seqtk
