Question

Regex to select DNA sequences

4.1 years ago

loisveillat ▴ 10

Hello,

I'm trying to remove duplicates from a text file. Here is an extract:

 >KJ636215.1_Tripyla_glomerans 
ATGTCTAAGCACAGCCCTTGAATGGTAAAGCCGCGAATGGCTCATTACAACAGCCACAGTTTATTGGGTC TCCTTTTACTTGGATAACTGAGCTAATTGTTGAGCTAATACACGCACCAAAGCTTCGACCTCACGGAAGG AGCGCATTTATTAGAACAAAACCAATCGGACTTCGGTCCGTCCATTGGTGAATCTAAATAACTCGGCCGA TCGCATGGTCTCGCACCGGCGACGCACCTTTCAAATGTCTGCCTTATCACCTTTCGATGGTAGTTTATAC
>KJ636220.1_Chromadorina_bioculata 
ATGTCTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTACAACAGCCATAGTTTATTGGATC TAATATCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACACGCACTCAAGCCCCGACTTCGGAAAG GGCGCATTTATTAGAACAAGACCAATTGGCTTCGGCCATCTATTGGTGAATCTGAATAACTACGCAGATC GCACAGGCTTGTCCTGGCGACATATCCTTCAAGTGTCTGCCTTATCAACTGTCGATGGTAGTTTATTGGA
 >KJ636220.1_Chromadorina_bioculata 
ATGTCTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTACAACAGCCATAGTTTATTGGATC TAATATCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACACGCACTCAAGCCCCGACTTCGGAAAG GGCGCATTTATTAGAACAAGACCAATTGGCTTCGGCCATCTATTGGTGAATCTGAATAACTACGCAGATC GCACAGGCTTGTCCTGGCGACATATCCTTCAAGTGTCTGCCTTATCAACTGTCGATGGTAGTTTATTGGA CTACCATGGTTGTAACGGGTAACGGAGAATTAGGGTTCGACTCCGGAGAGGGAGCCTGAGATACGGCTAC
 >FJ040471.1_Chromadorina_sp 
ATGTGTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTATTCAGCCTCAATTTATTAGATCT AATCAGTTACTTGGATAACTGTTCAAAAGGAAGAGCTAAGACATGCCTCGAAAGTGTAGCGCAAGCTATA CTGCACTTCTTAGAAAAAACCGATTGGCTTCGGCCATCCATTGGTGAATCTTCTGAAATTCGCAGATCGC

To do this, I am writing a Python script. I am trying to build a regex that captures the accession number in a first group and the DNA sequence belonging to that accession number in a second group. Unfortunately, I can't get this regex to work.

Here is the beginning of my code:

import re
from collections import defaultdict
with open ("Mixed-Sequences.txt","r") as f1:
    for lignes in f1:
        lignes=lignes.rstrip("\n")
        match=re.search("^(>..........)_\S+\n([ATCG]+\n)+",lignes) 
        if match:
            print(match)

Can you help me, please?

Thanks

python regex dna fasta programming • 1.2k views

ADD COMMENT • link updated 4.1 years ago by Istvan Albert 100k • written 4.1 years ago by loisveillat ▴ 10

You had already used the code button to format your code but it also helps to format your input example with it as well. I have done this for you but if the format does not match what you actually have (if file is not plain fasta) then please edit original post again and change as needed.

ADD REPLY • link 4.1 years ago by GenoMax 141k

score 1 · Answer 1 · 2020-03-20

The reason your code doesn't work is that you are iterating over the file one line at a time, while your regex tries to match several lines at once. Each string you search contains only a single line, so the pattern can never match.
There are multiple ways around this, but I'd say the most elegant one is to use a ready-made library for parsing FASTA files, such as Biopython's Bio.SeqIO.
If all you need is to remove duplicate sequences, then you can do something like:

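import sys

from Bio import SeqIO

in_fasta = sys.argv[1]
out_fasta = sys.argv[2]

# Keep the first record seen for each ID.
records = {}
for rec in SeqIO.parse(in_fasta, "fasta"):
    if rec.id not in records:
        records[rec.id] = rec
    elif rec.seq != records[rec.id].seq:
        # Same ID but a different sequence:
        # you need to decide what you want to do in this case.
        # Placeholder behaviour: warn and keep the first record.
        print(f"Warning: conflicting sequences for {rec.id}", file=sys.stderr)

SeqIO.write(records.values(), out_fasta, "fasta")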
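
If you would still like to see a regex version, here is a minimal sketch of what it could look like. The key change is that the whole file is read into one string and searched with re.MULTILINE, so a single match can span the header line and the sequence lines that follow it. The names RECORD_RE, parse_records and drop_exact_duplicates are made up for this example. The pattern assumes headers of the form ">ACCESSION_Name", uppercase A/C/G/T/N only, and sequence lines that start with a base (spaces inside a line are tolerated, as in your extract). Treat it as an illustration under those assumptions, not as a general FASTA parser.

```python
import re

# Group 1: accession (everything after ">" up to the first underscore).
# Group 2: the rest of the header word.
# Group 3: one or more sequence lines, each ending in a newline or end of text.
RECORD_RE = re.compile(
    r"^[ \t]*>([^_\s]+)_(\S+)[ \t]*\n((?:[ACGTN][ACGTN \t]*(?:\n|\Z))+)",
    re.MULTILINE,
)


def parse_records(text):
    """Yield (accession, name, sequence) tuples from FASTA-like text."""
    for match in RECORD_RE.finditer(text):
        accession, name, raw_seq = match.groups()
        # Remove the spaces and line breaks inside the sequence block.
        yield accession, name, re.sub(r"\s+", "", raw_seq)


def drop_exact_duplicates(records):
    """Keep the first record for each (accession, sequence) pair."""
    seen = set()
    unique = []
    for accession, name, seq in records:
        key = (accession, seq)
        if key not in seen:
            seen.add(key)
            unique.append((accession, name, seq))
    return unique


if __name__ == "__main__":
    # Read the whole file at once instead of looping over its lines.
    with open("Mixed-Sequences.txt") as handle:
        text = handle.read()
    for accession, name, seq in drop_exact_duplicates(parse_records(text)):
        print(f">{accession}_{name}\n{seq}")
```

Note that this only drops records whose accession and sequence are both identical. Records that share an accession but differ in sequence are all kept, so you still have to decide how to handle those.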

score 0 · Answer 2 · 2020-03-20

4.1 years ago

Asaf 10k

I didn't fully understand what you're trying to do, but here are some suggestions: use Biopython to read the FASTA file. A great template to start with would be bionitio: https://github.com/bionitio-team/bionitio-python


score 0 · Answer 3 · 2020-03-20

Sequence manipulation tasks are very common. Before embarking on writing new code, I would recommend investigating the following (similarly named) toolkits, which may already contain the functionality you need:

seqkit by Wei Shen https://bioinf.shenwei.me/seqkit/usage/#rmdup

and

seqtk by Heng Li: https://github.com/lh3/seqtk