Trying to use python regular expressions to filter fasta file sequence headers
1
0
Entering edit mode
9 weeks ago
Ash • 0

Hi,

Apologies if I do not follow the correct question formatting, this is my first time posting. My question is regarding the use of python regular expressions. I have a fasta file of sequences following the format:

>NODE_143195_length_100_cov_16076.000000
TTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCAG
TCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGGC
>NODE_143196_length_100_cov_15891.000000
CTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCA
GTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGG
>NODE_143197_length_100_cov_15696.000000
GCTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTC
AGTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCG

I am trying to filter by both length and coverage. I want to filter sequences less than 5000bp and less than 100 coverage. I have been trying different variations of the following line:

^.+cov=([5-9][0-9][0-9]|([1-9]\d{4}\d*)\..+$

But I cannot seem to make it work. If anyone can help me, would be greatly appreciated. Thanks

filter fasta QC python read • 148 views
ADD COMMENT
0
Entering edit mode
9 weeks ago

Are the length and coverage values taken from header? If so, keep it simple:

from Bio import SeqIO
import sys

for i in SeqIO.parse("file.fa","fasta"):
     isplit=i.id.rsplit("_")
     if (int(isplit[3])>20 and int(float(isplit[5]))>100) :
         print(">"+i.id,"\n",i.seq)

input and output:

$ cat file.fa 
>NODE_143195_length_10_cov_16000.000000
TTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCAGTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGGC
>NODE_143196_length_100_cov_15891.000000
CTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCAGTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGG
>NODE_143197_length_100_cov_15.000000
GCTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCAGTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCG

$ python test.py file.fa
>NODE_143196_length_100_cov_15891.000000 
 CTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCAGTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGG

You can use seqio.write instead of print as the object after parsing is still seqrecord.

ADD COMMENT

Login before adding your answer.

Traffic: 2154 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6