Regex doesn't match, why?
8.2 years ago

Hi guys,

I made a script that works very well: it takes IDs from one file, compares them with a genome sequence file, and prints each matching record to another file.

I have run this script on different ID files and it was fine, until now! It seems that my regular expression doesn't match one specific ID and I don't know why!

# This is the regex
$key =~ m/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/o;

my $header_sub = $1;

And the ID that doesn't match is:

>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication"
ATGGGATTGACTACTAAACCTCTATCTTTGAAAGTTAACGCCGCTTTGTTCGACGTCGACGGTACCATTATCATCTCTCAACCAGCCATTGCTGCATTCTGGAGGGATTTCGGTAAGGACAAACCTTATTTCGATGCTGAACACGTTATCCAAGTCTCGCATGGTTGGAGAACGTTTGATGCCATTGCTAAGTTCGCTCCAGACTTTGCCAATGAAGAGTATGTTAACAAATTAGAAGCTGAAATTCCGGTCAAGTACGGTGAAAAATCCATTGAAGTCCCAGGTGCAGTTAAGCTGTGCAACGCTTTGAACGCTCTACCAAAAGAGAAATGGGCTGTGGCAACTTCCGGTACCCGTGATATGGCACAAAAATGGTTCGAGCATCTGGGAATCAGGAGACCAAAGTACTTCATTACCGCTAATGATGTCAAACAGGGTAAGCCTCATCCAGAACCATATCTGAAGGGCAGGAATGGCTTAGGATATCCGATCAATGAGCAAGACCCTTCCAAATCTAAGGTAGTAGTATTTGAAGACGCTCCAGCAGGTATTGCCGCCGGAAAAGCCGCCGGTTGTAAGATCATTGGTATTGCCACTACTTTCGACTTGGACTTCCTAAAGGAAAAAGGCTGTGACATCATTGTCAAAAACCACGAATCCATCAGAGTTGGCGGCTACAATGCCGAAACAGACGAAGTTGAATTCATTTTTGACGACTACTTATATGCTAAGGACGATCTGTTGAAATGGTAA

I have tried several things, including deleting the ID and typing it again... I checked every step of my script, and it is here at the match that the ID "disappears"!

I would be very grateful for your help!

Cheers

perl regular-expression sequence • 2.2k views

The match seems to work just fine, but when I run the script it does not retrieve the sequence that I want! :(


Your regex works on a single line. For me, it works if I use it only on the header (without the sequence ATGGGATTGACTACTAA...). Is the sequence part of the ID? Or did something go wrong when splitting HEADER and SEQUENCE in your script before the regex part?


Yes, the regex works on a single line; it is applied only to the header, and the $header_sub variable holds only the matched ID, for example YER062C. My script works very well for other files that contain more than 300 IDs, searching a genome file with more than 6000 sequences. For the other file that contains YER062C, the script retrieves the sequences for all IDs except this one!


Okay, but the problem is that I cannot reproduce the error. As I said, it depends on what is in $key. If I only use >YER062C or >YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication", your regex does work. So to be able to help, I need to know exactly what your script does, what $key contains, and how you read your files.
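
One way to see what $key really contains is to run the pattern on the header alone and then print the string with its invisible characters made visible. The following is only a minimal sketch: `show_hidden` and `extract_id` are hypothetical helper names, not part of the original script, and the header is shortened for readability.

```perl
use strict;
use warnings;

# Hypothetical helper: make whitespace and non-printable characters visible.
sub show_hidden {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;    # carriage return (Windows line endings)
    $text =~ s/\n/\\n/g;    # newline
    $text =~ s/\t/\\t/g;    # tab
    # Anything else outside printable ASCII is shown as a hex escape.
    $text =~ s/([^\x20-\x7E])/sprintf('\\x{%X}', ord $1)/ge;
    return "[$text]";
}

# Same pattern as in the question, applied to a header line only.
# Returns the captured ID, or undef when the pattern does not match.
sub extract_id {
    my ($header) = @_;
    return $header =~ m/^>([A-Z]+[0-9]+[A-Z]+(?:-[A-Z])*).+$/ ? $1 : undef;
}

my $key = '>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930';
my $id  = extract_id($key);
print defined $id ? "captured: $id\n" : "no match\n";

# An ID with a trailing space and a Windows line ending looks identical
# on screen, but not once the hidden characters are printed.
print show_hidden("YER062C \r\n"), "\n";
```

Printing both the captured ID and the ID read from the ID file through such a helper makes stray spaces or line endings obvious. Note also that the trailing `.+` in the pattern requires at least one more character after the ID on the same line.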


Can I send my script and my files to your email?


Put the files in a Dropbox folder and share them if you want, and specify exactly what you want to select.


Sure, for my email address see my profile.

8.2 years ago
Go to www.regex101.com and you will see why.

Wow, I didn't know about this site :D

I will try it!

Thanks a lot.

8.2 years ago

Problem resolved!

Conclusion: always check the files for extra spaces and stuff, ahaha.
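
For anyone who hits the same problem, here is a minimal sketch of cleaning the IDs before comparing them. It assumes the culprit was trailing whitespace or a Windows line ending in the ID file, which the thread does not spell out; `clean_id` is a hypothetical helper name and the example IDs other than YER062C are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the line ending and surrounding whitespace from an ID.
sub clean_id {
    my ($id) = @_;
    $id =~ s/\r?\n\z//;       # drop a Unix or Windows line ending
    $id =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $id;
}

# IDs as they might come out of an ID file, with stray spaces and a CRLF.
my @raw_ids = ("YAL001C\n", "YER062C \r\n", "  YBR020W\n");

# Build the lookup table from cleaned IDs only.
my %wanted;
$wanted{ clean_id($_) } = 1 for @raw_ids;

print "YER062C found\n" if $wanted{'YER062C'};
```

Without the cleaning step, the hash key would be "YER062C \r\n" and a lookup for "YER062C" would silently fail, even though both look the same in most editors.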

8.2 years ago
thackl ★ 3.0k

Just as an alternative to writing your own script, here is a straightforward tool that is also Perl-based:

git clone https://github.com/BioInf-Wuerzburg/SeqFilter.git
cd SeqFilter
make  # just fetches some libraries, no root or anything required

bin/SeqFilter big.fasta --ids idx.txt --out big-filtered.fasta