Question

Regex doesn't match, why?

1

8.2 years ago

andreiareis1987 ▴ 40

Hi guys,

I made a script that works very well: it reads IDs from one file, compares them with the headers in a genome sequence file, and prints every matching sequence to another file.

I have run this script on different ID files and it was fine, until now! It seems that my regular expression doesn't match one specific ID and I don't know why!

# This is the regex
$key =~ m/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/o;

# $1 holds the captured ID only if the match above succeeded
my $header_sub = $1;

And the ID that doesn't match is:

>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication"
ATGGGATTGACTACTAAACCTCTATCTTTGAAAGTTAACGCCGCTTTGTTCGACGTCGACGGTACCATTATCATCTCTCAACCAGCCATTGCTGCATTCTGGAGGGATTTCGGTAAGGACAAACCTTATTTCGATGCTGAACACGTTATCCAAGTCTCGCATGGTTGGAGAACGTTTGATGCCATTGCTAAGTTCGCTCCAGACTTTGCCAATGAAGAGTATGTTAACAAATTAGAAGCTGAAATTCCGGTCAAGTACGGTGAAAAATCCATTGAAGTCCCAGGTGCAGTTAAGCTGTGCAACGCTTTGAACGCTCTACCAAAAGAGAAATGGGCTGTGGCAACTTCCGGTACCCGTGATATGGCACAAAAATGGTTCGAGCATCTGGGAATCAGGAGACCAAAGTACTTCATTACCGCTAATGATGTCAAACAGGGTAAGCCTCATCCAGAACCATATCTGAAGGGCAGGAATGGCTTAGGATATCCGATCAATGAGCAAGACCCTTCCAAATCTAAGGTAGTAGTATTTGAAGACGCTCCAGCAGGTATTGCCGCCGGAAAAGCCGCCGGTTGTAAGATCATTGGTATTGCCACTACTTTCGACTTGGACTTCCTAAAGGAAAAAGGCTGTGACATCATTGTCAAAAACCACGAATCCATCAGAGTTGGCGGCTACAATGCCGAAACAGACGAAGTTGAATTCATTTTTGACGACTACTTATATGCTAAGGACGATCTGTTGAAATGGTAA

I have tried several things, including deleting the ID and typing it again... I checked every step of my script, and it is here, at the match, that the ID "disappears"!

I would be very grateful for your help!

Cheers

perl regular-expression sequence • 2.2k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by andreiareis1987 ▴ 40

0

The match itself seems to work just fine, but when I run the script it does not retrieve the sequence that I want! :(

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by andreiareis1987 ▴ 40

0

Your regex works on a single line. For me, it works if I use it only on the header, without the sequence (ATGGGATTGACTACTAA...). Is the sequence part of the ID? Or did something go wrong in your script when splitting HEADER and SEQUENCE before the regex part?
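The single-line behaviour can be checked with a small sketch like the one below. It assumes `$key` holds either just the header or the header plus a sequence line; the shortened header, the sequence fragment, and the helper `extract_id` are made up for illustration and are not taken from the original script.

```perl
use strict;
use warnings;

# Same pattern as in the question (the /o flag is dropped; it is not needed here).
my $re = qr/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/;

# Hypothetical test strings: header only, and header plus one sequence line.
my $header_only = ">YER062C GPP2 SGDID:S000000864";
my $with_seq    = ">YER062C GPP2 SGDID:S000000864\nATGGGATTGACTACTAAACC";

# Returns the captured ID, or undef when the pattern does not match.
sub extract_id {
    my ($key) = @_;
    return $key =~ $re ? $1 : undef;
}

for my $key ($header_only, $with_seq) {
    my $id = extract_id($key);
    print defined $id ? "matched: $id\n" : "no match\n";
}
```

The header-only string matches and yields YER062C. The second string does not match, because `.` does not cross a newline and `$` (without `/m`) only matches at the very end of the string.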

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by thackl ★ 3.0k

0

Yes, the regex works on a single line; it is applied only to the header, and the $header_sub variable holds only the matched ID, for example YER062C. My script works very well for other files that contain more than 300 IDs, searching a genome file with more than 6000 sequences. The script retrieves the sequences for all the IDs in the file that contains YER062C, except for this one ID!

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by andreiareis1987 ▴ 40

0

Okay, but the problem is that I cannot reproduce the error. As I said, it depends on what is in $key. If I only use >YER062C or >YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication", your regex does work. So to be able to help, I need to know exactly what your script does, what $key contains, and how you read your files.
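One way to see exactly what a variable such as `$key` contains is to dump it with escapes turned on, so that invisible characters show up. This is only a sketch: the value assigned to `$key` below is a hypothetical example, not data from the thread.

```perl
use strict;
use warnings;
use Data::Dumper;

# Print strings with escapes so newlines, tabs, and carriage returns are visible.
$Data::Dumper::Useqq = 1;
# Print the bare value instead of a "$VAR1 = ..." assignment.
$Data::Dumper::Terse = 1;

# Hypothetical value: a header that still carries a Windows line ending.
my $key = ">YER062C GPP2\r\n";

my $shown = Dumper($key);
print $shown;
```

In the printed string the line ending appears as a literal `\r\n`, which makes stray whitespace or carriage returns easy to spot before the regex is applied.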

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by thackl ★ 3.0k

0

Can I send my script and my files to your email?

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by andreiareis1987 ▴ 40

1

Put the files in a Dropbox folder and share them if you want. And specify exactly what you want to select.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by Antonio R. Franco ★ 5.1k

1

Sure, for my email address see my profile.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by thackl ★ 3.0k

Ram · Answer 1 · 2016-02-13

3

8.2 years ago

Antonio R. Franco ★ 5.1k

Go to www.regex101.com and you will see why.

ADD COMMENT • link 8.2 years ago by Antonio R. Franco ★ 5.1k

0

WOW, I didn't know about this site :D

I will try it!

Thanks a lot.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by andreiareis1987 ▴ 40

Ram · Answer 2 · 2016-02-15

1

8.2 years ago

andreiareis1987 ▴ 40

Problem resolved!

Conclusion: always check your files for extra spaces and stuff ahaha
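A minimal sketch of that kind of cleanup, assuming the stray whitespace sits on the lines of the ID file (the thread does not say which file was affected). The helper `clean_id` and the sample ID lines are hypothetical.

```perl
use strict;
use warnings;

# Strip line endings (Unix or Windows) and surrounding whitespace from an ID.
sub clean_id {
    my ($id) = @_;
    $id =~ s/^\s+|\s+$//g;
    return $id;
}

# Hypothetical ID lines as they might come out of an ID file.
my @raw_ids = ("YER062C \r\n", "  YAL001C\n", "YBR020W\t\n");

# Build a lookup table of cleaned IDs.
my %wanted;
$wanted{ clean_id($_) } = 1 for @raw_ids;

# ID as it would be extracted from the FASTA header by the regex.
my $header_sub = "YER062C";
print exists $wanted{$header_sub} ? "found $header_sub\n" : "missing $header_sub\n";
```

Without the cleanup, the stored key would be "YER062C " plus a line ending, and the lookup for "YER062C" would fail even though the two look identical when printed.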

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by andreiareis1987 ▴ 40

Ram · Answer 3 · 2016-02-15

0

8.2 years ago

thackl ★ 3.0k

Just as an alternative to writing your own script, here is a straightforward option that is also Perl-based:

git clone https://github.com/BioInf-Wuerzburg/SeqFilter.git
cd SeqFilter
make  # just fetches some libraries, no root or anything required

bin/SeqFilter big.fasta --ids idx.txt --out big-filtered.fasta

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by thackl ★ 3.0k