Question: Why doesn't my regex match?
4.7 years ago, andreiareis1987 (30) wrote:

Hi guys,

I wrote a script that works very well: it reads IDs from one file, compares them with the headers in a genome sequence file, and prints every matching sequence to another file.

I have run this script on several different ID files without problems, until now! My regular expression does not seem to match one specific ID, and I don't know why!

# This is the regex
$key =~ m/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/o;

my $header_sub = $1;

And the ID that doesn't match (YER062C, at the start of this header) is:

>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication"
ATGGGATTGACTACTAAACCTCTATCTTTGAAAGTTAACGCCGCTTTGTTCGACGTCGACGGTACCATTATCATCTCTCAACCAGCCATTGCTGCATTCTGGAGGGATTTCGGTAAGGACAAACCTTATTTCGATGCTGAACACGTTATCCAAGTCTCGCATGGTTGGAGAACGTTTGATGCCATTGCTAAGTTCGCTCCAGACTTTGCCAATGAAGAGTATGTTAACAAATTAGAAGCTGAAATTCCGGTCAAGTACGGTGAAAAATCCATTGAAGTCCCAGGTGCAGTTAAGCTGTGCAACGCTTTGAACGCTCTACCAAAAGAGAAATGGGCTGTGGCAACTTCCGGTACCCGTGATATGGCACAAAAATGGTTCGAGCATCTGGGAATCAGGAGACCAAAGTACTTCATTACCGCTAATGATGTCAAACAGGGTAAGCCTCATCCAGAACCATATCTGAAGGGCAGGAATGGCTTAGGATATCCGATCAATGAGCAAGACCCTTCCAAATCTAAGGTAGTAGTATTTGAAGACGCTCCAGCAGGTATTGCCGCCGGAAAAGCCGCCGGTTGTAAGATCATTGGTATTGCCACTACTTTCGACTTGGACTTCCTAAAGGAAAAAGGCTGTGACATCATTGTCAAAAACCACGAATCCATCAGAGTTGGCGGCTACAATGCCGAAACAGACGAAGTTGAATTCATTTTTGACGACTACTTATATGCTAAGGACGATCTGTTGAAATGGTAA

 

I have tried several things, including deleting the ID and typing it again. I checked every stage of my script, and it is at this match step that the record "disappears"!

I would be very grateful for any help!

Cheers

Your regex works on a single line. For me, it works if I apply it only to the header, without the sequence (ATGGGATTGACTACTAA...). Is the sequence part of the ID? Or did something go wrong when your script split the header from the sequence, before the regex step?

Reply written 4.7 years ago by thackl (2.8k)
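To illustrate the point about single lines, here is a minimal sketch. It reuses the regex from the question (only the /o flag is dropped) and applies it first to a header alone and then to a header with sequence attached. The helper name extract_id and the shortened test strings are assumptions made for this example, not part of the original script.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged except that the /o flag is dropped.
my $re = qr/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/;

# Hypothetical helper: returns the captured ID, or undef when there is no match.
sub extract_id {
    my ($key) = @_;
    return $key =~ $re ? $1 : undef;
}

# Shortened, made-up test strings based on the header in the question.
my $header = '>YER062C GPP2 SGDID:S000000864';
my $record = "$header\nATGGGATTGACTACTAAACC\n";

for my $key ($header, $record) {
    my $id = extract_id($key);
    print defined $id ? "match: $id\n" : "no match\n";
}
```

The second string fails because "." does not match a newline and, without the /m modifier, "$" only matches at the very end of the string (or just before a final newline). If $key can hold a whole record, matching the first line only, or adding /m, would be one way around that.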

Yes, the regex works on a single line; it is applied only to the header, and the $header_sub variable captures just the ID, for example YER062C. My script works very well with other files that contain more than 300 IDs, searching a genome file with more than 6000 sequences. For the ID file that contains YER062C, the script retrieves the sequences for all IDs except this one!

Reply written 4.7 years ago by andreiareis1987 (30)

Okay, but the problem is that I cannot reproduce the error. As I said, it depends on what is in $key. If I use only >YER062C or >YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication", your regex does work. So, to be able to help, I need to know exactly what your script does, what $key contains, and how you read your files.

Reply written 4.7 years ago by thackl (2.8k)
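One way to find out what $key (or an ID read from the ID file) really contains is to print it with the invisible characters made visible. The sketch below is only an illustration: the helper show_invisible and the sample values are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: make spaces, tabs and line endings visible.
sub show_invisible {
    my ($text) = @_;
    my %name = ("\r" => '\r', "\n" => '\n', "\t" => '\t', ' ' => '<SP>');
    $text =~ s/([\r\n\t ])/$name{$1}/g;
    return "[$text]";
}

# Made-up values: a clean ID, one with a trailing space,
# and one with a Windows-style line ending.
for my $value ("YER062C", "YER062C ", "YER062C\r\n") {
    print show_invisible($value), "\n";
}
```

Printing the value inside brackets makes a trailing space or a stray carriage return obvious, even though all three values look identical in a text editor.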

Can I send my script and my files to your email?

Reply written 4.7 years ago by andreiareis1987 (30)

Put the files in a Dropbox folder and share them if you want, and specify exactly what you want to select.

Reply written 4.7 years ago by Antonio R. Franco (4.5k)

Sure, see my profile for my email address.

Reply written 4.7 years ago by thackl (2.8k)
4.7 years ago, Antonio R. Franco (4.5k; Universidad de Córdoba, Spain) wrote:
Go to www.regex101.com and you will see why.
4.7 years ago, andreiareis1987 (30) wrote:

Problem solved!

Conclusion: always check your files for extra spaces and other stray characters, haha.
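A minimal sketch of guarding against this, assuming the IDs are read line by line and used as hash keys: strip leading and trailing whitespace (including the "\r" left behind by Windows line endings) before the lookup. The helper clean_id, the sample IDs, and the shortened header are made up for this example; the regex is the one from the question without /o.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace, including "\r".
sub clean_id {
    my ($id) = @_;
    $id =~ s/^\s+|\s+$//g;
    return $id;
}

# Made-up ID lines as they might be read from the ID file.
my @raw_ids = ("YAL001C\n", "YER062C \r\n", "\tYBR020W\n");

# Build the lookup table from the cleaned IDs.
my %wanted;
$wanted{ clean_id($_) } = 1 for @raw_ids;

# Shortened version of the header from the question.
my $header = '>YER062C GPP2 SGDID:S000000864';
if ($header =~ m/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/ && $wanted{$1}) {
    print "keep $1\n";
}
```

Note that chomp alone would not be enough here: it removes the trailing newline but leaves the space and the carriage return in place, so the key "YER062C \r" would never equal the captured "YER062C".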

4.7 years ago, andreiareis1987 (30) wrote:

Wow, I didn't know about this site :D

I will try it!

Thanks a lot.

4.7 years ago, andreiareis1987 (30) wrote:

The match seems to work just fine, but when I run my script it does not retrieve the sequence that I want! :(

4.7 years ago, thackl (2.8k; MIT) wrote:

Just as an alternative to writing your own script, here is a tool that is straightforward and also Perl-based:

git clone https://github.com/BioInf-Wuerzburg/SeqFilter.git
cd SeqFilter
make  # just fetches some libraries, no root or anything required

bin/SeqFilter big.fasta --ids idx.txt --out big-filtered.fasta