Question: Extracting Sequences After "Motif" & Between Motifs In Multifasta File
Raghul (Italy) wrote 6.1 years ago:

Hi, I want to extract the sequence after a motif, say "TTTTTAAAAA", from a multi-FASTA file. I do not want the nucleotides before this motif. Is it also possible to extract the nucleotides between two motifs with grep, e.g. the nucleotides between TTTTTAAAA and AAAATTTT? I tried with grep, but I need the FASTA headers as well. Can anybody suggest a solution in grep (if possible), Perl, or Python?

Thanks, Raghul

parsing • 2.5k views
Modified 5.9 years ago by PoGibas • written 6.1 years ago by Raghul

The motif may occur several times within the same sequence. How do you want to deal with that?

Reply written 6.0 years ago by Manu Prestat
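
One possible way to handle repeated motifs, sketched here in Python since the question allows it, is to report every non-overlapping stretch between a start motif and the nearest following end motif. The function name `all_between` and the example sequence are made up for illustration, and the non-greedy policy is only one of several reasonable answers to the question above.

```python
import re

def all_between(seq, start="TTTTTAAAA", end="AAAATTTT"):
    """Return every non-overlapping stretch between a start motif
    and the nearest following end motif (non-greedy match)."""
    pattern = re.escape(start) + "(.+?)" + re.escape(end)
    return [m.group(1) for m in re.finditer(pattern, seq)]

# Hypothetical sequence containing the motif pair twice
seq = "TTTTTAAAACCCAAAATTTTGGTTTTTAAAAGGGAAAATTTT"
print(all_between(seq))  # ['CCC', 'GGG']
```

Replacing `.+?` with the greedy `.+` would instead return a single stretch running from the first start motif to the last end motif.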

Hello! I would like to do something similar. Did you find a way to complete your task?

Reply written 2.3 years ago by etarisal
Ying W (South San Francisco, CA) wrote 6.1 years ago:

I don't think this is possible with grep, but it can be done with a regex in Perl. Something along the lines of:

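use strict;
use warnings;

# Usage: perl script.pl input.fasta
my $seq = "";
while (my $line = <>) {   # for every line of the input file
  chomp $line;
  if ($line =~ /^>/) {    # if line starts with >, it is a header, so process the previous sequence
    print_between($seq);
    $seq = "";
    print "$line\n";      # print header
  }
  else {
    $seq .= $line;        # append sequence (joins wrapped lines)
  }
}
print_between($seq);      # process the last sequence

sub print_between {
  my ($s) = @_;
  # regex to match the motifs; greedy, so it runs to the last end motif in the sequence
  if ($s =~ /TTTTTAAAAA([ACGTN]+)AAAATTTT/) {
    print "$1\n";         # print sequence in between the motifs
  }
}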

Or something like that. (Warning: the code above is untested and should be treated as pseudocode.)
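
For comparison, the same idea can be sketched in Python, which the question also mentions. This is only an illustration under stated assumptions: the helper names `read_fasta` and `between_motifs` and the two example records are made up, the motifs are the pair from the question, and the match is non-greedy (it stops at the nearest end motif). Unlike the Perl outline, it prints a header only when its sequence contains both motifs.

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def between_motifs(seq, start="TTTTTAAAA", end="AAAATTTT"):
    """Return the bases between the first start motif and the nearest
    following end motif, or None if the pair is not found."""
    m = re.search(re.escape(start) + "(.+?)" + re.escape(end), seq)
    return m.group(1) if m else None

# Hypothetical two-record input; with a real file, pass open("input.fasta") instead
example = """>seq1
NNNTTTTTAAAACCC
AAAATTTTNNN
>seq2
ACGTACGT
""".splitlines()

for header, seq in read_fasta(example):
    hit = between_motifs(seq)
    if hit is not None:
        print(header)
        print(hit)
```

Joining the wrapped lines before matching matters here: in the example the end motif sits on a different line from the start motif, so a line-by-line search would miss it.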


Some (many?) versions of grep, such as the GNU grep included in Linux distributions, take the option "-P", meaning "interpret the pattern as a Perl regex". So if Perl can do it, so can grep.

Reply written 6.1 years ago by Neilfws
PoGibas (Vilnius) wrote 5.9 years ago:

The grep way:

  echo NNNTTTTTAAAACCCAAAATTTTNNN > sequence
  grep -o 'TTTTTAAAA[A-Z]*AAAATTTT' sequence
  TTTTTAAAACCCAAAATTTT
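
The grep call above prints the motifs along with the bases between them and does not carry the FASTA header. For the first part of the question (everything after a motif, with the header kept), a minimal Python sketch might look like the following. The function name `after_motif` and the two records are hypothetical, and each sequence is assumed to be already joined onto a single line.

```python
def after_motif(seq, motif="TTTTTAAAAA"):
    """Return everything downstream of the first occurrence of motif,
    or None if the motif is absent."""
    pos = seq.find(motif)
    return None if pos == -1 else seq[pos + len(motif):]

# Hypothetical records: header -> sequence already joined onto one line
records = {
    ">seq1": "NNNTTTTTAAAAACCCGGG",
    ">seq2": "ACGTACGT",
}

for header, seq in records.items():
    tail = after_motif(seq)
    if tail is not None:
        print(header)
        print(tail)
```

Only the first occurrence of the motif is used; sequences without the motif are skipped.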