How can I summarize sequence search patterns using grep on a single file containing multiple sequences?
3
0
Entering edit mode
9.3 years ago
Alpha ▴ 30

Apologies!!!

I want to find a pattern in a sequence and count how many times it occurs. I know that grep can serve the purpose, but I have a single file with multiple sequences in FASTA format. I want the output (the number of pattern matches in each sequence) listed per sequence. How can we do this using awk?

Thanks a lot in advance for help!

sequence • 4.8k views
ADD COMMENT
3
Entering edit mode

Why is this addressed to Pierre? I like the direct approach though, cut out the middle men :)

ADD REPLY
1
Entering edit mode

Don't worry about the comments, people were just having a laugh because you addressed Pierre specifically and it's funny because he probably is the best person to ask about awk. Nothing other than good-hearted fun was intended.

ADD REPLY
0
Entering edit mode

I wouldn't have thought grep would be capable of this unless your FASTA has one line per sequence.

ADD REPLY
0
Entering edit mode

I found that seqinR does pattern searching very nicely for a single FASTA sequence. But I have multiple sequences (say 100) in a single file, and I want the results (number of patterns found) for each sequence separately in a tab-delimited file.

Thank you in advance for your help!

ADD REPLY
5
Entering edit mode
9.3 years ago
zlskidmore ▴ 290

Perhaps not the most efficient, but this Perl one-liner should do it:

cat INPUT_FILE | perl -ne 'if($_ =~ />/){print "\n$_"; next;} else {chomp $_; print $_;}' | sed 1d | perl -ne 'if($_ =~ />/){print "$_"; next;} else {$count = () = $_ =~ /PATTERN/gi; print "$count\n"; $count=0;}'

Description:

Print each sequence header followed by the corresponding sequence on the next line:

perl -ne 'if($_ =~ />/){print "\n$_"; next;} else {chomp $_; print $_;}'

Remove the extra leading newline introduced above:

sed 1d

If a header is encountered, print it; otherwise, print the number of times the pattern is seen:

perl -ne 'if($_ =~ />/){print "$_"; next;} else {$count = () = $_ =~ /PATTERN/gi; print "$count\n"; $count=0;}'
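For example, a quick check (assuming the pattern is ATG and the two records below are saved as test.fa):

>seq1
ATGCATG
CATG
>seq2
GGGG

Running the pipeline with PATTERN replaced by ATG and INPUT_FILE by test.fa prints each header followed by its (non-overlapping, case-insensitive) match count:

>seq1
3
>seq2
0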
ADD COMMENT
0
Entering edit mode

Thanks a lot Zlskidmore. It worked.

ADD REPLY
3
Entering edit mode
9.3 years ago
Ram 43k

Edit: I'm not Pierre either, so apologies for that!

You could probably use bioawk and a substitution count, gsub(/<pattern>/,""), to count the number of occurrences per sequence (gsub returns how many substitutions it made), assuming no two matches overlap.
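For instance, a minimal sketch, assuming bioawk is installed, the pattern is ATG, and the input file is input.fasta:

# print "name<TAB>count" per record; gsub's return value is the number of non-overlapping matches
bioawk -c fastx '{ n = gsub(/ATG/, "", $seq); print $name "\t" n }' input.fasta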

If you would like to deal with overlaps, use a Perl script instead. my @matches = $string =~ m/(?=pattern)/g (a zero-width lookahead, so overlapping hits are counted) with a scalar(@matches) should do it.
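A rough one-liner along those lines (the pattern ATGA and the file name input.fasta are placeholders; $/ = ">" slurps one FASTA record at a time):

# print "header<TAB>count", counting overlapping, case-insensitive matches per record
perl -ne 'BEGIN{ $/ = ">" } chomp; next unless /\S/; my ($hdr, @seq) = split /\n/; my $s = join "", @seq; my $n = () = $s =~ /(?=ATGA)/gi; print "$hdr\t$n\n"' input.fasta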

ADD COMMENT
2
Entering edit mode

ADD REPLY
1
Entering edit mode
9.3 years ago

Edit: Je suis pas Pierre ("I am not Pierre", or something like that; I speak English and German).

Just linearize the file (google "fasta linearize" for awk and biopieces methods...or just see a couple here) and then pipe that to grep.
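A rough sketch of that approach, assuming bash, a literal pattern ATG, and a file named input.fasta (awk glues the wrapped sequence lines back together; grep -o emits one line per non-overlapping match, so wc -l counts them per record):

# linearize to "header<TAB>sequence", then count matches record by record
awk '/^>/ { if (s) print h "\t" s; h = $0; s = ""; next } { s = s $0 } END { if (s) print h "\t" s }' input.fasta |
while IFS=$'\t' read -r h s; do
    printf '%s\t%d\n' "$h" "$(printf '%s' "$s" | grep -o 'ATG' | wc -l)"
done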

ADD COMMENT
0
Entering edit mode

But does grep -c count multiple times if multiple matches are found in the same line?

ADD REPLY
0
Entering edit mode

I think it'll just count each matching line once in that case, so your bioawk solution would be more appropriate.
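For what it's worth, a quick illustration of the difference on a made-up line:

# grep -c counts matching lines, so this prints 1 even though ATG occurs three times
printf 'ATGATGATG\n' | grep -c 'ATG'
# grep -o prints each (non-overlapping) match on its own line, so wc -l gives 3
printf 'ATGATGATG\n' | grep -o 'ATG' | wc -l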

ADD REPLY
0
Entering edit mode

Even mine doesn't address all contingencies - what if there are overlaps? Maybe a Perl script with a my @matches = $string =~ m/(?=pattern)/g and a scalar(@matches) might do it.

ADD REPLY
2
Entering edit mode

Yeah, I suspect the best method is biopython/bioperl based, but I guess we should wait for Pierre to reply.

ADD REPLY
5
Entering edit mode

ADD REPLY
3
Entering edit mode

Haha.. This should be the official meme of Biostars for an accepted answer :)

ADD REPLY
0
Entering edit mode

The Bio::Grep package (not part of bioperl) was designed exactly for this type of problem. - not Pierre

ADD REPLY
0
Entering edit mode

Should OP's post be edited to remove the first part, people will wonder why everyone is adding a disclaimer to their statements :)

ADD REPLY
