How can I summarize sequence search patterns using grep on a single file containing multiple sequences?
3
0
Entering edit mode
9.3 years ago
Alpha ▴ 30

Apologies!!!

I want to find a pattern in a sequence and count how many times it occurs. I know that grep can serve the purpose, but I have a single file with multiple sequences in FASTA format. I want the output (the number of pattern matches in each sequence) listed per sequence. How can we do this using awk?

Thanks a lot in advance for help!

sequence • 4.8k views
ADD COMMENT
3
Entering edit mode

Why is this addressed to Pierre? I like the direct approach though, cut out the middle men :)

ADD REPLY
1
Entering edit mode

Don't worry about the comments, people were just having a laugh because you addressed Pierre specifically and it's funny because he probably is the best person to ask about awk. Nothing other than good-hearted fun was intended.

ADD REPLY
0
Entering edit mode

I wouldn't have thought grep would be capable of this unless your FASTA has one line per sequence.

ADD REPLY
0
Entering edit mode

I found that seqinR does pattern searching very nicely for a single FASTA sequence. But I have multiple sequences (say 100) in a single file, and I want the results (number of patterns found) for each sequence separately in a tab-delimited file.

Thank you in advance for your help!

ADD REPLY
5
Entering edit mode
9.3 years ago
zlskidmore ▴ 290

Perhaps not the most efficient, but this Perl one-liner should do it:

cat INPUT_FILE | perl -ne 'if($_ =~ />/){print "\n$_"; next;} else {chomp $_; print $_;}' | sed 1d | perl -ne 'if($_ =~ />/){print "$_"; next;} else {$count = () = $_ =~ /PATTERN/gi; print "$count\n"; $count=0;}'

Description:

Print each sequence header followed by the corresponding sequence on the next line:

perl -ne 'if($_ =~ />/){print "\n$_"; next;} else {chomp $_; print $_;}'

Remove the extra leading newline introduced above:

sed 1d

If a header is encountered, print it; otherwise, print the number of times the pattern is seen:

perl -ne 'if($_ =~ />/){print "$_"; next;} else {$count = () = $_ =~ /PATTERN/gi; print "$count\n"; $count=0;}'
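For example, a quick check (assuming the pattern is ATG and the two records below are saved as test.fa):

>seq1
ATGCATG
CATG
>seq2
GGGG

Running the pipeline with PATTERN replaced by ATG and INPUT_FILE by test.fa prints each header followed by its (non-overlapping, case-insensitive) match count:

>seq1
3
>seq2
0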
ADD COMMENT
0
Entering edit mode

Thanks a lot Zlskidmore. It worked.

ADD REPLY
3
Entering edit mode
9.3 years ago
Ram 43k

Edit: I'm not Pierre either, so apologies for that!

You could probably use bioawk and a substitution count, gsub(/<pattern>/,""), to count the number of occurrences per sequence (gsub returns how many substitutions it made), assuming no two matches overlap.
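For instance, a minimal sketch, assuming bioawk is installed, the pattern is ATG, and the input file is input.fasta:

# print "name<TAB>count" per record; gsub's return value is the number of non-overlapping matches
bioawk -c fastx '{ n = gsub(/ATG/, "", $seq); print $name "\t" n }' input.fasta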

If you would like to deal with overlaps, use a Perl script instead. my @matches = $string =~ m/(?=pattern)/g (a zero-width lookahead, so overlapping hits are counted) with a scalar(@matches) should do it.
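A rough one-liner along those lines (the pattern ATGA and the file name input.fasta are placeholders; $/ = ">" slurps one FASTA record at a time):

# print "header<TAB>count", counting overlapping, case-insensitive matches per record
perl -ne 'BEGIN{ $/ = ">" } chomp; next unless /\S/; my ($hdr, @seq) = split /\n/; my $s = join "", @seq; my $n = () = $s =~ /(?=ATGA)/gi; print "$hdr\t$n\n"' input.fasta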

ADD COMMENT
2
Entering edit mode

ADD REPLY
1
Entering edit mode
9.3 years ago

Edit: Je suis pas Pierre ("I am not Pierre", or something like that; I speak English and German).

Just linearize the file (google "fasta linearize" for awk and biopieces methods...or just see a couple here) and then pipe that to grep.
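A rough sketch of that approach, assuming bash, a literal pattern ATG, and a file named input.fasta (awk glues the wrapped sequence lines back together; grep -o emits one line per non-overlapping match, so wc -l counts them per record):

# linearize to "header<TAB>sequence", then count matches record by record
awk '/^>/ { if (s) print h "\t" s; h = $0; s = ""; next } { s = s $0 } END { if (s) print h "\t" s }' input.fasta |
while IFS=$'\t' read -r h s; do
    printf '%s\t%d\n' "$h" "$(printf '%s' "$s" | grep -o 'ATG' | wc -l)"
done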

ADD COMMENT
0
Entering edit mode

But does grep -c count multiple times if multiple matches are found in the same line?

ADD REPLY
0
Entering edit mode

I think it'll just count each matching line once in that case, so your bioawk solution would be more appropriate.
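For what it's worth, a quick illustration of the difference on a made-up line:

# grep -c counts matching lines, so this prints 1 even though ATG occurs three times
printf 'ATGATGATG\n' | grep -c 'ATG'
# grep -o prints each (non-overlapping) match on its own line, so wc -l gives 3
printf 'ATGATGATG\n' | grep -o 'ATG' | wc -l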

ADD REPLY
0
Entering edit mode

Even mine doesn't address all contingencies - what if there are overlaps? Maybe a Perl script with a my @matches = $string =~ m/(?=pattern)/g and a scalar(@matches) might do it.

ADD REPLY
2
Entering edit mode

Yeah, I suspect the best method is biopython/bioperl based, but I guess we should wait for Pierre to reply.

ADD REPLY
5
Entering edit mode

ADD REPLY
3
Entering edit mode

Haha.. This should be the official meme of Biostars for an accepted answer :)

ADD REPLY
0
Entering edit mode

The Bio::Grep package (not part of bioperl) was designed exactly for this type of problem. - not Pierre

ADD REPLY
0
Entering edit mode

Should OP's post be edited to remove the first part, people will wonder why everyone is adding a disclaimer to their statements :)

ADD REPLY
