Print line based on partial match
6.6 years ago

I have two files with several hundred entries each. File 1 contains 5-base sequences; file 2 has more entries, with longer sequences. The first 5 bases of some sequences in file 2 match a sequence in file 1, and I want to print those lines. I tried some grep and awk methods, but they did not work for a partial match like this. For example:

File 1:

       ATGCC
       TTGCA
       GGAAC

........
........

File 2:

ATTTCGGGAAAATT
ATGCCTTAAGACCT
GGAACTAAGGGGA
............
............

Expected outcome:

ATGCCTTAAGACCT
GGAACTAAGGGGA

Any help is much appreciated! Thanks!

6.6 years ago

grep -f short.seq.file long.seq.file


Shenwei, thanks for the reply. But I already tried that grep option before posting the topic. It didn't work.


It definitely will work, but you have to put ^ (the start-of-line anchor) in front of each 5-letter sequence in file 1:

^ATGCC
^TTGCA
^GGAAC

If you don't want to use grep, any program that separates reads based on user-defined barcodes (flexbar, etc.) will do this for you.
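The anchors do not have to be added by hand. A minimal sketch, assuming the file names short.seq.file and long.seq.file from the answer above, GNU grep (which reads patterns from standard input with -f -), and that file 1 may carry leading spaces, as in the question's listing, or Windows line endings. The demo data and the variable names workdir and matches are only for illustration:

```shell
#!/bin/bash
# Demo data copied from the question; in practice use your own files.
workdir=$(mktemp -d)
printf '       ATGCC\n       TTGCA\n       GGAAC\n' > "$workdir/short.seq.file"
printf 'ATTTCGGGAAAATT\nATGCCTTAAGACCT\nGGAACTAAGGGGA\n' > "$workdir/long.seq.file"

# Strip trailing and leading whitespace (including a trailing CR), drop
# empty lines, prepend ^ to each prefix, then pass the patterns to grep.
matches=$(sed -e 's/[[:space:]]*$//' -e 's/^[[:space:]]*//' -e '/^$/d' -e 's/^/^/' \
    "$workdir/short.seq.file" | grep -f - "$workdir/long.seq.file")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

Dropping empty lines matters here: an empty pattern passed to grep -f matches every line of the target file.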

6.6 years ago
Sparrow_kop ▴ 260

Hi, because the first 5 bases of the sequences in file 2 have to match those in file 1, 'grep -f file1 file2' is not robust: an unanchored pattern can also match at other positions in the sequence, not only in the first 5 bases. You can anchor each pattern with a regular expression in bash:

#!/bin/bash

cat file1.txt | while read pattern
do 
    grep "^$pattern" file2.txt
done

Sparrow_kop, the script is working. Thanks! But only if the sequences are in the same order in both files. I mistakenly wrote previously that the total number of sequences in the 2 files is identical; actually it is not. Apologies. File 2, with the longer sequences, has many more sequences. Either way, is there a way to bypass the order in the search? Sorting is probably not a good idea with sequences.


Hi, I don't think I understand what you mean by 'same order'. Do you mean you want to match the reverse complement? Or do you mean the order of the sequences, for example alphabetical order? If it is the latter, the order does not matter: on each iteration of the loop, grep tests the pattern against every sequence in file2, so you do not need to sort anything. It is also fine that the two files contain different numbers of sequences.
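The order-independence can also be made explicit with a lookup table. A sketch, assuming every prefix in file 1 is exactly 5 bases (as stated in the question) and the file names file1.txt and file2.txt used in the script above. awk loads the prefixes into an array, then tests the first 5 characters of each sequence in file 2 in a single pass. The demo data and the names workdir, hits and prefixes are only for illustration:

```shell
#!/bin/bash
workdir=$(mktemp -d)
# file1 lists the prefixes in a different order from file2 on purpose.
printf 'GGAAC\nTTGCA\nATGCC\n' > "$workdir/file1.txt"
printf 'ATTTCGGGAAAATT\nATGCCTTAAGACCT\nGGAACTAAGGGGA\n' > "$workdir/file2.txt"

# NR==FNR holds only while the first file is being read: store each prefix
# ($1 drops surrounding whitespace). For the second file, print a line
# when its first 5 characters are one of the stored prefixes.
hits=$(awk 'NR==FNR { if ($1 != "") prefixes[$1]; next }
            (substr($0, 1, 5) in prefixes)' "$workdir/file1.txt" "$workdir/file2.txt")

printf '%s\n' "$hits"
rm -rf "$workdir"
```

Matches come out in the order they appear in file 2, and file 2 is read once regardless of how many prefixes file 1 holds. The NR==FNR idiom assumes file 1 is not empty.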


This can also be written as (without cat):

while read pattern ; do grep "^$pattern" file2.txt ; done < file1.txt

Hey, one thing I want to ask. I need to treat every line of one file as a pattern (n patterns in total) and match those n patterns against every line of file2. Can you tell me how to do this?
