Question: Print line based on partial match
11 weeks ago, leo1985.arnab wrote:

I have two files with several hundred entries each. File 1 contains several 5-base sequences; File 2 has more entries, with longer sequences. The first 5 bases of the sequences in File 2 match the sequences in File 1. I tried some grep and awk methods, but they did not work for a partial match like this. For example:

File 1:

       ATGCC
       TTGCA
       GGAAC

........
........

File 2:

ATTTCGGGAAAATT
ATGCCTTAAGACCT
GGAACTAAGGGGA
............
............

Expected outcome:

ATGCCTTAAGACCT
GGAACTAAGGGGA

Any help is much appreciated! Thanks!

11 weeks ago, shenwei356 (China) wrote:

grep -f short.seq.file long.seq.file


Shenwei, thanks for the reply. But I already tried that grep option before posting the topic. It didn't work.

Reply written 11 weeks ago by leo1985.arnab

It definitely will work, but you have to put ^ in front of the 5-letter sequences in File1, so that each pattern only matches at the start of a line:

^ATGCC
^TTGCA
^GGAAC

If you don't want to use grep then any program that will separate based on user-defined barcodes - flexbar / etc - will do this for you.
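
If editing File1 by hand is inconvenient, the anchors could also be added on the fly. The following is only a sketch: the file names (file1.txt, file2.txt, anchored.txt) are hypothetical, the sample data are copied from the question, and it assumes one prefix per line. It also strips carriage returns, leading blanks and empty lines from the pattern file, since an empty pattern would match every line.

```shell
# Sample data from the question (file names are hypothetical).
printf '   ATGCC\n   TTGCA\n   GGAAC\n' > file1.txt
printf 'ATTTCGGGAAAATT\nATGCCTTAAGACCT\nGGAACTAAGGGGA\n' > file2.txt

# Build anchored patterns: remove carriage returns, drop blank lines,
# then replace any leading blanks with "^".
tr -d '\r' < file1.txt |
    sed -e '/^[[:space:]]*$/d' -e 's/^[[:space:]]*/^/' > anchored.txt

# Print the lines of file2.txt that start with any prefix from file1.txt.
result=$(grep -f anchored.txt file2.txt)
printf '%s\n' "$result"
```

With the sample data this prints ATGCCTTAAGACCT and GGAACTAAGGGGA, in the order they appear in file2.txt.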

Reply written 11 weeks ago (modified 11 weeks ago) by george.ry
11 weeks ago, Sparrow_kop (China) wrote:

Hi, because "The first 5 bases of sequences in file 2 matches that of file 1", 'grep -f file1 file2' is not very robust: the pattern may also occur somewhere other than the first 5 bases. So you can anchor the pattern with a regular expression in bash:

#!/bin/bash

# For each prefix in file1.txt, print the lines of file2.txt that start with it.
cat file1.txt | while read -r pattern
do
    grep "^$pattern" file2.txt
done

Sparrow_kop, the script is working. Thanks! But only if the sequences are in the same order in both files. I mistakenly wrote earlier that the total number of sequences in the two files is identical; actually it is not. Apologies. File 2, with the longer sequences, has many more sequences. Either way, is there a way to bypass the order in the search? Sorting is probably not a good idea with sequences.

Reply written 11 weeks ago by leo1985.arnab

Hi, I don't think I understand what you mean by 'same order'. Do you want to match the reverse complement? Or do you mean the order of the sequences, for example alphabetical order? If it is the latter, the order does not matter: on each iteration of the loop, grep tests the pattern against every sequence in file2, so you do not need to sort it. It is also fine that the two files contain different numbers of sequences.
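
One way to check the point about ordering is to run the loop against the same sequences in two different orders and compare the results. This is only a sketch: the file names and the helper function match_prefixes are hypothetical, and the sample data are taken from the question.

```shell
# Sample data from the question (file names are hypothetical).
printf 'ATGCC\nTTGCA\nGGAAC\n' > file1.txt
printf 'ATTTCGGGAAAATT\nATGCCTTAAGACCT\nGGAACTAAGGGGA\n' > file2.txt
# The same sequences as file2.txt, in a different order.
printf 'GGAACTAAGGGGA\nATTTCGGGAAAATT\nATGCCTTAAGACCT\n' > file2_reordered.txt

# Hypothetical helper: print every line of the file given as $1 that
# starts with a prefix from file1.txt.
match_prefixes() {
    while read -r pattern; do
        # grep exits non-zero when a prefix has no match; ignore that.
        grep "^$pattern" "$1" || true
    done < file1.txt
}

original=$(match_prefixes file2.txt)
reordered=$(match_prefixes file2_reordered.txt)

if [ "$original" = "$reordered" ]; then
    echo "same matches in both orders"
fi
```

For this sample data both runs give identical output, because the loop prints matches grouped by the order of the prefixes in file1.txt, not by the order of the lines in file2.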

Reply written 11 weeks ago by Sparrow_kop

This can also be written as (without cat):

while read pattern ; do grep "^$pattern" file2.txt ; done < file1.txt
Reply written 11 weeks ago (modified 11 weeks ago) by jrj.healey
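
For larger files, a single-pass awk approach may be worth considering, since the loops above re-read file2.txt once per prefix. This is a different technique from the grep loop and was not proposed in the thread. It is a sketch that assumes every prefix is exactly 5 characters long, as in the question, and uses hypothetical file names.

```shell
# Sample data from the question (file names are hypothetical).
printf 'ATGCC\nTTGCA\nGGAAC\n' > file1.txt
printf 'ATTTCGGGAAAATT\nATGCCTTAAGACCT\nGGAACTAAGGGGA\n' > file2.txt

# First file: store each prefix ($1 ignores surrounding blanks, NF skips empty lines).
# Second file: print lines whose first n characters are a stored prefix.
result=$(awk -v n=5 '
    NR == FNR { if (NF) prefixes[$1] = 1; next }
    substr($0, 1, n) in prefixes
' file1.txt file2.txt)
printf '%s\n' "$result"
```

Matches are printed in the order they appear in file2.txt, so neither file needs to be sorted.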