Question

How to grep words containing a hyphen/dash from file A in one column of file B

0

6.4 years ago

Ming Lu ▴ 30

Hi, I have a file A.bed that contains only gene names, such as these:

  chr1-1
  chr1-10
  chr1-102
  chr1-106
  chr1-11
  chr1-2
  chr1-3

and I know they are also all present in one column of B.bed:

chr1 startpos endpos chr1-1
chr1 startpos endpos chr1-10
chr1 startpos endpos chr1-102
chr1 startpos endpos chr1-106
chr1 startpos endpos chr1-11
chr1 startpos endpos chr1-2
chr1 startpos endpos chr1-3
chr2 startpos endpos chr2-234
chr12 startpos endpos chr12-23546

However, why does

  cut -f4 B.bed > C.bed # only use the gene name column
  comm -1 -2 A.bed C.bed

find all of them, while

  grep -w -f A.bed B.bed

finds only

  chr1-1
  chr1-2
  chr1-3

I cannot simply use comm, because it cannot show the whole rows of B.bed.

How could I use grep to get all the matching rows in B.bed?

Or, more generally, how could I extract all the rows of B.bed whose value in one column matches a word listed in another file?

ChIP-Seq • 3.9k views

ADD COMMENT • link 6.3 years ago by Ming Lu ▴ 30

0

Are the A and C files sorted?

comm -1 -2 <(sort A.bed) <(sort C.bed)
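comm compares two inputs that are already sorted in the same collation order, so unsorted input can silently drop matches. A minimal sketch of the check above, using small made-up files in a temporary directory (the file contents and the `workdir` variable are illustrative, not the poster's data):

```shell
# Illustrative data only: a few IDs, deliberately left unsorted.
workdir=$(mktemp -d)
printf 'chr1-2\nchr1-10\nchr1-1\n' > "$workdir/A.bed"
printf 'chr1-1\nchr2-234\nchr1-10\nchr1-2\n' > "$workdir/C.bed"

# Sort both inputs with the same locale so comm sees a consistent order.
LC_ALL=C sort "$workdir/A.bed" > "$workdir/A.sorted"
LC_ALL=C sort "$workdir/C.bed" > "$workdir/C.sorted"

# -1 -2 suppresses lines unique to either file, leaving only the shared lines.
common=$(LC_ALL=C comm -1 -2 "$workdir/A.sorted" "$workdir/C.sorted")
printf '%s\n' "$common"
rm -rf "$workdir"
```

Fixing `LC_ALL=C` for both sort and comm is a design choice here: it keeps the two tools from disagreeing about where a hyphen sorts.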

ADD REPLY • link 6.4 years ago by michael.ante ★ 3.8k

0

Yes, both files are sorted. comm gives the right result, but grep does not return the right number of lines.

ADD REPLY • link 6.4 years ago by Ming Lu ▴ 30

2

Sorry, I should have read it more carefully. Are there special characters in one of the files?

head A.bed | sed -n 'l'
head B.bed | sed -n 'l'
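`sed -n 'l'` prints each line in an unambiguous form, so invisible characters become visible: a carriage return shows up as `\r`, a tab as `\t`, and the end of each line is marked with `$`. A small sketch with a made-up line that has a Windows-style line ending:

```shell
# Made-up input: one ID followed by a carriage return, as in a file saved on Windows.
visible=$(printf 'chr1-1\r\n' | sed -n 'l')
# The \r before the closing $ is the kind of hidden character that breaks exact matching.
printf '%s\n' "$visible"
```

A clean line would be shown as `chr1-1$`, with nothing between the name and the `$`.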

Have you tried the join command?

join -1 4 -2 1 B.bed A.bed
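join, like comm, expects both inputs to be sorted on the join field. A sketch under that assumption, with made-up tab-separated rows and temporary file names chosen only for illustration:

```shell
workdir=$(mktemp -d)
# Illustrative inputs: B has the name in column 4, A has it in column 1.
printf 'chr1\t100\t200\tchr1-2\nchr1\t300\t400\tchr1-10\nchr2\t500\t600\tchr2-234\n' > "$workdir/B.bed"
printf 'chr1-10\nchr1-2\n' > "$workdir/A.bed"

# Sort each file on its join field, using tab as the field separator.
tab=$(printf '\t')
LC_ALL=C sort -t "$tab" -k4,4 "$workdir/B.bed" > "$workdir/B.sorted"
LC_ALL=C sort "$workdir/A.bed" > "$workdir/A.sorted"

# Join column 4 of B with column 1 of A; -o restores B's original column order.
joined=$(LC_ALL=C join -t "$tab" -1 4 -2 1 -o 1.1,1.2,1.3,1.4 "$workdir/B.sorted" "$workdir/A.sorted")
printf '%s\n' "$joined"
rm -rf "$workdir"
```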

ADD REPLY • link 6.4 years ago by michael.ante ★ 3.8k

1

6.4 years ago

mittu1602 ▴ 200

If it's OK for you to use awk, use the following command:

awk 'FNR==NR{a[$1]=$4;next}{if(a[$1]==""){a[$1]=0};printf "%s%s%s%s%s%s%s%s%s\n",$1,FS,$2,FS,$3,FS,$4,FS,a[$1]}' B.bed A.bed  > result1
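As written, the command above indexes B.bed by its first column (the chromosome) and then looks up the first column of A.bed. With the files from the question (names in column 4 of B.bed, a single column in A.bed), the lookup key may need to change. A hedged sketch of that variant, on made-up data:

```shell
workdir=$(mktemp -d)
# Made-up inputs mirroring the layout described in the question.
printf 'chr1-1\nchr1-10\n' > "$workdir/A.bed"
printf 'chr1\t1\t2\tchr1-1\nchr1\t3\t4\tchr1-10\nchr1\t5\t6\tchr1-102\n' > "$workdir/B.bed"

# First file (A.bed): remember each name. Second file (B.bed): print the rows
# whose 4th column is exactly one of the remembered names.
matched=$(awk -F '\t' 'NR==FNR { keep[$1]; next } $4 in keep' "$workdir/A.bed" "$workdir/B.bed")
printf '%s\n' "$matched"
rm -rf "$workdir"
```

Because the comparison is on the whole field, `chr1-1` does not pick up `chr1-102`, and no sorting is needed.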

ADD COMMENT • link 6.4 years ago by mittu1602 ▴ 200

1

6.4 years ago

cpad0112 21k

output:

$ grep -w -f ids.txt test.txt 
chr1    startpos    endpos  chr1-1
chr1    startpos    endpos  chr1-10
chr1    startpos    endpos  chr1-102
chr1    startpos    endpos  chr1-106
chr1    startpos    endpos  chr1-11
chr1    startpos    endpos  chr1-2
chr1    startpos    endpos  chr1-3

$ join  -1 1 -2 4 ids.txt test.txt 
chr1-1 chr1 startpos endpos
chr1-10 chr1 startpos endpos
chr1-102 chr1 startpos endpos
chr1-106 chr1 startpos endpos
chr1-11 chr1 startpos endpos
chr1-2 chr1 startpos endpos
chr1-3 chr1 startpos endpos

input:

$ cat ids.txt 
chr1-1
chr1-10
chr1-102
chr1-106
chr1-11
chr1-2
chr1-3

$ cat test.txt 
chr1    startpos    endpos  chr1-1
chr1    startpos    endpos  chr1-10
chr1    startpos    endpos  chr1-102
chr1    startpos    endpos  chr1-106
chr1    startpos    endpos  chr1-11
chr1    startpos    endpos  chr1-2
chr1    startpos    endpos  chr1-3
chr2    startpos    endpos  chr2-234
chr12   startpos    endpos  chr12-23546

ADD COMMENT • link 6.4 years ago by cpad0112 21k

1

You can modify the join output with

join -1 1 -2 4 -o 2.1,2.2,2.3,0 ids.txt test.txt | tr ' ' '\t'

The tr command replaces each space in join's default output with a tab.

ADD REPLY • link 6.4 years ago by michael.ante ★ 3.8k

1

join supports TSV output natively. The output of

join -t $'\t' -1 1 -2 4 -o 2.1,2.2,2.3,0 ids.txt test.txt

is the same as that of

join -1 1 -2 4 -o 2.1,2.2,2.3,0 ids.txt test.txt | tr ' ' '\t'

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

6.4 years ago

Inquisitive8995 ▴ 270

Hi, is the number of rows equal in both files? Try

grep -Fwf A.bed B.bed > Output.txt

ADD COMMENT • link 6.4 years ago by Inquisitive8995 ▴ 270

0

They are not equal: A.bed has 645 rows and B.bed has 33024 rows, but every entry in A.bed comes from one column of B.bed.

I think maybe the dash "-" breaks the word matching done by -w?

I tried your command, but grep -Fwf still does not find the remaining genes.
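For what it's worth, with `-w` grep treats only letters, digits and the underscore as word characters, so a hyphen counts as a word boundary rather than as part of the name. A small sketch on made-up lines (assuming GNU grep) shows that `chr1-1` is then not matched inside `chr1-10`, but is matched inside a longer hyphenated name:

```shell
# Made-up input; chr1-1-5 is a hypothetical name used only to show the boundary rule.
hits=$(printf 'chr1-1\nchr1-10\nchr1-1-5\n' | grep -w 'chr1-1')
printf '%s\n' "$hits"
```

So the dash on its own should not hide `chr1-10` when `chr1-10` is itself one of the patterns; at worst it lets a pattern match more lines than intended.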

ADD REPLY • link 6.4 years ago by Ming Lu ▴ 30

0

In your command "comm -1 -3 A.bed C.bed"

-1 will suppress column 1 (lines unique to FILE 1).
-3 will suppress column 3 (lines that appear in both files).

When using -3, you are actually suppressing the lines that match in A.bed and B.bed.

Please try using "comm -1 -2 A.bed B.bed"

ADD REPLY • link 6.4 years ago by Inquisitive8995 ▴ 270

0

That was just a typo in my post, not the actual problem.

ADD REPLY • link 6.4 years ago by Ming Lu ▴ 30

0

6.4 years ago

EagleEye 7.5k

grep -w -Ff File2.txt File1.txt > commonFile1File2.txt

ADD COMMENT • link 6.4 years ago by EagleEye 7.5k

0

6.4 years ago

Ming Lu ▴ 30

First, I changed all "-" to "_" and kept only the column I use for grep, but it made no difference.

All 654 rows of moVDR1220.txt should be among the 36551 rows of trytry.txt,

as moVDR1220.txt is the result of

# first transform enhancer.txt into enhancer.bed (move the name column, e.g. chr1-10, from column 1 to column 4)
# then
bedtools intersect -a enhancer.bed -b BBB.bed -wa | cut -f4 > moVDR1220.txt

and trytry.txt is the result of the following (wc -l gives 36551 for each of enhancer.txt, enhancer.bed, trytry.txt and trytry.cdt):

annotatePeaks.pl enhancer.txt hg19 -size 2000 -hist 10 -ghist -d 24hvitd/ 24heth/ > trytry.txt
more trytry.txt|cut -f1> trytry.txt

so the grep, join and comm results should all be 654.

My data:

homer $ more moVDR1220.txt|head
chr1_1
chr1_10
chr1_102
chr1_106
chr1_11
chr1_1140
chr1_115
chr1_12
chr1_123
chr1_14
homer$ more trytry.txt|head
Gene
chr1_1
chr1_10
chr1_100
chr1_1000
chr1_10000
chr1_10025
chr1_10028
chr1_10031
chr1_10037
homer$ grep -w -f moVDR1220.txt trytry.txt | wc -l
 180 
homer$ grep -w -f moVDR1220.txt trytry.txt | head
chr1_1
chr1_2
chr1_3
chr1_4 
chr1_5
chr1_6
chr1_75
chr1_76
chr1_8
chr1_9
homer$ join -1 1 -2 1 moVDR1220.txt trytry.txt | wc -l
 389
homer$ join -1 1 -2 1 moVDR1220.txt trytry.txt | head
chr1_1
chr1_10
chr1_102
chr1_106
chr1_11
chr1_1140
chr1_115
chr1_12
chr1_123
chr1_14
homer$ comm -1 -2 moVDR1220.txt trytry.txt| wc -l
 389

I know the problem now: the "-" had no effect; there was a mistake in the bedtools step.

But I still don't know why grep cannot do this kind of thing.
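When both files hold one name per line, as moVDR1220.txt and trytry.txt do here, whole-line matching with `-x` sidesteps the word-boundary rules of `-w` altogether. A minimal sketch on made-up names, which also assumes neither file carries trailing spaces or carriage returns (the file names `ids.txt` and `all.txt` are placeholders):

```shell
workdir=$(mktemp -d)
# Made-up single-column files; all.txt includes a header line like trytry.txt does.
printf 'chr1_1\nchr1_10\n' > "$workdir/ids.txt"
printf 'Gene\nchr1_1\nchr1_10\nchr1_100\nchr1_1000\n' > "$workdir/all.txt"

# -F: fixed strings, -x: a pattern must match the entire line,
# -f: read the patterns from a file, -c: print only the number of matching lines.
count=$(grep -c -Fxf "$workdir/ids.txt" "$workdir/all.txt")
printf '%s\n' "$count"
rm -rf "$workdir"
```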

ADD COMMENT • link 6.4 years ago by Ming Lu ▴ 30

score 2 · Accepted Answer · 2018-01-08

I found a good command that gets all 654 matched lines, without needing to sort first:

 awk -F '\t' 'NR==FNR{a[$1]=$1;next}; ($1==a[$1]){print $0}' a.bed b.bed > new.bed

This prints the matches in b.bed's order, with b.bed's columns.

 awk -F '\t' 'NR==FNR{a[$1]=$0;next}; ($1 in a){print a[$1]}' a.bed b.bed > new.bed

This prints the matches in b.bed's order, with a.bed's columns.
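The difference between the two commands is easier to see on a tiny example. The sketch below uses made-up two-column files and, as a small variation, the `in` test for both cases, which avoids creating empty array entries as a side effect of the lookup:

```shell
workdir=$(mktemp -d)
# Made-up files: both keyed on column 1, with different second columns.
printf 'chr1_2\tA2\nchr1_1\tA1\n' > "$workdir/a.bed"
printf 'chr1_1\tB1\nchr1_10\tB10\nchr1_2\tB2\n' > "$workdir/b.bed"

# Matching rows of b.bed: b.bed's order, b.bed's columns.
from_b=$(awk -F '\t' 'NR==FNR { a[$1] = $0; next } $1 in a' "$workdir/a.bed" "$workdir/b.bed")
# Corresponding rows of a.bed: b.bed's order, a.bed's columns.
from_a=$(awk -F '\t' 'NR==FNR { a[$1] = $0; next } $1 in a { print a[$1] }' "$workdir/a.bed" "$workdir/b.bed")
printf '%s\n---\n%s\n' "$from_b" "$from_a"
rm -rf "$workdir"
```

In both cases the row for `chr1_10` is skipped, because the comparison is on the whole first field rather than on a substring.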