Question

grep a column based on a string

2

Entering edit mode

6.1 years ago

vinayjrao ▴ 250

I have a file with around 20000 columns as gene names. I want to grep out the rpkm values for specific genes. Is there a way to grep out the column information?

sample gene17 gene92 gene1 ... gene20000

patient1 0.03569654 1.020565 0.0036522 ... 0.25247236

I only want gene72 for example, but it's not sorted in increasing order.

Thanks.

grep awk • 17k views

ADD COMMENT • link updated 6.1 years ago by shenwei356 8.4k • written 6.1 years ago by vinayjrao ▴ 250

1

Entering edit mode

6.1 years ago

shenwei356 8.4k

Try csvtk, (usage of csvtk cut).

For tab-delimited file: t.tsv

$ cat t.tsv 
sample  gene17  gene92  gene1   gene20000
patient1        0.03569654      1.020565        0.003652        0.25247236
patient2        0.13569654      1.320565        0.403652        0.95247236

Searching column(s)

$ csvtk cut -t -f sample,gene92 t.tsv                                                      
sample  gene92                                                                                               
patient1        1.020565                                                                                     
patient2        1.320565

$ csvtk cut -t -f sample,gene1,gene92 t.tsv                                                
sample  gene1   gene92
patient1        0.003652        1.020565
patient2        0.403652        1.320565

$ csvtk cut -t -f sample,gene000 t.tsv 
[ERRO] column "gene000" not existed in file: t.tsv

ADD COMMENT • link 6.1 years ago by shenwei356 8.4k

0

Entering edit mode

6.1 years ago

5heikki 11k

With awk (assuming tab-separated values):

awk 'BEGIN{OFS=FS="\t"}NR==1{for(i=1;i<=NF;i++){if($i=="geneName"){getline; print $i; exit}}}' inputFile.tsv

ADD COMMENT • link 6.1 years ago by 5heikki 11k

score 5 · Accepted Answer · 2018-02-26

5

Entering edit mode

6.1 years ago

michael.ante ★ 3.8k

Hi,

in order to find the column number you can use:

head -n 1 file | tr '\t' '\n' | cat -n | grep gene72

With head -n 1 , you get only the file's first line. With tr you replace the tab-separator by a new line. With cat -n, you print the input with line numbers on which you use finally grep to get the column of interest.

with the found number - let it be j - you can use cut:

cut -f 1,j file

Cheers,

Michael

ADD COMMENT • link 6.1 years ago by michael.ante ★ 3.8k

0

Entering edit mode

The solution worked perfectly. I might be sounding a bit greedy here, but is there a shorter way to this too?

ADD REPLY • link 6.1 years ago by vinayjrao ▴ 250

0

Entering edit mode

A one-liner would be something like:

cut -f 1,$(head -n 1 file | tr '\t' '\n' | cat -n | grep gene72 | cut -f 1) file

[not tested]

ADD REPLY • link 6.1 years ago by michael.ante ★ 3.8k

0

Entering edit mode

That didn't work. I'll just stick to the previous solution.

Thanks anyway.

ADD REPLY • link 6.1 years ago by vinayjrao ▴ 250

0

Entering edit mode

I've got it. There are leading spaces in the output:

cut -f 1,$(head -n 1 file | tr '\t' '\n' | cat -n | grep gene72 | cut -f 1| sed 's/^\s*//') file

ADD REPLY • link 6.1 years ago by michael.ante ★ 3.8k

0

Entering edit mode

I've made a bash function for getting the header of a file which does the same:

ch(){
cat $1 | head -n 1 | tr '\t' '\n' | nl -n ln
}
export -f ch

I've put this in my .bashrc

You would use this as ch myfile.txt | grep gene72

ADD REPLY • link 6.1 years ago by WouterDeCoster 47k