grep a column based on a string
3
2
Entering edit mode
6.1 years ago
vinayjrao ▴ 250

I have a file with around 20000 columns as gene names. I want to grep out the rpkm values for specific genes. Is there a way to grep out the column information?

sample     gene17     gene92     gene1 ... gene20000

patient1     0.03569654     1.020565     0.0036522 ... 0.25247236

I only want gene72 for example, but it's not sorted in increasing order.

Thanks.

grep awk • 17k views
ADD COMMENT
5
Entering edit mode
6.1 years ago
michael.ante ★ 3.8k

Hi,

in order to find the column number you can use:

head -n 1 file | tr '\t' '\n' | cat -n | grep gene72

With head -n 1 , you get only the file's first line. With tr you replace the tab-separator by a new line. With cat -n, you print the input with line numbers on which you use finally grep to get the column of interest.

with the found number - let it be j - you can use cut:

cut -f 1,j file

Cheers,

Michael

ADD COMMENT
0
Entering edit mode

The solution worked perfectly. I might be sounding a bit greedy here, but is there a shorter way to this too?

ADD REPLY
0
Entering edit mode

A one-liner would be something like:

cut -f 1,$(head -n 1 file | tr '\t' '\n' | cat -n | grep gene72 | cut -f 1) file

[not tested]

ADD REPLY
0
Entering edit mode

That didn't work. I'll just stick to the previous solution.

Thanks anyway.

ADD REPLY
0
Entering edit mode

I've got it. There are leading spaces in the output:

cut -f 1,$(head -n 1 file | tr '\t' '\n' | cat -n | grep gene72 | cut -f 1| sed 's/^\s*//') file
ADD REPLY
0
Entering edit mode

I've made a bash function for getting the header of a file which does the same:

ch(){
cat $1 | head -n 1 | tr '\t' '\n' | nl -n ln
}
export -f ch

I've put this in my .bashrc

You would use this as ch myfile.txt | grep gene72

ADD REPLY
1
Entering edit mode
6.1 years ago

Try csvtk, (usage of csvtk cut).

For tab-delimited file: t.tsv

$ cat t.tsv 
sample  gene17  gene92  gene1   gene20000
patient1        0.03569654      1.020565        0.003652        0.25247236
patient2        0.13569654      1.320565        0.403652        0.95247236

Searching column(s)

$ csvtk cut -t -f sample,gene92 t.tsv                                                      
sample  gene92                                                                                               
patient1        1.020565                                                                                     
patient2        1.320565

$ csvtk cut -t -f sample,gene1,gene92 t.tsv                                                
sample  gene1   gene92
patient1        0.003652        1.020565
patient2        0.403652        1.320565

$ csvtk cut -t -f sample,gene000 t.tsv 
[ERRO] column "gene000" not existed in file: t.tsv
ADD COMMENT
0
Entering edit mode
6.1 years ago
5heikki 11k

With awk (assuming tab-separated values):

awk 'BEGIN{OFS=FS="\t"}NR==1{for(i=1;i<=NF;i++){if($i=="geneName"){getline; print $i; exit}}}' inputFile.tsv
ADD COMMENT

Login before adding your answer.

Traffic: 2254 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6