Question: grep a column based on a string
vinayjrao (JNCASR, India) wrote 4 months ago:

I have a file with around 20,000 columns whose headers are gene names. I want to pull out the RPKM values for specific genes. Is there a way to grep out a column by its name?

sample     gene17     gene92     gene1 ... gene20000

patient1     0.03569654     1.020565     0.0036522 ... 0.25247236

For example, I only want gene72, but the columns are not sorted by gene number.

Thanks.

michael.ante (Austria/Vienna) wrote 4 months ago:

Hi,

To find the column number, you can use:

head -n 1 file | tr '\t' '\n' | cat -n | grep -w gene72

head -n 1 keeps only the file's first line. tr replaces each tab separator with a newline, so every column name ends up on its own line. cat -n prefixes each line with its line number, and grep -w then picks out the line for the column of interest; -w matches gene72 only as a whole word, so gene720 is not reported as well.

With the number you found (call it j), you can use cut:

cut -f 1,j file

Cheers,

Michael
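
As a worked illustration of the two steps above, here is a minimal sketch on a made-up table. The file contents, the variable names (file, listing, j, selected) and the extra gene720 column are assumptions added for the example, not part of the original answer. It uses grep -w so that gene720 is not picked up together with gene72, and awk to read the number from the cat -n listing.

```shell
# Hypothetical example table mirroring the question; gene720 is included on purpose.
file=$(mktemp)
printf 'sample\tgene17\tgene72\tgene720\n' > "$file"
printf 'patient1\t0.1\t0.2\t0.3\n' >> "$file"

# Step 1: list the header fields with their numbers and keep the gene72 entry.
# -w makes grep match gene72 as a whole word, so gene720 is skipped.
listing=$(head -n 1 "$file" | tr '\t' '\n' | cat -n | grep -w 'gene72')
printf '%s\n' "$listing"

# Step 2: read the number (the first whitespace-separated field) into j,
# then cut the sample column plus column j.
j=$(printf '%s\n' "$listing" | awk '{print $1}')
selected=$(cut -f "1,$j" "$file")
printf '%s\n' "$selected"
rm -f "$file"
```

Reading the number with awk also sidesteps the padding that cat -n puts in front of it.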


The solution worked perfectly. I might be sounding a bit greedy here, but is there a shorter way to do this too?

Reply written 4 months ago by vinayjrao

A one-liner would be something like:

cut -f 1,$(head -n 1 file | tr '\t' '\n' | cat -n | grep gene72 | cut -f 1) file

[not tested]

Reply written 4 months ago by michael.ante

That didn't work. I'll just stick to the previous solution.

Thanks anyway.

Reply written 4 months ago by vinayjrao

I've got it. There are leading spaces in the output:

cut -f 1,$(head -n 1 file | tr '\t' '\n' | cat -n | grep -w gene72 | cut -f 1 | sed 's/^[[:space:]]*//') file
Reply written 4 months ago by michael.ante
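
The leading spaces come from cat -n, which right-aligns the line numbers. One way to avoid them altogether, sketched here on a made-up table, is to let grep do the numbering: grep -n prints each match as NUMBER:line with no padding, and -x requires the whole line to match, so gene720 is not confused with gene72. The file contents and the variable names (file, col, out) are assumptions for illustration only.

```shell
# Hypothetical example file; replace it with your own tab-separated table.
file=$(mktemp)
printf 'sample\tgene17\tgene72\tgene720\n' > "$file"
printf 'patient1\t0.1\t0.2\t0.3\n' >> "$file"

# grep -n prefixes each matching line with "NUMBER:" (no padding);
# -x only accepts lines that are exactly gene72.
col=$(head -n 1 "$file" | tr '\t' '\n' | grep -nx 'gene72' | cut -d: -f1)

# The same lookup folded into a single cut command.
out=$(cut -f "1,$(head -n 1 "$file" | tr '\t' '\n' | grep -nx 'gene72' | cut -d: -f1)" "$file")
printf '%s\n' "$out"
rm -f "$file"
```

If the column name is not present, the command substitution is empty and cut receives only "1,", so it is worth checking that col is non-empty before relying on the output.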

I've made a bash function for getting the header of a file which does the same:

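ch(){
# Print the header fields of a tab-separated file, one per line, numbered.
# -b a numbers every line, so empty header fields do not shift the count.
head -n 1 "$1" | tr '\t' '\n' | nl -b a -n ln
}
export -f ch

I've put this in my .bashrc.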

You would use this as ch myfile.txt | grep gene72

Reply written 4 months ago by WouterDeCoster
shenwei356 (China) wrote 4 months ago:

Try csvtk (see the usage of csvtk cut).

For a tab-delimited file, t.tsv:

$ cat t.tsv 
sample  gene17  gene92  gene1   gene20000
patient1        0.03569654      1.020565        0.003652        0.25247236
patient2        0.13569654      1.320565        0.403652        0.95247236

Selecting column(s) by name:

$ csvtk cut -t -f sample,gene92 t.tsv
sample  gene92
patient1        1.020565
patient2        1.320565

$ csvtk cut -t -f sample,gene1,gene92 t.tsv                                                
sample  gene1   gene92
patient1        0.003652        1.020565
patient2        0.403652        1.320565

$ csvtk cut -t -f sample,gene000 t.tsv 
[ERRO] column "gene000" not existed in file: t.tsv
5heikki (Finland) wrote 4 months ago:

With awk (assuming tab-separated values):

awk 'BEGIN{OFS=FS="\t"}NR==1{for(i=1;i<=NF;i++){if($i=="geneName"){getline; print $i; exit}}}' inputFile.tsv
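
As written, the command above prints the value from the first data row only and then exits. For a table with several patients, a variant along the following lines may be closer to what the question asks for. It is a sketch under assumptions: the example table, the variable names (input, values, gene, col) and the choice to print the sample column next to the gene column are illustrative, not part of the original answer.

```shell
# Hypothetical example table; substitute your own tab-separated file.
input=$(mktemp)
printf 'sample\tgene17\tgene72\tgene720\n' > "$input"
printf 'patient1\t0.1\t0.2\t0.3\n' >> "$input"
printf 'patient2\t0.4\t0.5\t0.6\n' >> "$input"

# On the header line, remember the index of the field that equals the gene name;
# then print the first column and that column for every line (header included).
values=$(awk -v gene="gene72" '
    BEGIN { FS = OFS = "\t" }
    NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == gene) col = i
        if (!col) { print "column not found: " gene > "/dev/stderr"; exit 1 }
    }
    { print $1, $col }
' "$input")
printf '%s\n' "$values"
rm -f "$input"
```

Because the comparison is $i == gene, only an exact header match is accepted, and a missing column produces an error message and a non-zero exit status instead of empty output.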