Question: extracting a column by grepping particular rows
0
gravatar for vinayjrao
4 months ago by
vinayjrao100
JNCASR, India
vinayjrao100 wrote:

Hi, I have two files -

file_a

gene_name sample_1 sample_2 sample_3

gene_1 count_1count_1 count_1

gene_2 count_2 count_2 count_2

gene_5 count_3 count_3 count_3

gene_6 count_4 count_4 count_4

file_b

gene_name start end gene_length

gene_1 start_1 end_1 length_1

gene_2 start_2 end_2 length_2

gene_3 start_3 end_3 length_3

gene_4 start_4 end_4 length_4

gene_5 start_5 end_5 length_5

gene_6 start_6 end_6 length_6

gene_7 start_7 end_7 length_7

I want to get the gene lengths for all the genes present in file_a.

I tried using grep, but I don't think there is a way to grep column 1 of file_a with file_b

I also tried grep by taking only the first 2 columns, but it didn't work. Is there a simpler way?

awk grep • 267 views
ADD COMMENTlink modified 4 months ago by cpad01124.5k • written 4 months ago by vinayjrao100
1
gravatar for cpad0112
4 months ago by
cpad01124.5k
cpad01124.5k wrote:

output:

$ join -1 1 -2 1 test1.txt test2.txt -t $'\t'
gene_name   sample_1    sample_2    sample_3    start   end gene_length
gene_1  count_1 count_1 count_1 start_1 end_1   length_1
gene_2  count_2 count_2 count_2 start_2 end_2   length_2
gene_5  count_3 count_3 count_3 start_5 end_5   length_5
gene_6  count_4 count_4 count_4 start_6 end_6   length_6

To get gene length only:

$ join -1 1 -2 1 test1.txt test2.txt -o 1.1,1.2,1.3,1.4,2.4  -t $'\t' 
gene_name   sample_1    sample_2    sample_3    gene_length
gene_1  count_1 count_1 count_1 length_1
gene_2  count_2 count_2 count_2 length_2
gene_5  count_3 count_3 count_3 length_5
gene_6  count_4 count_4 count_4 length_6

input:

$ cat test1.txt 
gene_name   sample_1    sample_2    sample_3
gene_1  count_1 count_1 count_1
gene_2  count_2 count_2 count_2
gene_5  count_3 count_3 count_3
gene_6  count_4 count_4 count_4

$ cat test2.txt 
gene_name   start   end gene_length
gene_1  start_1 end_1   length_1
gene_2  start_2 end_2   length_2
gene_3  start_3 end_3   length_3
gene_4  start_4 end_4   length_4
gene_5  start_5 end_5   length_5
gene_6  start_6 end_6   length_6
gene_7  start_7 end_7   length_7
ADD COMMENTlink modified 4 months ago • written 4 months ago by cpad01124.5k
0
gravatar for Charles Plessy
4 months ago by
Charles Plessy2.4k
Japan
Charles Plessy2.4k wrote:

If your data is not too large, a Unix shell command may be enough:

Here I create files containing your example data, assuming you really meant that it is space-separated.

cat > counts.txt <<_END_
gene_name sample_1 sample_2 sample_3
gene_1 count_1count_1 count_1
gene_2 count_2 count_2 count_2
gene_5 count_3 count_3 count_3
gene_6 count_4 count_4 count_4
_END_

cat > lengths.txt <<_END_
gene_name start end gene_length
gene_1 start_1 end_1 length_1
gene_2 start_2 end_2 length_2
gene_3 start_3 end_3 length_3
gene_4 start_4 end_4 length_4
gene_5 start_5 end_5 length_5
gene_6 start_6 end_6 length_6
gene_7 start_7 end_7 length_7
_END_

_(By the way, providing example data in such a way facilitates the task of those who try to answer your question. Have a try next time !)_

Then, I drop the first (header) line of your first table (sed 1d), I cut the first column, I pass it to grep via the standard input as a list of patterns to match (--file -), I specify that the matches must cover the whole word (otherwise, gene_1 would also match gene_12, etc.), I filter the second table and cut its third column, which contains the lengths.

sed 1d counts.txt |
  cut -f1 -d ' ' |
  grep --file - --word-regexp lengths.txt |
  cut -f4 -d ' '

Beware that it will all break apart if there are variations in the format of your tables.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Charles Plessy2.4k

Dear Charles,

Thanks for your answer. My data is tab-separated, but I shall try the suggestion and update you.

-edit-

I tried the suggestion, but wc -l file_a gives 24424, while wc -l of the output gives 23133 lines.

file_b was downloaded from ucsc (known_gene, latest version). Could there be entries missing in it?

Thanks.

ADD REPLYlink modified 4 months ago • written 4 months ago by vinayjrao100

You can also try following command

awk '{print $1 }' file_a | grep --file - file_b | awk '{print $1"\t"$NF}'

ADD REPLYlink modified 4 months ago • written 4 months ago by Suraj Mahendra Metha50

Dear Suraj,

Thanks for the suggestion, but could you please explain what the command does, because wc -l gave me 36184, while wc -l file_a gives 24424.

ADD REPLYlink written 4 months ago by vinayjrao100

Dear vinayjrao, Can you please upload the part of file and share link with me.

ADD REPLYlink written 4 months ago by Suraj Mahendra Metha50

Do you see the same numbers when following the very good answer of cpad0112 ?

ADD REPLYlink written 3 months ago by Charles Plessy2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 985 users visited in the last hour