Picking out the first occurrence of a gene
6.3 years ago
vinayjrao ▴ 250

I have a file:

gene_name chr start end

FAM138A chr1 34553 36081

FAM138A chr1 35244 36073

OR4F5 chr1 69090 70008

RP11-34P13.7 chr1 89294 120932

RP11-34P13.8 chr1 89550 91105

RP11-34P13.7 chr1 92229 129217

I want to pick out the first occurrence of each gene as it would give me the longest transcript. Any help on doing the same would be appreciated.

Thank you.

grep awk • 1.3k views

What have you tried? It's good practice to show the effort you took to solve this issue, rather than just asking us to solve it completely.

For example, if you show a bit of Python code I could fix it for you, or show your awk code and you'll automatically summon Pierre Lindenbaum.


I have been trying grep. I don't understand awk very well, so I am keeping that as an option, and I have no understanding of Python. I tried grep --max-count=1 "FAM138A" filename and got the desired result, but I want to know how to automate this for each gene.

Thanks again.


Is this thread helpful? https://unix.stackexchange.com/questions/160009/remove-entire-row-in-a-file-if-first-column-is-repeated I found it by googling "only keep unique rows based on column unix".


That worked perfectly. Thank you very much
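
For readers who do not follow the link: a common awk idiom for this kind of task keeps a row only the first time the value in its first column is seen. The following is a minimal sketch, not necessarily the exact command from the linked thread. It assumes the table from the question is saved as genes.tsv (a hypothetical file name) and that the gene name is the first whitespace-separated field.

```shell
# Recreate the example table from the question as a tab-separated file.
# genes.tsv is a hypothetical file name used only for this sketch.
printf '%s\t%s\t%s\t%s\n' \
  gene_name chr start end \
  FAM138A chr1 34553 36081 \
  FAM138A chr1 35244 36073 \
  OR4F5 chr1 69090 70008 \
  RP11-34P13.7 chr1 89294 120932 \
  RP11-34P13.8 chr1 89550 91105 \
  RP11-34P13.7 chr1 92229 129217 \
  > genes.tsv

# seen[$1]++ evaluates to 0 the first time a gene name appears and to a
# positive number afterwards, so !seen[$1]++ is true exactly once per gene.
# The header survives because "gene_name" is also seen only once.
first_hits=$(awk '!seen[$1]++' genes.tsv)
printf '%s\n' "$first_hits"
```

This reads the file once, preserves the original row order, and compares whole field values rather than patterns, so a gene name that is a prefix of another is not a problem.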

6.3 years ago

Hello,

you can do something like this:

cut -f 1 filename | tail -n +2 | sort | uniq | parallel grep --max-count=1 {} filename

cut -f 1 filename gives us the first column, which contains the gene names.

With tail -n +2 we get rid of the first line, which contains the header.

We then sort the list of gene names, because uniq only detects duplicates on adjacent lines.

So we end up with a list of all gene names, each appearing once. Using parallel, we pass each name to grep, which prints the first line of the file that matches it.
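
One caveat with the steps above is that grep treats each gene name as a pattern and searches the whole line, so a name such as RP11-34P13.7 (where "." matches any character) or a name that is a prefix of another could in principle match the wrong row. As a hedged variant, the same steps can be sketched with a plain shell loop and an exact comparison on the first column. The file name genes.tsv is hypothetical, and the sketch assumes tab-separated input with one header line.

```shell
# Recreate the example table from the question as a tab-separated file.
# genes.tsv is a hypothetical file name used only for this sketch.
printf '%s\t%s\t%s\t%s\n' \
  gene_name chr start end \
  FAM138A chr1 34553 36081 \
  FAM138A chr1 35244 36073 \
  OR4F5 chr1 69090 70008 \
  RP11-34P13.7 chr1 89294 120932 \
  RP11-34P13.8 chr1 89550 91105 \
  RP11-34P13.7 chr1 92229 129217 \
  > genes.tsv

# List each gene name once (header skipped), then for each name print the
# first row whose first column equals it exactly and stop reading the file.
per_gene=$(
  cut -f 1 genes.tsv | tail -n +2 | sort -u |
  while IFS= read -r gene; do
    awk -F '\t' -v g="$gene" '$1 == g { print; exit }' genes.tsv
  done
)
printf '%s\n' "$per_gene"
```

As with the original pipeline, the rows come out in sorted gene-name order rather than file order, and the file is scanned once per gene, which is fine for small tables but slow for large ones.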

fin swimmer

6.3 years ago

Try this. Output:

$ datamash -sH  -g1,2 first 3 first 4  < test.txt 
GroupBy(gene_name)  GroupBy(chr)    first(start)    first(end)
FAM138A chr1    34553   36081
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105

input:

$ cat test.txt 
gene_name   chr start   end
FAM138A chr1    34553   36081
FAM138A chr1    35244   36073
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105
RP11-34P13.7    chr1    92229   129217

To format the output header:

$ datamash -sH  -g1,2 first 3 first 4  < test.txt | sed '1 s/\w\+\W\(\w\+\)\W/\1/g' 
gene_name   chr start   end
FAM138A chr1    34553   36081
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105
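
A note on the premise of the question: in the sample data the first RP11-34P13.7 row spans 120932 - 89294 = 31638 bases, while the second spans 129217 - 92229 = 36988, so the first occurrence is not always the longest interval. If the longest interval per gene is what is really wanted, it can be selected explicitly. The following is only a sketch under the assumptions that the file is tab-separated, has one header line, and that length is simply end minus start; genes.tsv is a hypothetical file name.

```shell
# Recreate the example table from the question as a tab-separated file.
# genes.tsv is a hypothetical file name used only for this sketch.
printf '%s\t%s\t%s\t%s\n' \
  gene_name chr start end \
  FAM138A chr1 34553 36081 \
  FAM138A chr1 35244 36073 \
  OR4F5 chr1 69090 70008 \
  RP11-34P13.7 chr1 89294 120932 \
  RP11-34P13.8 chr1 89550 91105 \
  RP11-34P13.7 chr1 92229 129217 \
  > genes.tsv

# Print the header, then keep for every gene the row with the largest
# (end - start). On ties the earlier row wins.
longest=$(awk -F '\t' '
  NR == 1 { print; next }
  {
    len = $4 - $3
    if (!($1 in best) || len > best_len[$1]) {
      if (!($1 in best)) order[++n] = $1   # remember first-seen order
      best[$1] = $0
      best_len[$1] = len
    }
  }
  END { for (i = 1; i <= n; i++) print best[order[i]] }
' genes.tsv)
printf '%s\n' "$longest"
```

On the sample table this keeps the 34553-36081 row for FAM138A but the 92229-129217 row for RP11-34P13.7, which differs from the first-occurrence result.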