Question: Picking out the first occurrence of a gene
vinayjrao (JNCASR, India) wrote, 14 months ago:

I have a file:

gene_name chr start end
FAM138A chr1 34553 36081
FAM138A chr1 35244 36073
OR4F5 chr1 69090 70008
RP11-34P13.7 chr1 89294 120932
RP11-34P13.8 chr1 89550 91105
RP11-34P13.7 chr1 92229 129217

I want to pick out the first occurrence of each gene as it would give me the longest transcript. Any help on doing the same would be appreciated.

Thank you.


What have you tried? It's good practice to show the effort you took to solve this issue, rather than just asking us to solve it completely.

e.g. if you show a bit of Python code I could fix it for you, or show your awk code and you'll automatically summon Pierre Lindenbaum

written 14 months ago by WouterDeCoster

I have been trying grep. I don't understand awk very well, so I am keeping that as an option, and I have no understanding of Python. I tried grep --max-count=1 "FAM138A" filename and got the desired result, but I want to know how to automate this for each gene.

Thanks again.

written 14 months ago by vinayjrao

Is this thread helpful? https://unix.stackexchange.com/questions/160009/remove-entire-row-in-a-file-if-first-column-is-repeated I found it by googling "only keep unique rows based on column unix".
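A common awk idiom for keeping only the first row per value of a column is !seen[$1]++. The sketch below applies it to the sample data from the question; it is not necessarily the exact command given in the linked thread, and the temporary file and the tab separator are assumptions made for illustration.

```shell
# Recreate the sample file from the question (tab-separated; an assumption).
input=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  gene_name chr start end \
  FAM138A chr1 34553 36081 \
  FAM138A chr1 35244 36073 \
  OR4F5 chr1 69090 70008 \
  RP11-34P13.7 chr1 89294 120932 \
  RP11-34P13.8 chr1 89550 91105 \
  RP11-34P13.7 chr1 92229 129217 > "$input"

# seen[$1]++ evaluates to 0 the first time a gene name appears, so the
# negation is true only for the first row of each gene; later rows are skipped.
# The header survives because "gene_name" occurs only once in column 1.
first_rows=$(awk '!seen[$1]++' "$input")
printf '%s\n' "$first_rows"

rm -f "$input"
```

This reads the file once and keeps the rows in their original order, so no sorting is needed.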

written 14 months ago by WouterDeCoster

That worked perfectly. Thank you very much

written 14 months ago by vinayjrao

finswimmer (Germany) wrote, 14 months ago:

Hello,

you can do something like this:

cut -f 1 filename|tail -n+2|sort|uniq|parallel grep --max-count=1 {} filename

cut -f 1 filename gives us the first column with the gene names.

With tail -n+2 we get rid of the first line containing the header.

We then sort the list of gene names, because uniq only detects duplicates on adjacent lines.

So we end up with a list of all unique gene names. Using parallel, we pass each name in this list to grep, which prints the first line that matches it.
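One caveat with this pipeline: grep treats each gene name as a regular expression and matches it anywhere on the line, so the dot in a name such as RP11-34P13.7 matches any character, and a name that is a prefix of another name could pick up the wrong row. A minimal sketch of the same idea with an exact comparison on column 1 follows. It assumes a tab-separated file, runs sequentially instead of using parallel, and uses a temporary file purely for illustration.

```shell
# Recreate the sample file from the question (tab-separated; an assumption).
input=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  gene_name chr start end \
  FAM138A chr1 34553 36081 \
  FAM138A chr1 35244 36073 \
  OR4F5 chr1 69090 70008 \
  RP11-34P13.7 chr1 89294 120932 \
  RP11-34P13.8 chr1 89550 91105 \
  RP11-34P13.7 chr1 92229 129217 > "$input"

# List the unique gene names (header skipped), then for each name print the
# first row whose first column equals it exactly and stop reading the file.
# ($1 "") forces a string comparison rather than a numeric one.
first_rows=$(cut -f 1 "$input" | tail -n +2 | LC_ALL=C sort -u |
  while IFS= read -r gene; do
    awk -F '\t' -v g="$gene" '($1 "") == g { print; exit }' "$input"
  done)
printf '%s\n' "$first_rows"

rm -f "$input"
```

The output is ordered by gene name rather than by position in the file, and the header line is not included.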

fin swimmer

written 14 months ago by finswimmer

cpad0112 (India) wrote, 14 months ago:

Try this. Output:

$ datamash -sH -g1,2 first 3 first 4 < test.txt
GroupBy(gene_name)  GroupBy(chr)    first(start)    first(end)
FAM138A chr1    34553   36081
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105

input:

$ cat test.txt 
gene_name   chr start   end
FAM138A chr1    34553   36081
FAM138A chr1    35244   36073
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105
RP11-34P13.7    chr1    92229   129217
written 14 months ago by cpad0112

To format the output header:

$ datamash -sH  -g1,2 first 3 first 4  < test.txt | sed '1 s/\w\+\W\(\w\+\)\W/\1/g' 
gene_name   chr start   end
FAM138A chr1    34553   36081
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105
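
The sed expression above relies on \w and \+, which are GNU extensions. A possible alternative using extended regular expressions (sed -E, available in both GNU and BSD sed) is sketched below. To stay self-contained it operates on a copy of the header line shown in the datamash output rather than calling datamash, and the tab separators are an assumption.

```shell
# The header line as printed by datamash in the output above (tabs assumed).
header=$(printf 'GroupBy(gene_name)\tGroupBy(chr)\tfirst(start)\tfirst(end)')

# On line 1, replace every "operation(column)" with just "column".
clean=$(printf '%s\n' "$header" | sed -E '1 s/[A-Za-z]+\(([^)]*)\)/\1/g')
printf '%s\n' "$clean"
```

In a real pipeline the same sed command would simply be appended after the datamash call.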
written 14 months ago by cpad0112