Question: Extract a word from inside the text
21 months ago by
mostafarafiepour70 wrote:

Hi all,

I have a text file like the one below, and I want to extract the gene names.

for example:

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

from below input:

ID=id18056;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id18065;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=TAGLN3;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=SMG6;product=nectin-3;protein
ID=id18057;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=ERICH1;product=nectin cell adhesion molecule
ID=id18066;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=DLGAP2;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein

What is the best way to do this?

regex awk R • 720 views
ADD COMMENTlink modified 21 months ago by cpad011213k • written 21 months ago by mostafarafiepour70
21 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

To extract gene names:

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 
NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

To extract unique gene names:

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
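For readers unsure how the deduplication step attaches to the pipeline, here is a minimal sketch. The sample lines are made up for illustration (one gene appears twice), the helper name unique_genes is hypothetical, and sort -u is used as a shorthand for sort | uniq.

```shell
# Hypothetical sample written to a temporary file: NECTIN3 appears on two lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID=id1;Parent=rna1;gbkey=mRNA;gene=NECTIN3;product=a
ID=id2;Parent=rna2;gbkey=CDS;gene=SMG6;product=b
ID=id3;Parent=rna1;gbkey=CDS;gene=NECTIN3;product=c
EOF

# Put each attribute on its own line, keep the gene= ones, take the value,
# then sort and drop duplicates in one step.
unique_genes() {
    tr ';' '\n' < "$1" | grep '^gene=' | cut -d '=' -f2 | sort -u
}

unique_genes "$sample"
```

Replacing sort -u with sort | uniq -c would additionally show how many times each name occurs.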
ADD COMMENTlink modified 21 months ago by zx87549.4k • written 21 months ago by Pierre Lindenbaum129k

Many thanks for all the answers.

All of them were great.

ADD REPLYlink written 21 months ago by mostafarafiepour70

I have now extracted the gene names, but there is a problem: a gene may occur at several positions, so its name appears several times in the output.

Any suggestions?

ADD REPLYlink written 21 months ago by mostafarafiepour70
sort | uniq

.

ADD REPLYlink written 21 months ago by Pierre Lindenbaum129k

Sorry, how do I use sort | uniq?

Do you mean adding it to the end of the previous command?

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
ADD REPLYlink written 21 months ago by mostafarafiepour70
4

mostafarafiepour, with all due respect: Invest time and search for these absolutely basic answers yourself. This is a bioinformatics Q&A community, intended to help with bioinformatics-related problems, not a basic Unix learning platform. You are lucky people actually answer these kinds of questions. Again, with respect, but if you are already stuck with these most simple things, I am worried that you will run into some severe trouble once analysis gets beyond executing basic Unix scripts. Learn the basics first, plenty of open-source material online on this.

ADD REPLYlink modified 21 months ago • written 21 months ago by ATpoint36k
21 months ago by
ahmad mousavi480
Royan Institute, Tehran, Iran
ahmad mousavi480 wrote:

Hi

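Use this code:

# suppose df is a character vector holding the lines of your file
df <- gsub(".*gene=", "", df)  # drop everything up to and including "gene="
df <- gsub("[*].*", "", df)    # drop everything from the first "*" onward
# (if the fields are separated by ";" rather than "**", use ";.*" as the second pattern)

Or split the lines on the ** delimiter.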

ADD COMMENTlink written 21 months ago by ahmad mousavi480

I modified the text file. There is no longer a ** before and after the gene name.

ADD REPLYlink written 21 months ago by mostafarafiepour70
21 months ago by
lakhujanivijay5.1k
India
lakhujanivijay5.1k wrote:

Super fast and easy with grep and a regex:

grep -P '(?<=\*\*gene=)\w+(?=\*\*)' -o gene.txt

where gene.txt is your file name

Output

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

Explanation

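-P Use Perl-compatible regular expressions (PCRE), which support lookarounds

(?<=...) Positive lookbehind: the match must be preceded by this text, which is not included in the output

(?=...) Positive lookahead: the match must be followed by this text, which is not included in the output

-o Output only the part of each line that matched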
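As a minimal sketch of the lookbehind on the ;-delimited input currently shown in the question (rather than the ** form this answer was written for): the sample line is abbreviated from the question, and -P is assumed to be available, which needs a grep built with PCRE support (GNU grep usually is).

```shell
# One abbreviated line in the ';'-delimited form from the question.
line='ID=cds1149;Name=XP_006074022.1;gbkey=CDS;gene=SMG6;product=nectin-3;protein'

# (?<=;gene=) requires ";gene=" immediately before the match without
# including it in the output; \w+ then matches the name and stops at ';'.
gene=$(printf '%s\n' "$line" | grep -oP '(?<=;gene=)\w+')
echo "$gene"
```

No lookahead is needed here because \w cannot match the ; that ends the value.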

ADD COMMENTlink modified 21 months ago • written 21 months ago by lakhujanivijay5.1k

Excuse me, what is the input file? You only specify the output.

ADD REPLYlink written 21 months ago by mostafarafiepour70

gene.txt is the input file. The output is written to standard output (stdout).

ADD REPLYlink written 21 months ago by lakhujanivijay5.1k

Why do you have the ** in the positive lookbehind assertion? And why do you need the positive lookahead assertion?

grep -oP "(?<=gene=)[^;]+" will suffice, no?

EDIT: cpad is correct, we don't even need the [^;], this will suffice: grep -oP '(?<=;gene=)\w+'

EDIT2: Turns out OP changed data after posting a snippet with the **.

ADD REPLYlink modified 21 months ago • written 21 months ago by RamRS28k

I think your script should change this way.

grep -P '(?<=\;gene=)\w+(?=\;)' -o gene.txt
ADD REPLYlink modified 21 months ago • written 21 months ago by mostafarafiepour70

I get the single quotes and the inclusion of a semi-colon to account for other attributes that may end in gene=, but why include a positive lookahead for a semi-colon?

Also, grep -oP <pattern> <file> is equivalent to grep -P <pattern> -o <file>, as neither -o nor -P is a positional argument.

ADD REPLYlink modified 21 months ago • written 21 months ago by RamRS28k

or this : grep -oP "(?<=gene=)\w+" test.txt ?

ADD REPLYlink modified 21 months ago • written 21 months ago by cpad011213k

Or yes, this. I'd forgotten that \w does not match ;. Thanks, cpad!

ADD REPLYlink modified 21 months ago • written 21 months ago by RamRS28k
21 months ago by
cpad011213k
India
cpad011213k wrote:
$ sed 's/.*gene=\(\w\+\);.*/\1/g' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
ADD COMMENTlink written 21 months ago by cpad011213k
1

with awk:

$ awk -F'gene=|;prod' '{print $2}' test.txt

or

$ awk 'gsub(/.*gene=|;product.*/,"")' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
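
Both one-liners above appear to rely on product= directly following the gene attribute. A sketch that does not depend on attribute order, assuming only that attributes are separated by ; and written as key=value (the helper name extract_gene is hypothetical):

```shell
# Hypothetical helper: print the value of the "gene" attribute on each line.
# Reads the files given as arguments, or standard input if there are none.
extract_gene() {
    awk -F';' '{
        for (i = 1; i <= NF; i++) {
            # Split each attribute at its first "=" into key and value.
            eq = index($i, "=")
            if (eq && substr($i, 1, eq - 1) == "gene") {
                print substr($i, eq + 1)
                break
            }
        }
    }' "$@"
}

printf '%s\n' 'ID=id18056;Parent=rna1456;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule' | extract_gene
```

Lines without a gene attribute produce no output, whereas the gsub one-liner skips them only because awk treats a zero substitution count as false.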
ADD REPLYlink modified 21 months ago • written 21 months ago by cpad011213k

sed -r is your friend :-)

sed -r 's/.*gene=(\w+);.*/\1/g' test.txt

Although you may wish to add a ; before gene and omit the .* after the second ; :-)
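
One possible reading of that suggestion, as a sketch rather than a definitive version: anchor on ;gene=, capture up to the next ;, and print only the lines where the substitution succeeded. It assumes a sed that accepts -E (GNU sed's -r is a synonym), and the helper name extract_gene_sed is hypothetical.

```shell
# -n suppresses the default output; the trailing p prints a line only
# if the substitution was made, so lines without ;gene= are dropped.
extract_gene_sed() {
    sed -nE 's/.*;gene=([^;]+).*/\1/p' "$@"
}

printf '%s\n' \
    'ID=1;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein' \
    'ID=2;gbkey=CDS;product=none' | extract_gene_sed
```

Using [^;]+ instead of \w+ avoids the GNU-specific \w and also captures names containing characters such as hyphens.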

ADD REPLYlink written 21 months ago by RamRS28k
1

I guess you have posted several times about the -r option, and I keep forgetting to use it. Thanks, RamRS.

ADD REPLYlink written 21 months ago by cpad011213k

Not several, maybe just once more. Once you go -r, you never go back. It's like grep -E. So handy and convenient, makes you wonder why plain grep even exists :-)

ADD REPLYlink written 21 months ago by RamRS28k