Question: Extract a word from inside the text
0
6 months ago by
mostafarafiepour (60) wrote:

Hi all,

I have a text file like the one below. I want to extract the gene names.

for example:

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

from below input:

ID=id18056;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id18065;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=TAGLN3;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=SMG6;product=nectin-3;protein
ID=id18057;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=ERICH1;product=nectin cell adhesion molecule
ID=id18066;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=DLGAP2;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein

What is the best way to do this?

regex awk R • 373 views
modified 6 months ago by cpad0112 (11k) • written 6 months ago by mostafarafiepour (60)
3
6 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum (120k) wrote:

To extract gene names:

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 
NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

To extract unique gene names

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
modified 6 months ago by zx8754 (7.3k) • written 6 months ago by Pierre Lindenbaum (120k)

Many thanks for all the answers.

All the answers were great.

written 6 months ago by mostafarafiepour (60)

Now I've extracted the gene names. But there is a problem: a gene may occur at several positions, so its name is repeated several times.

Any suggestions?

written 6 months ago by mostafarafiepour (60)
sort | uniq

.
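To make the placement concrete, here is a minimal sketch: the extra stage is joined with a pipe after cut, and sort -u is an equivalent shorthand for sort | uniq. The file name sample.txt and its shortened lines are made up for illustration, with one gene repeated on purpose.

```shell
# Illustrative input (assumption): shortened lines, NECTIN3 repeated on purpose
cat > sample.txt <<'EOF'
ID=id18056;Parent=rna1456;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id18065;Parent=rna1457;gbkey=misc_RNA;gene=TAGLN3;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;gbkey=CDS;gene=NECTIN3;product=nectin-3;protein
EOF

# Every stage is joined to the next with a pipe, including cut -> sort
unique_genes=$(tr ';' '\n' < sample.txt | grep '^gene=' | cut -d '=' -f2 | sort | uniq)

# sort -u does the same job as sort | uniq
unique_genes_short=$(tr ';' '\n' < sample.txt | grep '^gene=' | cut -d '=' -f2 | sort -u)

printf '%s\n' "$unique_genes"
```

uniq only collapses adjacent duplicates, which is why the sort has to come first.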

written 6 months ago by Pierre Lindenbaum (120k)

Sorry, how do I use sort | uniq?

Do you mean to add it to the previous script?

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
written 6 months ago by mostafarafiepour (60)
4

mostafarafiepour, with all due respect: Invest time and search for these absolutely basic answers yourself. This is a bioinformatics Q&A community, intended to help with bioinformatics-related problems, not a basic Unix learning platform. You are lucky people actually answer these kinds of questions. Again, with respect, but if you are already stuck with these most simple things, I am worried that you will run into some severe trouble once analysis gets beyond executing basic Unix scripts. Learn the basics first, plenty of open-source material online on this.

modified 6 months ago • written 6 months ago by ATpoint (16k)
1
6 months ago by
Royan Institute, Tehran, Iran
ahmad mousavi (420) wrote:

Hi

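Use this code:

# suppose df is your table (gsub expects a character vector, e.g. from readLines)
df <- gsub(".*gene=", "", df)   # drop everything up to and including "gene="
df <- gsub("[*].*", "", df)     # drop everything from the first "*" onward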

Or split the lines using the ** characters as the delimiter.

written 6 months ago by ahmad mousavi (420)

I modified the text file. There is no ** before or after the gene name any more.

written 6 months ago by mostafarafiepour (60)
1
6 months ago by
India
Vijay Lakhujani (4.1k) wrote:

Super fast and easy with grep and a regular expression:

grep -P '(?<=\*\*gene=)\w+(?=\*\*)' -o gene.txt

where gene.txt is your file name

Output

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

Explanation

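-P Interpret the pattern as a Perl-compatible regular expression (PCRE)

(?<=...) Positive lookbehind: the match must be preceded by this text (left anchor)

(?=...) Positive lookahead: the match must be followed by this text (right anchor)

-o Output only the matched part of each line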

modified 6 months ago • written 6 months ago by Vijay Lakhujani (4.1k)

Excuse me, what is the input file? You only specify the output.

written 6 months ago by mostafarafiepour (60)

gene.txt is the input file. The output is written to standard output (stdout).

written 6 months ago by Vijay Lakhujani (4.1k)

Why do you have the ** in the positive lookbehind assertion? And why do you need the positive lookahead assertion?

grep -oP "(?<=gene=)[^;]+" will suffice, no?

EDIT: cpad is correct, we don't even need the [^;], this will suffice: grep -oP '(?<=;gene=)\w+'

EDIT2: Turns out OP changed data after posting a snippet with the **.

modified 6 months ago • written 6 months ago by RamRS (21k)

I think your script should be changed like this:

grep -P '(?<=\;gene=)\w+(?=\;)' -o gene.txt
modified 6 months ago • written 6 months ago by mostafarafiepour (60)

I get the single quotes and the inclusion of a semi-colon to account for other attributes that may end in gene=, but why include a positive lookahead for a semi-colon?

Also, grep -oP <pattern> <file> is equivalent to grep -P <pattern> -o <file>, as neither -o nor -P is a positional argument.
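The concern about attributes whose names merely end in gene= can also be handled without lookarounds, by splitting each line on ; and comparing the key exactly. A rough sketch with awk follows; the file name attrs.txt and the pseudogene= attribute are invented for illustration and do not come from the original data.

```shell
# Illustrative input (assumption): "pseudogene=" is a made-up attribute
# whose name merely ends in "gene="
cat > attrs.txt <<'EOF'
ID=id1;pseudogene=FOO1;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id2;gbkey=CDS;gene=SMG6;product=nectin-3;protein
EOF

# Split each line on ';' and print the value only when the key is exactly "gene"
genes=$(awk -F';' '{
  for (i = 1; i <= NF; i++) {
    eq = index($i, "=")                  # position of the first "=" (0 if none)
    if (eq > 0 && substr($i, 1, eq - 1) == "gene")
      print substr($i, eq + 1)
  }
}' attrs.txt)

printf '%s\n' "$genes"
```

This uses only POSIX awk features, so it should also work where grep lacks -P support.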

modified 6 months ago • written 6 months ago by RamRS (21k)

or this : grep -oP "(?<=gene=)\w+" test.txt ?

modified 6 months ago • written 6 months ago by cpad0112 (11k)

Or yes, this. I'd forgotten that \w does not match ;. Thanks, cpad!

modified 6 months ago • written 6 months ago by RamRS (21k)
0
6 months ago by
India
cpad0112 (11k) wrote:
$ sed 's/.*gene=\(\w\+\);.*/\1/g' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
written 6 months ago by cpad0112 (11k)
1

with awk:

$ awk -F'gene=|;prod' '{print $2}' test.txt

or

$ awk 'gsub(/.*gene=|;product.*/,"")' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
modified 6 months ago • written 6 months ago by cpad0112 (11k)

sed -r is your friend :-)

sed -r 's/.*gene=(\w+);.*/\1/g' test.txt

Although you may wish to add a ; before gene and omit the .* after the second ; :-)
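One thing to keep in mind with the substitution approach: sed prints lines that do not match the pattern unchanged. A possible sketch that anchors on ;gene= and prints only the lines where the substitution happened is shown here; the file name mixed.txt and the gene-less second line are made up for illustration, and -E is used as the more portable spelling of -r.

```shell
# Illustrative input (assumption): the second line has no gene attribute
cat > mixed.txt <<'EOF'
ID=id18056;Parent=rna1456;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id99999;Parent=rna0000;gbkey=mRNA;product=hypothetical protein
EOF

# -n suppresses default printing; the trailing p prints only lines where the
# substitution matched, so the gene-less line produces no output
genes=$(sed -nE 's/.*;gene=([A-Za-z0-9_]+).*/\1/p' mixed.txt)

printf '%s\n' "$genes"
```

The bracket expression stands in for \w, which is a GNU extension in sed.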

written 6 months ago by RamRS (21k)
1

I guess you have posted several times about the -r option, and I keep forgetting to use it. Thanks, RamRS.

written 6 months ago by cpad0112 (11k)

Not several, maybe just once more. Once you go -r, you never go back. It's like grep -E. So handy and convenient, makes you wonder why plain grep even exists :-)

written 6 months ago by RamRS (21k)