Question: Specific string extraction using Excel, R or Python
sw.arker (Germany) wrote:

Hello,

I want to extract specific strings from a CSV file, using either Excel, R or Python.

For example, as shown below: I want to find the string from column A in column B and return it in column C with 5 amino acids before and after N. Thanks!

A:
INETTDFR

B (one protein sequence, wrapped over several lines):
MHRFLLMLLFPFSDNRPMMFFRSFIVFFFLIFFASNVSSRKQTYVIHT
VTTSTKHIVTSLFNSLQTENINDDDFSLPEIHYIYENAMSGFSATLTDDQLDT
VKNTKGFISAYPDELLSLHTTYSHEFLGLEFGIGLWNETSLSSDVIIGLVDTG
ISPEHVSFRDTHMTPVPSRWRGSCDEGTNFSSSECNKKIIGASAFYKGYE
SIVGKINETTDFRSTRDAQGHGTHTASTAAGDIVPKANYFGQAKGLASGM
RFTSRIAAYKACWALGCASTDVIAAIDRAILDGVDVISLSLGGSSRPFYVDP
IAIAGFGAMQKNIFVSCSAGNSGPTASTVSNGAPWLMTVAASYTDRTFPAIV
RIGNRKSLVGSSLYKGKSLKNLPLAFNRTAGEESGAVFCIRDSLKRELVEGK
IVICLRGASGRTAKGEEVKRSGGAAMLLVSTEAEGEELLADPHVLPAVSLGF
SDGKTLLNYLAGAANATASVRFRGTAYGATAPMVAAFSSRGPSVAGPEIAKP
DIAAPGLNILAGWSPFSSPSLLRSDPRRVQFNIISGTSMACPHISGIAALIKSV
HGDWSPAMIKSAIMTTARITDNRNRPIGDRGAAGAESAATAFAFGAGNVDPT
RAVDPGLVYDTSTVDYLNYLCSLNYTSERILLFSGTNYTCASNAVVLSPGDLN
YPSFAVNLVNGANLKTVRYKRTVTNVGSPTCEYMVHVEEPKGVKVRVEPKVL
KFQKARERLSYTVTYDAEASRNSSSSSFGVLVWICDKYNVRSPIAVTWE

C:
IVGKINETTDF

 

python excel R • 2.8k views
modified 10 months ago by RamRS • written 4.5 years ago by sw.arker

OK, so what have you tried? This should be pretty straightforward in Python or R (no clue about Excel).

written 4.5 years ago by Devon Ryan

The provided example doesn't seem to make sense:

A = INETTDFR

B = ...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST...

C = IVGKINETTDF

 

written 4.5 years ago by zx8754

The goal is to extract the string from B (the whole protein sequence) according to A (the identified peptide sequence window), and output it as C, but taking an additional 5 amino acids before and after N.

written 4.5 years ago by sw.arker

The comment was due to the example not including the 5 amino acids following the matched string (in fact, it didn't even include the entire matched string).
 

written 4.5 years ago by Devon Ryan

The output does not include the entire string A, that's true, because the aim is to get the sequence window with "N" in the middle and 5 aa in front and 5 aa after. The originally identified peptide string A did not provide a uniform peptide sequence window with "N" in the middle. That's what I am trying to get by pulling out the specific sequence window from the original protein sequence.
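
The window described here can be sketched in Python roughly as follows. This is only a minimal sketch: the function name `residue_window` and its parameters are made up for illustration, it uses the first occurrence of A in B and the first N in A, and the protein string is just the fragment of B quoted earlier in the thread (without the ellipses).

```python
def residue_window(peptide, protein, residue="N", flank=5):
    """Return `flank` residues on each side of `residue`, located via `peptide`.

    Hypothetical helper: uses the first match of `peptide` in `protein`
    and the first `residue` inside `peptide`.
    """
    start = protein.find(peptide)    # 0-based start of A within B, -1 if absent
    offset = peptide.find(residue)   # 0-based position of N within A, -1 if absent
    if start == -1 or offset == -1:
        return None
    centre = start + offset          # 0-based position of N within B
    left = max(0, centre - flank)    # clamp so a window near the start is just shorter
    return protein[left:centre + flank + 1]


peptide = "INETTDFR"
protein = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"
print(residue_window(peptide, protein))  # IVGKINETTDF
```

If the N sits closer than five residues to either end of the protein, this sketch returns a shorter window rather than failing; whether that is the desired behaviour is a judgment call.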

written 4.5 years ago by sw.arker

Ah, both zx8754 and I misread then. I think we both took N as a variable needle.

written 4.5 years ago by Devon Ryan

Sorry for the confusion. The N (asparagine) is potentially glycosylated, and I am trying to get a uniform peptide window with the N in the middle. That makes it easier and more suitable for pattern analysis.

written 4.5 years ago by sw.arker

I modified my script to anchor on the residue (or residues) of interest within A. I think it should work but you probably would want to test it, first.
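
The script referred to here is not reproduced in this thread, so the following is not that code. It is only a rough Python sketch of what anchoring on one or more residues of interest within A might look like, with a hypothetical function name (`residue_windows`) and the fragment of B quoted earlier in the thread as test input.

```python
def residue_windows(peptide, protein, residue="N", flank=5):
    """Return one window per occurrence of `residue` inside `peptide`.

    Hypothetical helper: only the first match of `peptide` in `protein` is used.
    """
    start = protein.find(peptide)  # 0-based start of A within B, -1 if absent
    if start == -1:
        return []
    windows = []
    for offset, amino_acid in enumerate(peptide):
        if amino_acid == residue:
            centre = start + offset  # 0-based position of this residue within B
            windows.append(protein[max(0, centre - flank):centre + flank + 1])
    return windows


protein = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"
print(residue_windows("INETTDFR", protein))                # one N, so one window
print(residue_windows("INETTDFR", protein, residue="T"))   # two T's, so two windows
```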

written 4.5 years ago by Alex Reynolds

Thanks, Alex! I will definitely try it. By the way, I just found another, more complicated way to do it using Excel and a regular expression tool, just as my backup ;-)

written 4.5 years ago by sw.arker

This is a basic programming task; check out the csv module in Python.
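
As a rough illustration of the csv-module route, here is a minimal sketch. It assumes a header-less file with the peptide in the first column and the full protein sequence in the second; the helper names are hypothetical, and `io.StringIO` stands in for real files so the example runs on its own.

```python
import csv
import io


def residue_window(peptide, protein, residue="N", flank=5):
    """Hypothetical helper: window of `flank` residues around the first `residue` in `peptide`."""
    start, offset = protein.find(peptide), peptide.find(residue)
    if start == -1 or offset == -1:
        return ""  # leave column C empty when nothing matches
    centre = start + offset
    return protein[max(0, centre - flank):centre + flank + 1]


def add_window_column(infile, outfile):
    """Copy columns A and B from `infile` and append the window as column C."""
    writer = csv.writer(outfile, lineterminator="\n")
    for row in csv.reader(infile):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        peptide, protein = row[0], row[1]
        writer.writerow([peptide, protein, residue_window(peptide, protein)])


# In-memory stand-ins for the input and output CSV files.
source = io.StringIO("INETTDFR,IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST\n")
result = io.StringIO()
add_window_column(source, result)
print(result.getvalue(), end="")
```

With real files, the two `StringIO` objects would be replaced by files opened with `open(..., newline="")`, as the csv module documentation recommends.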

written 4.5 years ago by pld

Thanks Alex!!

I will try it and modify it accordingly. 

cheers!

written 4.5 years ago by sw.arker
zx8754 (London) wrote:

Using Excel*:

=MID(B1,FIND(A1,B1)+FIND("N",A1)-6,5)&"N"&MID(B1,FIND(A1,B1)+FIND("N",A1),5)

*Don't use Excel :)
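
To see what the index arithmetic in the formula does, it can be mirrored step by step in Python. This is only an illustrative sketch: `find` and `mid` are small hypothetical stand-ins for Excel's 1-based FIND and MID, and the B1 value is the fragment of B quoted earlier in the thread.

```python
def find(needle, haystack):
    """1-based position of `needle`, like Excel FIND; raises ValueError if absent."""
    return haystack.index(needle) + 1


def mid(text, start, length):
    """Substring with a 1-based `start`, like Excel MID (assumes start >= 1)."""
    return text[start - 1:start - 1 + length]


a1 = "INETTDFR"
b1 = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"

# FIND(A1,B1) + FIND("N",A1) - 1 is the 1-based position of the N within B1.
n_pos = find(a1, b1) + find("N", a1) - 1

# The formula's "-6" is n_pos - 5 (five residues before the N);
# its second MID starts at n_pos + 1 (the five residues after the N).
c1 = mid(b1, n_pos - 5, 5) + "N" + mid(b1, n_pos + 1, 5)
print(c1)  # IVGKINETTDF
```

Like the formula, this sketch does not handle an N that lies fewer than five residues from the start of B1.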

modified 4.5 years ago • written 4.5 years ago by zx8754

Super!! Thanks zx8754, it works. Actually, I was using a small part of the same code in combination with a regular expression web tool to accomplish the job ;-)

written 4.5 years ago by sw.arker

I am a rookie in coding (mostly dependent on Excel), but more and more I find that for bigger and more complicated tasks, coding makes things easier and more productive!!

written 4.5 years ago by sw.arker
Guangchuang Yu (China/Hong Kong/The University of Hong Kong) wrote:

Using R:

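# Return A as matched in B, together with the N residues that precede it in B.
# Note: N here is a count of residues, not the amino acid asparagine.
getPreceding <- function(A, B, N = 4) {
  x <- regexpr(A, B)  # 1-based start of the first match of A in B
  substring(B, x - N, x + attr(x, "match.length") - 1)
}

A <- "INETTDFR"
B <- "...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST..."
getPreceding(A, B)
# [1] "IVGKINETTDFR"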
modified 10 months ago by zx8754 • written 4.5 years ago by Guangchuang Yu