Question: How to extract snp names?
3.8 years ago, prathikkv_1992 wrote:

Hi,

I am working on some programming projects for a biology lab. I need help extracting SNP names from a file that contains SNP names along with some quality information. One file, x.txt, has a snp_name column, but the other file, y.txt, has an ilm_name column. In y.txt, each ilm_name follows the pattern [SnpName]-[0-9]_[A-Z]_[A-Z]_[0-9]. How do I separate out only the SNP name from this pattern?

x.txt:

snp_name
1:10002775-GA
1:100152282-CT
1:100154376-GA
1:100154844-CA
1:100155035-AC
1:100155084-CT
1:100316615-CAG-C
1:10032154-AC
1:100336041-TAGAC-T
1:100340360-CT

y.txt:

ilm_name

1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0
1:100152282-CT-0_T_R_2299204377 1 100152282 G - 0 0 0 0
1:100154376-GA-0_B_R_2299204383 1 100154376 G + 0 0 0 0
1:100154844-CA-0_B_R_2299204393 1 100154844 C + 0 0 0 0
1:100155035-AC-0_T_F_2299204394 1 100155035 A + 0 0 0 0
1:100155084-CT-0_B_F_2299204396 1 100155084 G - 0 0 0 0
1:100182985-CA-0_T_F_2299204412 1 100182985 C + 0 0 0 0
1:100183042-AG-0_B_R_2299204415 1 100183042 A + 0 0 0 0
1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0
1:10032154-AC-0_B_R_2299176859 1 10032154 A + 0 0 0 0

It would help if you format the content carefully with the button "101010"

written 3.8 years ago by shenwei356

I formatted the text block for readability. It's always advisable (to make everything clear) to also show an example of what you aim to obtain.

written 3.8 years ago by WouterDeCoster
3.8 years ago, newbiebio wrote:

I tried a very stupid way, but it works. If you have Notepad++, just press "Ctrl+H" to open the Replace dialog and set the search mode to "Regular expression". In 'Find what', type in

(\w+)-(\w+)-(\w+) (\d+) (\d+) (\w+) ([+-]) (\d) (\d) (\d) (\d)

In 'Replace with', type in

\1-\2

Hope it helps..
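
For a version that can be rerun from a script, the same find-and-replace can be applied with Python's re.sub. This is only a sketch: it assumes the columns in y.txt are separated by single spaces, as the pattern above does, and the helper name strip_annotation is made up for illustration.

```python
import re

# The same pattern and replacement as typed into the Notepad++ dialog.
FIND = re.compile(r"(\w+)-(\w+)-(\w+) (\d+) (\d+) (\w+) ([+-]) (\d) (\d) (\d) (\d)")
REPLACE = r"\1-\2"


def strip_annotation(line):
    """Apply the find-and-replace to one line of y.txt (hypothetical helper)."""
    return FIND.sub(REPLACE, line.rstrip("\n"))


# Two lines copied from the y.txt sample in the question.
sample = [
    "1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0",
    "1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0",
]
for line in sample:
    print(strip_annotation(line))
```

Because \w does not match ":", the pattern starts matching after the chromosome prefix, and the prefix is left in place in front of the replacement text.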


That could definitely work. However, I would argue that "manual file tampering" is to be avoided, mainly because you can easily make mistakes and it's rather hard to reproduce the results later on.

written 3.8 years ago by WouterDeCoster
3.8 years ago, Steven Lakin (Fort Collins, CO, USA) wrote:

You've tagged Python in your question, so I'm going to provide that solution as well. What you're looking for is called a "capture group" in regex and is discussed in this Stack Overflow answer. If you know that the SNP name is the first or second match on each line, you can pull out that match by its index. Parentheses are typically used to specify what to capture within the regex.

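import re
with open("y.txt") as handle:
    for line in handle:
        # Capture everything before the "-<digit>_" suffix; skip the header and blank lines.
        match = re.match(r"(.+)-[0-9]_", line)
        if match:
            print(match.group(1))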
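
A slightly stricter variant, as a sketch: instead of capturing everything before the first "-<digit>_", it strips the full suffix described in the question from the first whitespace-separated field only. It assumes the final [0-9] part of the suffix is one or more digits, as in the sample data. The function name snp_name is hypothetical.

```python
import re

# Suffix described in the question: -[0-9]_[A-Z]_[A-Z]_[0-9]
# Assumption: the trailing numeric part is one or more digits.
SUFFIX = re.compile(r"-[0-9]_[A-Z]_[A-Z]_[0-9]+$")


def snp_name(line):
    """Return the SNP name from one y.txt line, or None if no suffix is found."""
    fields = line.split()
    if not fields:
        return None  # blank line
    name, n_subs = SUFFIX.subn("", fields[0])
    return name if n_subs else None  # None for header lines such as "ilm_name"


# Example line copied from the y.txt sample in the question.
print(snp_name("1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0"))
```

Anchoring the suffix at the end of the first field means indel names that contain extra hyphens, such as 1:100316615-CAG-C, are left intact.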
3.8 years ago, WouterDeCoster (Belgium) wrote:

You tagged python, but if I understood the question correctly, a simple grep -f piped to cut -f1 would also do the trick. It prints the first column of every line in y.txt that contains a name listed in x.txt:

grep -f x.txt y.txt | cut -f1 -d' '
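
Since the question is tagged Python, here is a rough equivalent of the grep approach, sketched under a few assumptions: names are compared exactly after stripping the suffix (grep matches substrings instead), the suffix ends in one or more digits, and the function name matching_ilm_names is invented for this example.

```python
import re

# Assumed suffix format, taken from the pattern given in the question.
SUFFIX = re.compile(r"-[0-9]_[A-Z]_[A-Z]_[0-9]+$")


def matching_ilm_names(snp_lines, ilm_lines):
    """Return the ilm_name of each y.txt line whose SNP name is listed in x.txt."""
    wanted = {line.strip() for line in snp_lines if line.strip()}
    hits = []
    for line in ilm_lines:
        fields = line.split()
        # Compare the suffix-free first column against the set of wanted names.
        if fields and SUFFIX.sub("", fields[0]) in wanted:
            hits.append(fields[0])
    return hits


# Small in-memory sample taken from the question's x.txt and y.txt.
x_lines = ["snp_name", "1:10002775-GA", "1:100316615-CAG-C"]
y_lines = [
    "ilm_name",
    "1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0",
    "1:100182985-CA-0_T_F_2299204412 1 100182985 C + 0 0 0 0",
    "1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0",
]
for name in matching_ilm_names(x_lines, y_lines):
    print(name)
```

With real files, open x.txt and y.txt and pass the two file handles in place of the lists. The set lookup keeps the comparison exact, so a name in x.txt cannot accidentally match a longer name in y.txt.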