Question

How to extract snp names?

0

7.3 years ago

prathikkv_1992 ▴ 10

Hi,

I am working on some programming projects for a biology lab. I need help extracting SNP names from a file that contains SNP names along with some quality information. One file, x.txt, has snp_name, but the other file, y.txt, has ilm_name. In y.txt, ilm_name follows the pattern [SnpName]-[0-9]_[A-Z]_[A-Z]_[0-9]. How do I separate out only the SNP name from this pattern?

x.txt:

snp_name
1:10002775-GA
1:100152282-CT
1:100154376-GA
1:100154844-CA
1:100155035-AC
1:100155084-CT
1:100316615-CAG-C
1:10032154-AC
1:100336041-TAGAC-T
1:100340360-CT

y.txt:

ilm_name
1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0
1:100152282-CT-0_T_R_2299204377 1 100152282 G - 0 0 0 0
1:100154376-GA-0_B_R_2299204383 1 100154376 G + 0 0 0 0
1:100154844-CA-0_B_R_2299204393 1 100154844 C + 0 0 0 0
1:100155035-AC-0_T_F_2299204394 1 100155035 A + 0 0 0 0
1:100155084-CT-0_B_F_2299204396 1 100155084 G - 0 0 0 0
1:100182985-CA-0_T_F_2299204412 1 100182985 C + 0 0 0 0
1:100183042-AG-0_B_R_2299204415 1 100183042 A + 0 0 0 0
1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0
1:10032154-AC-0_B_R_2299176859 1 10032154 A + 0 0 0 0

python regex • 1.6k views

ADD COMMENT • link updated 7.3 years ago by WouterDeCoster 47k • written 7.3 years ago by prathikkv_1992 ▴ 10

0

It would help if you formatted the content carefully using the "101010" button.

ADD REPLY • link 7.3 years ago by shenwei356 8.4k

0

I formatted the text block for readability. It's always advisable (to make everything clear) to also show an example of what you aim to obtain.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2017-01-13

0

7.3 years ago

newbiebio ▴ 80

I tried a very stupid way, but it works. If you have Notepad++, just press "Ctrl+H" to open the Replace dialog and set the search mode to "Regular expression". In 'Find what', type in

(\w+)-(\w+)-(\w+) (\d+) (\d+) (\w+) ([+-]) (\d) (\d) (\d) (\d)

In 'Replace with', type in

\1-\2

Hope it helps..

0

That could definitely work. However, I would argue that "manual file tampering" is to be avoided, mainly because you can easily make mistakes and it's rather hard to reproduce the results later on.
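
For a reproducible version of the same substitution, here is a minimal sketch using Python's re.sub with the pattern from the answer above. The function name strip_columns is only illustrative, and the two sample lines are copied from y.txt in the question rather than read from the file.

```python
import re

# Same pattern as in the Notepad++ answer above: three dash-separated tokens
# followed by the eight space-separated columns.
PATTERN = re.compile(
    r"(\w+)-(\w+)-(\w+) (\d+) (\d+) (\w+) ([+-]) (\d) (\d) (\d) (\d)"
)


def strip_columns(line):
    """Replace the matched part of a line with its first two groups."""
    return PATTERN.sub(r"\1-\2", line)


# Two lines copied from y.txt in the question.
lines = [
    "1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0",
    "1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0",
]
for line in lines:
    print(strip_columns(line))
```

Because \w does not match ":", the chromosome prefix stays outside the match and is left in place, so both the simple SNP and the indel line come out as full SNP names.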

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 0 · Answer 2 · 2017-01-13

You've tagged Python in your question, so I'm going to provide that solution as well. What you're looking for is a regex "capture group", which is discussed in this Stack Overflow answer. If you know where the SNP name sits on each line (here it is the first thing on the line), you can pull out just that part. Parentheses mark the part of the pattern to capture.

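import re
pattern = re.compile(r"(.+)-[0-9]_")  # capture everything before the "-<digit>_" suffix
with open("y.txt") as file:
    for line in file:
        match = pattern.match(line)
        if match:  # skip the header and blank lines
            print(match.group(1))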
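
If you want the match to be stricter, one option is to spell out the whole suffix from the question. The sketch below is only an illustration: the names ILM_NAME and snp_name are made up here, and it assumes each numeric part of the suffix may be more than one digit long, as in the sample data.

```python
import re

# Suffix described in the question: -[0-9]_[A-Z]_[A-Z]_[0-9], with the
# assumption that each numeric part may span several digits.
ILM_NAME = re.compile(r"^(?P<snp>\S+)-[0-9]+_[A-Z]_[A-Z]_[0-9]+(?=\s|$)")


def snp_name(line):
    """Return the SNP name at the start of a y.txt line, or None."""
    match = ILM_NAME.match(line)
    return match.group("snp") if match else None


# Header plus two data lines copied from y.txt in the question.
sample = [
    "ilm_name",
    "1:100154376-GA-0_B_R_2299204383 1 100154376 G + 0 0 0 0",
    "1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0",
]
for line in sample:
    print(snp_name(line))
```

Lines that do not follow the pattern, such as the header, return None instead of raising an error, so they are easy to filter out.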

score 0 · Answer 3 · 2017-01-13

0

7.3 years ago

WouterDeCoster 47k

You tagged Python, but if I understood the question correctly, a simple grep -f piped to cut -f1 would also do the trick:

grep -f x.txt y.txt | cut -f1 -d' '
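
If you would rather compare the extracted SNP names against x.txt exactly, a rough Python equivalent could look like the sketch below. The helper name matching_snps and the SUFFIX pattern are assumptions made for illustration, and the data is given inline instead of being read from x.txt and y.txt.

```python
import re

# Assumed shape of the trailing part of an ilm_name, based on the question.
SUFFIX = re.compile(r"-[0-9]+_[A-Z]_[A-Z]_[0-9]+$")


def matching_snps(snp_names, ilm_lines):
    """Yield SNP names from ilm_lines that are also listed in snp_names."""
    wanted = set(snp_names)
    for line in ilm_lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        snp = SUFFIX.sub("", fields[0])  # drop the suffix from the first column
        if snp in wanted:
            yield snp


# Sample values copied from x.txt and y.txt in the question.
x_names = ["1:10002775-GA", "1:100316615-CAG-C"]
y_lines = [
    "1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0",
    "1:100182985-CA-0_T_F_2299204412 1 100182985 C + 0 0 0 0",
    "1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0",
]
print(list(matching_snps(x_names, y_lines)))
```

Using a set lookup on the stripped name avoids partial matches, which substring matching with grep -f can produce when one SNP name is a prefix of another.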