How to extract snp names?
7.3 years ago

Hi,

I am working on some programming projects for a biology lab. I need help extracting SNP names from a file that contains SNP names along with some quality information. One file, x.txt, has snp_name, but the other file, y.txt, has ilm_name. In y.txt, ilm_name follows the pattern [SnpName]-[0-9]_[A-Z]_[A-Z]_[0-9]. How do I separate out only the SNP name from this pattern?

x.txt:

snp_name
1:10002775-GA
1:100152282-CT
1:100154376-GA
1:100154844-CA
1:100155035-AC
1:100155084-CT
1:100316615-CAG-C
1:10032154-AC
1:100336041-TAGAC-T
1:100340360-CT

y.txt:

ilm_name

1:10002775-GA-0_T_F_2299176856 1 10002775 G + 0 0 0 0
1:100152282-CT-0_T_R_2299204377 1 100152282 G - 0 0 0 0
1:100154376-GA-0_B_R_2299204383 1 100154376 G + 0 0 0 0
1:100154844-CA-0_B_R_2299204393 1 100154844 C + 0 0 0 0
1:100155035-AC-0_T_F_2299204394 1 100155035 A + 0 0 0 0
1:100155084-CT-0_B_F_2299204396 1 100155084 G - 0 0 0 0
1:100182985-CA-0_T_F_2299204412 1 100182985 C + 0 0 0 0
1:100183042-AG-0_B_R_2299204415 1 100183042 A + 0 0 0 0
1:100316615-CAG-C-0_P_F_2304230872 1 100316615 AG + 0 0 0 0
1:10032154-AC-0_B_R_2299176859 1 10032154 A + 0 0 0 0
python regex • 1.6k views

It would help if you formatted the content carefully using the "101010" button.


I formatted the text block for readability. It's always advisable (to make everything clear) to also show an example of what you aim to obtain.

7.3 years ago
newbiebio ▴ 80

I tried a very stupid way, but it works. If you have Notepad++, just press "Ctrl+H" to open the Replace dialog and set the search mode to "Regular expression". In 'Find what', type in

(\w+)-(\w+)-(\w+) (\d+) (\d+) (\w+) ([+-]) (\d) (\d) (\d) (\d)

In 'Replace with', type in

\1-\2

Hope it helps..


That could definitely work. However, I would argue that "manual file tampering" is to be avoided, mainly because you can easily make mistakes and it's rather hard to reproduce the results later on.
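
For a reproducible version of the same find/replace, the pattern and replacement from the Notepad++ answer can be applied with Python's re.sub. This is only a sketch: the function names and the output file name y_snp_names.txt are made up for illustration, and it assumes y.txt is space-separated exactly as in the sample shown in the question.

```python
import re

# Same pattern and replacement as in the Notepad++ answer.
PATTERN = re.compile(
    r"(\w+)-(\w+)-(\w+) (\d+) (\d+) (\w+) ([+-]) (\d) (\d) (\d) (\d)"
)


def strip_suffix(line):
    """Apply the find/replace to one line; lines that do not match pass through."""
    return PATTERN.sub(r"\1-\2", line)


def convert(in_path="y.txt", out_path="y_snp_names.txt"):
    # Hypothetical output name; the input file itself is never modified.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_suffix(line))
```

Because the script writes to a new file, the original y.txt stays untouched and the conversion can be rerun at any time.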

7.3 years ago
Steven Lakin ★ 1.8k

You've tagged Python in your question, so I'm going to provide that solution as well. What you're looking for is called a regex "capture group" and is discussed in this Stack Overflow answer. If you know that the SNP name is the first or second occurrence in each line of the file, you can pull out that occurrence. Parentheses are used to specify what to capture within the regex.

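import re

# y.txt is the file from the question
with open("y.txt") as file:
    for line in file:
        # Capture everything before the "-<digit>_" part of the name
        result = re.findall("(.+)-[0-9]_", line)
        if result:  # skip the header and blank lines
            print(result[0])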
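
A stricter variant, in case it is useful: the sketch below anchors the whole suffix described in the question (-[0-9]_[A-Z]_[A-Z]_[0-9]) and only looks at the first whitespace-separated field of each line. The function names are invented for this example, and it assumes the trailing numeric ID has one or more digits.

```python
import re

# Suffix as described in the question: -[0-9]_[A-Z]_[A-Z]_[0-9]
# Assumption: the trailing numeric ID is one or more digits.
ILM_NAME = re.compile(r"^(?P<snp>\S+)-[0-9]_[A-Z]_[A-Z]_[0-9]+$")


def snp_name(ilm_name):
    """Return the SNP part of an ilm_name, or None if it does not fit the pattern."""
    match = ILM_NAME.match(ilm_name)
    return match.group("snp") if match else None


def snp_names(lines):
    """Yield SNP names from lines whose first field fits the pattern."""
    for line in lines:
        fields = line.split()
        if fields:
            name = snp_name(fields[0])
            if name is not None:
                yield name
```

Calling snp_names(open("y.txt")) would then yield one SNP name per data row, while the header and blank lines are skipped because they do not match.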
7.3 years ago

You tagged python, but if I understood the question correctly a simple grep -f piped to a cut -f1 would also do the trick:

grep -f x.txt y.txt | cut -f1 -d' '
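
Since the question is tagged python, a rough Python counterpart of this filtering idea is sketched below. It differs from grep in that it compares whole SNP names exactly instead of matching substrings. The function names are hypothetical, and the suffix regex is an assumption based on the pattern given in the question.

```python
import re

# Assumed suffix, following the pattern given in the question.
SUFFIX = re.compile(r"-[0-9]_[A-Z]_[A-Z]_[0-9]+$")


def load_snp_names(lines):
    """Collect SNP names from x.txt-style lines, skipping the header and blanks."""
    names = {line.strip() for line in lines}
    names.discard("")
    names.discard("snp_name")
    return names


def matching_ilm_names(y_lines, wanted):
    """Yield (snp_name, ilm_name) for rows of y.txt whose SNP name is in `wanted`."""
    for line in y_lines:
        fields = line.split()
        if not fields:
            continue
        # Strip the suffix from the first field to recover the SNP name.
        snp = SUFFIX.sub("", fields[0])
        if snp in wanted:
            yield snp, fields[0]
```

With both files open, matching_ilm_names(y_file, load_snp_names(x_file)) would give each SNP name from x.txt together with its full name in y.txt.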