fetch particular columns from csv files
5
0
Entering edit mode
6.2 years ago

i have file of 10000 rows, consist some values like below.. i need to fetch the columns only consist of rs, value(0-9), but the problem is rs is not always in first column, column will be differ. so how solve this problem. please help me out

rs1 0.5(0.3-1.5) 0.08
rs2 1.2(1.5-5.0) 0.05
snp1 rs6 12 54 1.8(0.8-8.1) 0.02
ght gt rs7 1.2(1.0-2.8) 0.04

Thank you, Anitha

perl python • 1.8k views
1
Entering edit mode

what have you tried ?

0
Entering edit mode

I tried with this regex format to match column that has value(value), "\d.\d+[([]\d.\d+\–\d.\d+[)]]". From this query its fetching whole rows, but i need a columns that only match with my query.

0
Entering edit mode

You should check out grep's options. -o is what you need here. Also, please double check your regex.

1
Entering edit mode

To me it's not clear what you want to obtain, could you (based on your small example) post the desired result?

& (as Pierre asks) it is good practice to show what you tried and the effort you put into it when asking questions on online fora

0
Entering edit mode

You say

rs is not always in first column

and number of 'columns' in a row may not be the same. Then it is not a column data. The data format is not optimal. Try to change that if possible.

If the data can only be like that, checkout regular expression and use it.

3
Entering edit mode
6.2 years ago

Suppose your data is stored in a file called temp.txt Use grep to fetch the column with rs entries

grep -oP 'rs\d+' temp.txt


where

-P : using perl regex where \d corresponds to digits.
-o: returns only the matched string.


Thanks, Persistent LABS, SanGeniX

0
Entering edit mode

1
Entering edit mode

Thank you Ram :)

0
Entering edit mode

This answer does not in fact ensure that the values matched are the complete contents of a "column"; however, I expect that it _probably_ achieves the intent of the original poster, so, +1 from me.

1
Entering edit mode

Word/delimiter boundaries. That's what this needs.

1
Entering edit mode

Agreed. It is not a perfect solution, but it will solve what he expects :) Thanks

2
Entering edit mode
6.2 years ago
Ram 37k

+1 to Pierre and Wouter. I'd recommend grep coupled with word boundaries or separator-based boundaries in your case. You have delimited data that is not necessarily structured, and the one thing you have going for you is the consistent format of dbSNP rs numbers. I'd recommend cleaning the dataset if you wish to get anything more out of it.

1
Entering edit mode
6.2 years ago
Eric T. ★ 2.7k

If the values of interest are not all in the same column, you don't need a tool for extracting individual columns (e.g. the Unix command cut). Instead, you need to search every line for matching strings. You can use the Unix commands grep and sed for this, and/or awk if you're familiar with it.

0
Entering edit mode
6.2 years ago
natasha.sernova ★ 3.9k

csv-columns

or this one if you use Python:

0
Entering edit mode

awk might work in this case, but I never recommend it for parsing CSV to people. It is insensitive to quoted separators, which means you'll end up with an inconsistent number of fields across rows. The csv modules in python/R are the best options for CSV parsing in general.

0
Entering edit mode
6.0 years ago