Question: Extract data using awk/sed and output to different files
Tao wrote:

Hi guys,

I have a specific problem: using awk or sed to split a big file into different files. The big file has this format (3 columns):

C    SRR1_45/1    data...
U    SRR2_34/2    data...
U    SRR1_33/2    data...
C    SRR3_22/1    data...
....

I want to extract lines with SRR1 into SRR1.txt, lines with SRR2 into SRR2.txt, ..., lines with SRRn into SRRn.txt. The output lines should have the 'SRRi_' prefix removed. But we don't know in advance how many values of n there are.

e.g. SRR1.txt will contain:
C    45/1    data...
U    33/2    data...

I know it's easy to write a Python or Perl script to do it, but is there a shell way, taking advantage of awk or sed? Let me add some details: I have 10 such big files to process, and each has more than 1000M lines, so I need an efficient way. The values of n are arbitrary IDs, not a sequential range.

Thanks! Tao

Tags: awk shell sed
Alex Reynolds wrote:

Here is a simple way to do it without sorting and with awk:

$ awk -F'\t' '{ split($2, a, "_"); print $1 "\t" a[2] "\t" $3 >> (a[1] ".txt"); }' foo.txt

The file foo.txt is a three-column tab-delimited text file containing your data.

Using the >> operator appends a line to whatever SRR*.txt file exists. Therefore, if you re-run this one-liner, you must first delete any previously-made SRR*.txt files, or you will get duplicate lines.
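
For example, something like this (a minimal sketch) clears the old outputs before a fresh run:

$ rm -f SRR*.txt    # remove SRR*.txt files left over from a previous run
$ awk -F'\t' '{ split($2, a, "_"); print $1 "\t" a[2] "\t" $3 >> (a[1] ".txt"); }' foo.txt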

This should be pretty fast, as you're not sorting on IDs. It would be faster, probably, to use a Perl-based approach that opens a pool of file handles, but this should work fine.

Further, if you don't care about the order of lines in the split files, you could use GNU Parallel with this one-liner to split multiple files foo1.txt, foo2.txt, etc. simultaneously. Doing the work in parallel may hit a file I/O bottleneck but could give you an overall speed boost, if you use SSDs or other fast storage.
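
One possible way to wire that up (a sketch only; split.awk, the out_* directories, and the foo*.txt names are illustrative): keep the awk program in a small script, give each input file its own output directory so the parallel jobs never append to the same SRR*.txt concurrently, and concatenate the pieces afterwards.

$ cat split.awk
{ split($2, a, "_"); print $1 "\t" a[2] "\t" $3 >> (dir "/" a[1] ".txt"); }

$ parallel 'mkdir -p out_{/.} && awk -F"\t" -v dir=out_{/.} -f split.awk {}' ::: foo*.txt
$ for f in out_*/SRR*.txt; do cat "$f" >> "$(basename "$f")"; done    # merge the per-input pieces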


Thanks Alex! Your answer is amazing, especially the parallel approach you introduced me to. Thank you so much! Best, Tao

written 21 months ago by Tao
kloetzl wrote:

for ((i = 0; i < 10; i++)); do grep "SRR${i}_" data | sed "s/SRR[0-9]*_//" > "SRR${i}.txt"; done

Increase the limit as necessary. I leave it to you to delete the empty files.
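
One way to clean up the empty outputs afterwards (assuming GNU or BSD find):

find . -maxdepth 1 -name 'SRR*.txt' -empty -delete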


Great! I think it's an easier, one-step approach if we know how many values of n there are.

written 21 months ago by air.chuan.1987

Hi kloetzl, thank you for your reply. Your answer would be great if i came from a sequential range, but unfortunately i is an arbitrary unique ID. Sorry for the missing information. Best, Tao

written 21 months ago by Tao

Well, in that case, just read all the possible values of i first.

#!/bin/sh
A=$(grep -o 'SRR[0-9]*_' data | sort -u | tr -cd '0-9\n')
for i in $A; do
    grep "SRR${i}_" data | sed "s/SRR[0-9]*_//" > "SRR${i}.txt"
done
written 21 months ago by kloetzl

Thank you for following up. Will it be efficient for big files with more than 1000M lines?

written 21 months ago by Tao

The sorting may take a while, but I don't see an easy way around that at the moment without using a "real" programming language. Maybe there is a way in awk to extract the SRRi part and then print $0 > "SRRi.txt", but I am too tired to think about that just now.

written 21 months ago by kloetzl
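
A sketch of that awk idea (essentially the same trick as the accepted answer above, assuming tab-delimited input):

awk 'BEGIN { FS = OFS = "\t" }
     { id = $2; sub(/_.*/, "", id)       # id becomes "SRR1", "SRR2", ...
       sub(/^SRR[0-9]+_/, "", $2)        # strip the "SRRi_" prefix from column 2
       print >> (id ".txt") }' data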

Thanks kloetzl! @Alex Reynolds just introduced me to an efficient parallel approach; take a look if you are interested too. Tao.

written 21 months ago by Tao
air.chuan.1987 wrote:

I'm also a beginner with GNU tools, so please excuse me if this seems obvious. First of all, what is the delimiter of the big file? The following code assumes TAB as the delimiter:

  1. Create a list of all the SRRn IDs:

    cut -f 2 input_name | cut -d"_" -f 1 | sort -u > list.txt

  2. Use a while loop to extract what you need:

    while read -r f1; do grep "${f1}_" input_name | sed "s/${f1}_//g" > "${f1}.txt"; done < list.txt

You should be able to get what you need. Good luck. :)

Charlie


Hi Charlie, thank you for your reply. Your answer is viable, but the file is very big, about 50G with more than 1000M lines, so I don't think it's very efficient to cut and sort first; I also have about 10 such big files to process. Your answer is still perfect for small files, though. Thanks. Tao.

written 21 months ago by Tao

Sure, I would be interested to know the most efficient way of doing this as well. Charlie

written 21 months ago by air.chuan.1987

Hi Charlie, @Alex Reynolds gave me an excellent solution; the parallel approach will be very efficient. Tao.

written 21 months ago by Tao