Extract data using awk/sed and output to different files
3
1
Entering edit mode
5.8 years ago
Tao ▴ 460

Hi guys,

I have a specific problem about using awk or sed to split a big file to different files. The big file is like this format(3 columns):

C    SRR1_45/1    data...
U    SRR2_34/2    data...
U    SRR1_33/2    data...
C    SRR3_22/1    data...
....


I want to extract lines with SRR1 into SRR1.txt, lines with SRR2 into SRR2.txt, ..., lines with SRRn into SRRn.txt, and the output lines should have the 'SRRi_' prefix removed. But we don't know how many values of n there are.

e.g. SRR1.txt will contain:
C    45/1    data...
U    33/2    data...


I know it's easy to write a Python or Perl script to do it, but is there a shell way to do it, taking advantage of awk or sed? Let me add some details: I have 10 such big files to process, and each has more than 1000M lines, so I need an efficient approach. The n values are arbitrary, not a sequential range.

Thanks! Tao

awk sed shell • 7.9k views
4
Entering edit mode
5.8 years ago

Here is a simple way to do it without sorting and with awk:

$ awk '{ split($2, a, "_"); print $1"\t"a[2]"\t"$3 >> a[1]".txt"; }' foo.txt


The file foo.txt is a three-column tab-delimited text file containing your data.

The >> operator appends to (or creates) each SRR*.txt file. Therefore, if you re-run this one-liner, first delete any previously-made SRR*.txt files, or you will get duplicate lines.

This should be pretty fast, as you're not sorting on IDs. It would probably be faster to use a Perl-based approach that opens a pool of file handles, but this should work fine.
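One caveat: if there are many distinct SRR IDs, awk can run out of open file descriptors. A minimal sketch (the sample foo.txt contents are made up) that closes each handle after writing, trading some speed for safety:

```shell
# Toy three-column, tab-delimited input.
printf 'C\tSRR1_45/1\tdataA\nU\tSRR2_34/2\tdataB\nU\tSRR1_33/2\tdataC\n' > foo.txt

# close() releases the handle after each write, so the number of distinct
# SRR IDs is no longer limited by the per-process open-file limit.
awk '{ split($2, a, "_"); f = a[1] ".txt";
       print $1 "\t" a[2] "\t" $3 >> f; close(f); }' foo.txt
```

Closing and reopening on every line is slower; if the input happens to be grouped by ID, closing only when the ID changes would be a middle ground.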

Further, if you don't care about the order of lines in the split files, you could use GNU Parallel with this one-liner to split multiple files foo1.txt, foo2.txt, etc. simultaneously. Doing the work in parallel may hit a file I/O bottleneck but could give you an overall speed boost, if you use SSDs or other fast storage.
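As a sketch of that idea (the file names foo1.txt/foo2.txt are illustrative, and xargs -P stands in for parallel here): each job writes into its own directory, because two jobs appending to the same SRR*.txt concurrently could interleave lines.

```shell
# Toy inputs.
printf 'C\tSRR1_45/1\ta1\nU\tSRR2_34/2\ta2\n' > foo1.txt
printf 'U\tSRR1_33/2\tb1\nC\tSRR3_22/1\tb2\n' > foo2.txt

# One job per input file; each writes into its own *_split directory so
# no two jobs ever append to the same output file at once.
printf '%s\n' foo1.txt foo2.txt | xargs -P 2 -I{} sh -c '
  out="${1%.txt}_split"; mkdir -p "$out"
  awk -v d="$out" "{ split(\$2, a, \"_\");
                     print \$1 \"\t\" a[2] \"\t\" \$3 >> (d \"/\" a[1] \".txt\") }" "$1"
' _ {}

# Merge the per-input pieces into the final SRR*.txt files.
for f in foo*_split/*.txt; do cat "$f" >> "$(basename "$f")"; done
```

With GNU Parallel the dispatch line would be `parallel ... ::: foo*.txt` instead of the xargs pipeline; the per-job isolation and final merge stay the same.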

0
Entering edit mode

Thanks Alex! Your answer is amazing, especially the parallel approach you introduced me to. Thank you so much! Best, Tao

2
Entering edit mode
5.8 years ago
kloetzl ★ 1.1k

for ((i=0;i<10;i++)); do grep "SRR${i}_" data | sed "s/SRR.*_//" > SRR${i}.txt; done 

Increase the limit as necessary. I leave it to you to delete the empty files.
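For completeness, one way to drop the empty files afterwards (a sketch; the sample data file is made up, and `find -empty -delete` is a GNU/BSD extension, not POSIX):

```shell
# Toy input containing only IDs 1 and 3, so the other SRRi.txt files end up empty.
printf 'C\tSRR1_45/1\tdataA\nC\tSRR3_22/1\tdataB\n' > data

# Same loop as above, written POSIX-portably.
for i in 0 1 2 3 4 5 6 7 8 9; do
  grep "SRR${i}_" data | sed "s/SRR.*_//" > SRR${i}.txt
done

# Remove the empty outputs.
find . -maxdepth 1 -name 'SRR*.txt' -empty -delete
```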

0
Entering edit mode

great! I think it's an easier one-step approach if we know how many values of n there are.

0
Entering edit mode

Hi kloetzl, thank you for your reply. Your answer would be awesome if i came from a sequential range, but unfortunately i represents a unique ID which is arbitrary. Sorry for the missing information. Best, Tao

0
Entering edit mode

Well, in that case, just read all possible values of i first.

#!/bin/sh
A=$(grep -o 'SRR.*_' data | sort | uniq | tr -cd '0-9\n')
for i in $A; do grep "SRR${i}_" data | sed "s/SRR.*_//" > SRR${i}.txt; done

1
Entering edit mode

Thank you for following up. Will it be efficient for handling big files with more than 1000M lines?

1
Entering edit mode

The sorting may take a while, but I don't see an easy way around that at the moment, without using a "real" programming language. Maybe there is a way in awk to extract the SRRi part and then print $0 > SRR$i.txt, but I am too tired to think about that just now.

0
Entering edit mode

Thanks kloetzl! @Alex Reynolds just introduced me to an efficient parallel way. It will be great if you are also interested. Tao.

1
Entering edit mode
5.8 years ago

I'm also a beginner in GNU tools, so please excuse me if this seems dumb to you. First of all, what is the delimiter of the big file? The following commands assume TAB as the delimiter:

1. Create a list of all the SRRn prefixes:

cut -f 2 input_name | cut -d"_" -f 1 | sort | uniq > list.txt

2. Use a while loop to get everything you need:

while read -r f1; do grep "$f1" input_name | sed "s/${f1}_//g" > "$f1.txt"; done < list.txt

you should be able to get what you need. good luck. :)

Charlie
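The two-step approach above, run end to end on a toy file (the file name input_name and its contents are illustrative; note that grepping for "${f1}_" with the trailing underscore keeps SRR1 from also matching SRR10, SRR11, ...):

```shell
# Toy tab-delimited input.
printf 'C\tSRR1_45/1\tdataA\nU\tSRR2_34/2\tdataB\nU\tSRR1_33/2\tdataC\n' > input_name

# 1. List the distinct SRRn prefixes.
cut -f 2 input_name | cut -d"_" -f 1 | sort | uniq > list.txt

# 2. One grep+sed pass per prefix; the trailing "_" in the pattern
#    prevents SRR1 from also matching SRR10 lines.
while read -r f1; do
  grep "${f1}_" input_name | sed "s/${f1}_//g" > "$f1.txt"
done < list.txt
```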

0
Entering edit mode

Hi Charlie, thank you for your reply. Your answer is great and viable, but the file is very big, about 50G with more than 1000M lines, so I think it's not very efficient to cut and sort first. And I have about 10 such big files to process. Your answer is still perfect for small files, though. Thanks. Tao.

0
Entering edit mode

sure, I'll be interested to know the most efficient way of doing this as well. Charlie

0
Entering edit mode

Hi Charlie, @Alex Reynolds gave me an excellent solution. It will be very efficient to use the parallel way. Tao.