Improving efficiency of awk for large files
2
7.5 years ago
AW ▴ 350

Hello,

I have two files that I want to merge when the first two columns (scaffold, base position) are identical

example, file 1

scaffold_1    31    1    2    20    0
scaffold_1    32    19    1    0    0

example file 2

scaffold_1    31    0    0    0    20
scaffold_1    261    20    0    0    0

I can merge these files using

awk 'NR==FNR{a[$1,$2]="\t"$3"\t"$4"\t"$5"\t"$6;next} ($1,$2) in a {print $0, a[$1,$2]}' file1 file2

to get

scaffold_1    31    0    0    0    20    1    2    20    0

However, the files are ~15GB each and this awk command loads the first file into memory. It is therefore very slow! I would be grateful if anyone has any other suggestions for making this much quicker!

Thanks!

awk • 10k views
ADD COMMENT
0

Are both of these files already sorted on the first two columns? How badly do you need a solution in AWK?

ADD REPLY
1

Hi, Thanks for your quick reply!

The files are not sorted and there are rows in file1 that are not present in file2 and vice versa. I don't need to use awk, I just couldn't find another solution. If you have one I would love to hear it, thanks!

ADD REPLY
1
What you're looking for is called a database. Load the files into SQLite or MySQL and join them there. It will be very fast.
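As a starting point, here is a minimal sketch of the database approach using Python's built-in sqlite3 module. Table and column names are made up for illustration, and the example rows are inlined where you would really stream the 15 GB files in (and use an on-disk database rather than an in-memory one):

```python
import sqlite3

# Example rows from the question; for real data, stream them in from the
# two files instead of hard-coding, and connect to a file-backed database.
file1 = [("scaffold_1", 31, 1, 2, 20, 0), ("scaffold_1", 32, 19, 1, 0, 0)]
file2 = [("scaffold_1", 31, 0, 0, 0, 20), ("scaffold_1", 261, 20, 0, 0, 0)]

con = sqlite3.connect(":memory:")  # use a path like "merge.db" for 15 GB inputs
for name, rows in (("f1", file1), ("f2", file2)):
    con.execute(f"CREATE TABLE {name} (scaf TEXT, pos INT, c3 INT, c4 INT, c5 INT, c6 INT)")
    con.executemany(f"INSERT INTO {name} VALUES (?,?,?,?,?,?)", rows)
con.execute("CREATE INDEX idx1 ON f1 (scaf, pos)")  # the index is what makes the join fast

merged = con.execute(
    "SELECT f2.*, f1.c3, f1.c4, f1.c5, f1.c6 "
    "FROM f2 JOIN f1 ON f1.scaf = f2.scaf AND f1.pos = f2.pos"
).fetchall()
print(merged)  # [('scaffold_1', 31, 0, 0, 0, 20, 1, 2, 20, 0)]
```

Rows whose (scaffold, position) key appears in only one file are dropped by the inner join, matching the behaviour of the awk one-liner.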
ADD REPLY
1

Thanks!! Do you have any tips to get me started? I've downloaded SQLite but have no idea where to begin!

ADD REPLY
0

I mean, two 15 GB SQLite tables, each with an index on the first two columns, are going to be close to 50+ GB. MySQL/Postgres will be even bigger. Databases are a very simple and clean solution (which could help down the road), but it will be pretty slow too. One sec, I'll write some quick and dirty Python to do this...

ADD REPLY
1

That would be amazing thanks so much! I've tried for the last couple of days to write a python script but it was running for 7+ hours!

ADD REPLY
0

Here is a python script that doesn't need to sort:

I would be interested to know how it performs compared to the sort-based implementations. Use it like:

script.py -i file1 file2 ... fileN

You can also specify an --output directory, and a --delimiter if it's not tab

NOTE: It keeps the rows unique to each file - which you said should never happen. So the output of the demo is:

# Input file order:     file1      file2
scaffold_1      32      19      1       0       0
scaffold_1      31      1       2       20      0       0       0       0       20
scaffold_1      261     20      0       0       0

To change that to drop the shorter rows is trivial - but I would be inclined to leave it as it is, then test later that all rows have the same number of columns with something like:

awk '{print NF}' joinedOutput.txt | sort -u
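The linked script itself isn't shown here, but the core idea it describes - an in-RAM hash join keyed on the first two columns, applied per scaffold - can be sketched roughly as follows. The names and structure below are my own, not the original script's:

```python
# Rough sketch of a hash join keyed on (scaffold, position); the actual script
# additionally splits the inputs by scaffold first, to bound memory use.
def merge(lines1, lines2, sep="\t"):
    index = {}
    for line in lines1:  # load file 1 into a dict, like awk's a[$1,$2]
        f = line.rstrip("\n").split(sep)
        index[(f[0], f[1])] = f[2:]
    for line in lines2:  # stream file 2, emitting only shared keys
        f = line.rstrip("\n").split(sep)
        extra = index.get((f[0], f[1]))
        if extra is not None:
            yield sep.join(f + extra)

file1 = ["scaffold_1\t31\t1\t2\t20\t0", "scaffold_1\t32\t19\t1\t0\t0"]
file2 = ["scaffold_1\t31\t0\t0\t0\t20", "scaffold_1\t261\t20\t0\t0\t0"]
print(list(merge(file1, file2)))  # ['scaffold_1\t31\t0\t0\t0\t20\t1\t2\t20\t0']
```

No sorting is needed; the cost is the memory for the dict, which is why splitting by scaffold first matters at the 15 GB scale.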
ADD REPLY
0

Usually shell commands (implemented in C), including AWK, are quicker than scripts written in a scripting language like Python for simple problems. @AW, please paste a time comparison if you're interested.

ADD REPLY
0

I definitely will, thanks! Your suggestion has been working very well, but unfortunately the sort step is taking an extremely long time. I googled this and it seems I should expect the sort step to run for several hours. Do you have any alternatives for this?

ADD REPLY
1
ADD REPLY
0

Yes, I saw this but it didn't work for me. And when I ran sort --help there was no --parallel option. Any idea why that might be? Thanks so much for your help!

ADD REPLY
1

Update your sort.

ADD REPLY
0

Thanks I'll try that!

ADD REPLY
0

Ah, I'm not so sure, Zhilong. It really depends on who writes the script and what the problem is.
In this example, I'm very confident that the Python will be several times faster than the Unix pipeline, because there's no sorting required. The join is done in RAM, because the data will fit after being split into chromosomes. There is no way to do that by combining different Unix tools, so even though they are faster at what they do individually, they don't allow you to take shortcuts. Some might argue that they prevent you from even "seeing" the shortcuts, because you are limited by the tools in your toolbox.

ADD REPLY
0

Ha, I pasted the awk version too. Here the problem is the memory limit (similar to Python).

ADD REPLY
0

Amazing thank you very much for this!! I have started running it now and will let you know how it goes. There are many rows that are unique to each file and I'm not really interested in these so I may try to remove these later. Thanks again!

ADD REPLY
0

Oh whoops, I misread your comment about rows in file1 not in file2. I read that they are unsorted but that the first two columns always appear in both. If I'd known, I'd have added an extra line to not write those lines to the output. Soz!

ADD REPLY
1

No worries! I'll have a go at modifying the script but might ask for some advice if I get stuck! Thanks!

ADD REPLY
0

Ah it's OK, it's only a two-second tweak: http://paste.ofcode.org/qHkDxaa2M7KGc4ddZa9u57

This will now only keep the rows where every input file was combined. It will run quicker than before too, since there's less data to write out :P

But you'll need to keep an eye on how much RAM is being used. It should be around the size of all the chromosome 1s combined. With a structured numpy array or something it would be a lot smaller, but because that comes with a lot of nasty edge cases it would take me the whole day to write and debug, so this will have to do :P

ADD REPLY
0

Wow thanks that's great!

ADD REPLY
1

I ran into an error!

File "merge_profiles.py", line 26, in <module>
    chromosomeObj[chromosome] = open('./temp-'+chromosome, 'a')
IOError: [Errno 24] Too many open files: './temp-scaffold_3011'

Any ideas?

ADD REPLY
0

Ah, I'm guessing you're on a Mac? They have a really low limit for the maximum number of open files (256 by default), while many Linux PCs will have a default of 65536. I always do this with chromosomes, so I never go above 60-80 contigs, but if you have a lot of scaffolds I guess you will hit it. You can check with:

ulimit -n

and if it's low you can increase it like:

ulimit -n 65536

This change only lasts for the current terminal session and will have to be re-applied when you open a new terminal. You can set it system-wide, but I wouldn't. Macs don't like it when you change their internals.

Hope that helps mate :/ I feel bad that you're trying to do this simple thing, but you keep running up against these walls :P

On the plus side, having more than 256 scaffolds means that the in-memory join will be very unlikely to run out of RAM :)
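As an alternative to running ulimit in the shell, the limit can also be raised from inside the Python script itself via the standard-library resource module (Unix only). A sketch:

```python
import resource

# Bump the soft open-file limit up to the hard limit (Unix only); this is
# roughly what `ulimit -n` does, but applied from inside the script.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = hard if hard != resource.RLIM_INFINITY else 65536
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print("soft open-file limit now:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

Raising the soft limit up to the hard limit never requires root; going beyond the hard limit does, which is why the sketch caps at `hard`.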

ADD REPLY
2
7.5 years ago
Zhilong Jia ★ 2.1k

I believe a few shell commands can give you a quick solution.

Assuming tab is the separator in your files and there are no "#" characters in them.

Firstly, combine your first two columns like:

$ sed 's/\t/\#/' file1 > t1
$ sed 's/\t/\#/' file2 > t2

$ cat t1 t2
scaffold_1#31    1    2    20    0
scaffold_1#32    19    1    0    0
scaffold_1#31    0    0    0    20
scaffold_1#261    20    0    0    0

Secondly, sort them respectively

$ sort t1 > t1.sort
$ sort t2 > t2.sort

Then, join two files:

$ join t2.sort t1.sort > t3
$ cat t3
scaffold_1#31 0 0 0 20 1 2 20 0

Finally, substitute # with tab

$ sed -i 's/\#/\t/' t3
$ cat t3
scaffold_1 31 0 0 0 20 1 2 20 0

Update (awk version if your memory is large enough):

$ awk 'BEGIN{OFS=FS="\t"}NR==FNR{key=$1 "_" $2; $1=$2=""; data[key]=$0; next}{if ($1 "_" $2 in data) print ($0, data[$1 "_" $2])}' t1 t2 | sed 's/\t\t\t/\t/' > t3
ADD COMMENT
0

Thanks, I'm trying that now!

ADD REPLY
0

Updated.

ADD REPLY
0

Thanks! I still can't get it to work though

% sed 's/ /\#/' fem.text > t1
% sed 's/ /\#/' mal.text > t2
% cat t1 t2

scaffold_1    31    0    0    0    20
scaffold_1    261    20    0    0    0
scaffold_1    31    0    0    20    0
scaffold_1    32    19    1    0    0

% sort t1 > t1.sort
% sort t2 > t2.sort

% join t2.sort t1.sort > t3

% cat t3
scaffold_1 31 0 0 20 0 261 20 0 0 0
scaffold_1 31 0 0 20 0 31 0 0 0 20
scaffold_1 32 19 1 0 0 261 20 0 0 0
scaffold_1 32 19 1 0 0 31 0 0 0 20
ADD REPLY
1

That's due to the separator (it was space, now it's tab). Updated.

ADD REPLY
0

I think it may need a $ before it as well. It only ran on mine as sed $'s/\t/\#/' file1.

It is running quickly now!! Thanks very much!

ADD REPLY
0

"$" only indicates the following is a shell command.

ADD REPLY
0

ok thanks, good to know!

ADD REPLY
0

How does join achieve this? And why do the files need to be sorted?

ADD REPLY
0

Just run man join to see the join manual; it explains how join works and why sorting is necessary.
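For intuition on why join needs sorted input: once both files are sorted on the key, a single synchronized pass over them produces the join, holding only the current line of each file in memory. A rough Python sketch of that merge logic (illustrative only - not how GNU join is actually implemented, and it ignores duplicate keys):

```python
def merge_join(sorted1, sorted2):
    """One pass over two key-sorted lists of (key, rest_of_fields) pairs."""
    out, i, j = [], 0, 0
    while i < len(sorted1) and j < len(sorted2):
        k1, k2 = sorted1[i][0], sorted2[j][0]
        if k1 == k2:  # shared key: emit file2's fields then file1's, as above
            out.append((k1, sorted2[j][1] + sorted1[i][1]))
            i += 1
            j += 1
        elif k1 < k2:
            i += 1  # key only in file 1: skip it (join's default behaviour)
        else:
            j += 1  # key only in file 2: skip it
    return out

t1 = sorted([("scaffold_1#31", [1, 2, 20, 0]), ("scaffold_1#32", [19, 1, 0, 0])])
t2 = sorted([("scaffold_1#261", [20, 0, 0, 0]), ("scaffold_1#31", [0, 0, 0, 20])])
print(merge_join(t1, t2))  # [('scaffold_1#31', [0, 0, 0, 20, 1, 2, 20, 0])]
```

Because each pointer only ever moves forward, unsorted input would silently miss matches - which is exactly why join insists on sorted files.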

ADD REPLY
0

Yey I fixed it and it works, thanks!

Should be

sed $'s/\t/\#/' file1
ADD REPLY
2
7.5 years ago

A very fast and simple solution that doesn't appear to require sorting (in your example data) is to use awk to pre-process each line into a BED file, where pseudo-IDs are made from the third and subsequent fields. We can then use BEDOPS to map the merged files onto themselves, where there are overlaps. The resulting self-mapped elements can be post-processed back to the desired end format.

For instance, we can make one-base BED files from your inputs:

$ awk '{print $1"\t"$2"\t"($2 + 1)"\t"$3"_"$4"_"$5"_"$6}' setA.txt > setA.bed
$ awk '{print $1"\t"$2"\t"($2 + 1)"\t"$3"_"$4"_"$5"_"$6}' setB.txt > setB.bed

These are sorted elements, so we can union them directly:

$ bedops --everything setA.bed setB.bed > unionAB.bed

Then we map the set union against itself, with a bash substitution to strip the ID column from reference elements for convenience:

$ bedmap --echo --echo-map-id --faster --delim '\t' <(cut -f1-3 unionAB.bed) unionAB.bed | uniq > unionABSelfMapped.bed

Let's take a quick look at unionABSelfMapped.bed:

$ less unionABSelfMapped.bed
scaffold_1      31      32      1_2_20_0;0_0_0_20
scaffold_1      32      33      19_1_0_0
scaffold_1      261     262     20_0_0_0

We can post-process this self-mapped result into a format that matches your desired result, skipping un-merged elements and turning the ID field back into a series of tab-delimited strings:

$ awk '{ \
    fsn = split($4, fs, ";"); \
    if (fsn > 1) { \
        split(fs[1], fs1, "_"); \
        split(fs[2], fs2, "_"); \
        printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, fs1[1], fs1[2], fs1[3], fs1[4], fs2[1], fs2[2], fs2[3], fs2[4]); \
    } \
}' unionABSelfMapped.bed > answer.txt

The result:

$ less answer.txt
scaffold_1      31      1       2       20      0       0       0       0       20

The reason I suggest this approach is two-fold:

  1. Walking through files and processing them one line at a time requires very little memory and is relatively fast, especially at the multi-GB scale!
  2. Sorting is very expensive at that scale, because unless you have huge amounts of system memory to throw at sorting, you usually have to use temporary files - and disk I/O is very expensive! If your inputs are already ordered, you can pre-process them into sorted BED files very cheaply and start doing very fast and very low-memory BEDOPS operations immediately.

With further work, parts of this approach can be pipelined using standard input and output streams, which would reduce the amount of expensive disk I/O and make this process even faster!
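The line-at-a-time preprocessing step, for example, can read stdin and write stdout so it slots straight into a pipe. A minimal sketch of the awk-to-BED conversion above in that style (field layout assumed to match the example files):

```python
import io

def to_bed(stream, out):
    # "scaffold pos c3 c4 c5 c6" -> one-base BED with a packed pseudo-ID,
    # mirroring the awk preprocessing step above; streams line by line.
    for line in stream:
        f = line.split()
        if f:
            out.write("\t".join([f[0], f[1], str(int(f[1]) + 1), "_".join(f[2:])]) + "\n")

# Demo on an example row; in a real pipeline pass sys.stdin and sys.stdout.
buf = io.StringIO()
to_bed(io.StringIO("scaffold_1\t31\t1\t2\t20\t0\n"), buf)
print(buf.getvalue())  # scaffold_1	31	32	1_2_20_0
```

Streaming keeps memory flat no matter how large the inputs are, which is the whole point at 15 GB per file.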

ADD COMMENT
