I made a related post on Stack Overflow the other day, but I figured that people here might know of some bioinfo tools that do what I want.
So my data is in this kind of format (tab-separated, parsed from a SAM file):
k_12_exc_170 272 FR667691.1 2646 ACATATAAGGTT
k_12_exc_132 256 FR667691.1 8100 AACATGGCGAGA
k_14_exc_1547 16 FR667691.1 8669 ATTATAAAGAATTT
k_13_exc_6130 272 FR667691.1 10710 GAGAAAGAACCAT
k_13_exc_6130 272 FR667691.1 10729 GAGAAAGAACCAT
k_13_exc_8289 16 FR667691.1 10827 TATTTATTGATGG
k_14_exc_19763 272 FR667691.1 15800 TTTAAAATGATTAC
k_15_exc_9848 16 FR667691.1 15800 TTTAAAATGATTACT
k_13_exc_3692 16 FR667691.1 16119 TGGACGAATAAGT
k_14_exc_2426 272 FR667691.1 16119 TGGACGAATAAGTT
k_12_exc_3570 272 FR667691.1 16436 GAACCATTCGAG
What I want to do is print all lines as pairs where the value in column 4 of the (to-be) "right" line is at least 100 larger than the value in column 4 of the "left" line, but at the same time at most 200 larger. In other words, these lines define chromosomal regions, and I want to output every case where two lines satisfy that rule. I can start this task from the SAM file too. These files have ca. 100k lines each and there can be thousands of them. They are sorted on columns 3 and 4.

Currently I'm doing pretty much what was suggested on Stack Overflow, except that I first split each file into 1000-row chunks, prepending to each chunk the last 100 rows of the previous one (so that pairs spanning a split are not lost). While this is considerably faster than whatever I was doing before, I can't help thinking that there should be some simpler and faster way to do it, perhaps with bedtools or samtools or something like that?
With the example data, only the following row should be printed (I can deal with other output formats too, as long as this information is available):
k_13_exc_6130 272 FR667691.1 10710 GAGAAAGAACCAT k_13_exc_8289 16 FR667691.1 10827 TATTTATTGATGG
So in essence, I want to do a conditional join.
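To pin the rule down with the coordinates from the example above, the check for a single candidate pair (same column 3 value) is just the following; the helper name is only for illustration:

```c
#include <stdio.h>

/* Hypothetical helper: report a pair when the right line's column 4 value is
   at least 100 and at most 200 larger than the left line's. */
static int is_pair(long left_start, long right_start) {
    long d = right_start - left_start;
    return d >= 100 && d <= 200;
}

int main(void) {
    printf("%d\n", is_pair(10710, 10827));   /* 1: 117 apart -> reported       */
    printf("%d\n", is_pair(10729, 10827));   /* 0: only 98 apart -> not a pair */
    return 0;
}
```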
Wow. What an incredibly complex answer. Certainly must be efficient though. Perhaps assembly language would be even faster?
For me, this would be the perfect task for R.
If you're looking for absolute maximal efficiency, try the data.table package in R. fread is very, very fast. Personally, I'm used to working with traditional data.frames, so I pass the argument data.table=FALSE, even at a slight cost in efficiency. I use dplyr for the filtering, which is also very fast.
It is not complex at all. It just looks like that if you are not used to pointers. The efficiency is in the algorithm: I basically reuse a window of candidates bounded by `right_lower` and `right_upper` to quickly determine the pairs for the next iteration. As the input is sorted, the algorithm runs in time O(output).

Use the right tool for the right job. In this case I needed a programming language able to store structures (or objects, whatever) in an array. As AWK lacks that feature, I switched to the next best language at hand. This is probably a one-shot program with little reuse value, so the need for performance is actually small.
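To make that concrete, here is a self-contained sketch of the same window idea. It is my own reconstruction, not the code posted in the answer; the hard-coded limits, the buffer sizes, and the column 3 comparison (which the thread only adds later) are assumptions on my part:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOWER    100      /* right start must be at least this much larger */
#define UPPER    200      /* ... and at most this much larger              */
#define MAXLINES 150000
#define MAXLEN   256

struct rec { char text[MAXLEN]; char ref[64]; long pos; };

static struct rec r[MAXLINES];

int main(void) {
    size_t n = 0;
    char buf[MAXLEN];

    /* load the (already sorted) table; column 3 = reference, column 4 = start */
    while (n < MAXLINES && fgets(buf, sizeof buf, stdin)) {
        buf[strcspn(buf, "\n")] = '\0';
        if (sscanf(buf, "%*s %*s %63s %ld", r[n].ref, &r[n].pos) == 2) {
            strcpy(r[n].text, buf);
            n++;
        }
    }

    size_t lo = 0, hi = 0;                  /* current window of "right" candidates */
    for (size_t i = 0; i < n; i++) {
        if (lo < i + 1) lo = i + 1;         /* a right line always lies after the left one */
        /* slide the lower bound past rights that are still too close */
        while (lo < n && strcmp(r[lo].ref, r[i].ref) == 0 &&
               r[lo].pos < r[i].pos + LOWER)
            lo++;
        if (hi < lo) hi = lo;
        /* extend the upper bound while rights are still within range */
        while (hi < n && strcmp(r[hi].ref, r[i].ref) == 0 &&
               r[hi].pos <= r[i].pos + UPPER)
            hi++;
        for (size_t j = lo; j < hi; j++)    /* everything in [lo, hi) pairs with line i */
            printf("%s\t%s\n", r[i].text, r[j].text);
    }
    return 0;
}
```

Because the window bounds only ever move forward over the sorted input, the scan itself is linear and the total work is proportional to the output. With the example data above it prints exactly the one expected pair, with the two sides joined by a tab.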
I'm sure it's a good answer. You're right. It's just incomprehensible to me. I'll have to give points to the Perl answer below, but I will have to study C more to understand your code better.
The way the question is asked, it's very difficult to come up with an efficient algorithm anyway, regardless of the language used.
Hi, thanks a lot for your effort. Unfortunately, I really suck at C. I suppose if I move LOWER, UPPER and MAXLINES to main, I can pass their values in as args, like:
..
At least it compiled without problems. However, I still did not figure out how to pass a file to the program (or read from STDIN).
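Something along these lines might be what is meant; this is only a sketch with my own guess at the argument order, not the posted program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc != 5) {
        fprintf(stderr, "usage: %s LOWER UPPER MAXLINES FILE|-\n", argv[0]);
        return 1;
    }
    long lower    = atol(argv[1]);
    long upper    = atol(argv[2]);
    long maxlines = atol(argv[3]);

    /* "-" means read from stdin, anything else is opened as a file */
    FILE *in = (strcmp(argv[4], "-") == 0) ? stdin : fopen(argv[4], "r");
    if (in == NULL) {
        perror(argv[4]);
        return 1;
    }

    fprintf(stderr, "lower=%ld upper=%ld maxlines=%ld\n", lower, upper, maxlines);
    /* ... the actual parsing and pairing would go here ... */

    if (in != stdin)
        fclose(in);
    return 0;
}
```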
Use `-` as the third param to read from stdin. Updated version:

Compiled without problems, but it does not output anything with the test data. Perhaps it's because I'm on OS X right now, and Google tells me that the OS X fscanf() is not POSIX-2008 compliant. I also tried compiling with -std=c99. I will try it on CentOS tomorrow. Also, as these coordinates will never reach the hundreds of millions (data from prokaryotes only), is it really necessary to use long instead of int? Could such a thing affect performance noticeably? I know the definition of int in C is a little vague, but I would think that all modern systems use 4 bytes instead of 2 for int.
OS X is not the problem. Maybe your data is formatted slightly differently than line 24 assumes. Unfortunately, that is something only you can debug. If you add `fprintf(stderr,"%zu",fillptr-lines);` in line 27 and the output is NOT the number of lines in the input, it will point you to the problematic line (or some location near it).

No output. I'm almost certain it's because of OS X. BSD fscanf does not know about %m.
http://linux.die.net/man/3/fscanf POSIX 2008
vs
http://www.manpagez.com/man/3/fscanf/ ISO C99
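If %ms really is the culprit, one C99-portable fallback (sketched here with guessed buffer sizes, not taken from the posted code) is to read each word into a fixed buffer and copy it to the heap:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Plain C99 replacement for strdup, to avoid relying on POSIX. */
static char *dup_str(const char *s) {
    char *p = malloc(strlen(s) + 1);
    if (p) strcpy(p, s);
    return p;
}

/* Reads one whitespace-delimited word, mimicking what "%ms" would hand back.
   Returns a heap copy, or NULL at end of input. */
static char *read_word(FILE *f) {
    char buf[256];
    if (fscanf(f, "%255s", buf) != 1)
        return NULL;
    return dup_str(buf);
}

int main(void) {
    char *word;
    while ((word = read_word(stdin)) != NULL) {
        puts(word);
        free(word);
    }
    return 0;
}
```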
You could try adding `--std=gnu99`. Or link to a different libc. Though I guess it would be easier to just run it on Linux. Dammit, Apple. Y U NO %ms‽

So, no problems compiling on Linux.
I wonder though, why this works:
While this causes a segmentation fault during the run:
Swap the two parts of the condition. By `right_upper < fillptr` you ensure that right_upper still points to a valid region. Only if that is true can you follow `right_upper->baz`.
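In other words, the bounds check has to come first so that && short-circuits before the dereference. A minimal toy example (the field name is borrowed from above, everything else is made up):

```c
#include <stdio.h>

struct line { long baz; };

int main(void) {
    struct line lines[3] = { {10}, {20}, {30} };
    struct line *fillptr = lines + 3;      /* one past the last stored record */
    struct line *right_upper = lines;
    long limit = 100;                      /* large enough that the window reaches the end */

    /* Correct order: the bound is tested first, so right_upper->baz is never
       read once right_upper reaches fillptr. Reversing the two operands of &&
       dereferences one element past the end, which is the reported segfault. */
    while (right_upper < fillptr && right_upper->baz <= limit)
        right_upper++;

    printf("window covers %ld record(s)\n", (long)(right_upper - lines));
    return 0;
}
```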
My code above also contains that bug, but I guess I was just lucky. ☺

Thanks. I think my first attempt took ca. 40 min per file, while the Stack Overflow-inspired method dropped it to ca. 8 min. Now 1) splitting the input file on column 3 with awk, 2) processing and deleting all the split files, and 3) combining the output into a new file takes altogether ca. 6 seconds. I can definitely live with this. Also, implementing the column 3 check in your code will be nice C practice for me. You and Jorge Amigo both provided great answers, but I can only accept one. You answered first, so congrats :) Maybe later, when I have benchmarked all the approaches properly, I will update the first post.
Oh, and in case I accidentally delete the source or something:
AWK is great for so much, but sometimes it pays off to write 50 lines of C. ☺
So I have this problem with the code. When my input data includes 11mers and 12mers, the length of the output regions varies between 101 and 200 bp. When my input data includes 11mers, 12mers and 13mers, the length of the output regions varies between 100 and 200 bp. When the input data includes 11mers, 12mers, 13mers and 14mers, the length of the output regions varies between 99 and 200 bp. As far as I can tell, this rule always holds, so with 11-15mers I get 98-200 bp regions, with 11-16mers 97-200 bp, etc. I thought that perhaps this was somehow related to the strlen stuff, so I even added a 6th column to the input data, which is "col 4 + length col 5 - 1". Then I changed the code as follows:
The coordinates I get correspond perfectly to the k-mers that are supposed to be there; however, the length problem persists and I have absolutely no clue what is causing it.
I have trouble reproducing your problem: the data given in the first post has 5 columns, whereas your code expects a sixth (the end coordinate).
I just realized that it may be a sorting issue. If not, I will post some more example data..
Well, I could not fix it. I went back to five-column input. For example, with this data:
The expected output is (column 11 added to display the length):
However, in addition to those, I also get (length again added for clarity):
Edit: I fixed it by making the printf stuff conditional.