Loop over lines group by group in a text file and process each group separately
9.8 years ago

Hello, everyone! I have a file containing a huge number of lines, and each line has several fields. I sorted the lines by the first field, so lines with the same first field are grouped together; the number of lines in each group varies. I would like to loop over the whole file and process each group of lines separately, but I do not know how to loop over the file so that I can treat the lines group by group. Could anyone give me some suggestions? Thanks!

Here is an example:

seq_1  chr1 12
seq_1  chr2 34
.
.
seq_1  chr3 57
seq_3  chr1 34
seq_3  chr1 26
.
.
seq_3  chr4 47
seq_4  chr9 78
seq_5  chr8 90
seq_5  chr7 77

I want to process each group (seq_1, seq_2, seq_3, seq_4, seq_5, ...) separately.

RNA-Seq • 5.5k views
9.8 years ago

Others have said you can do this with programming logic, and Pierre has shown how GNU Parallel can be used to process the blocks. I'd like to expand on the idea of using GNU Parallel and avoid reading/writing the blocks to intermediary files. This way you can process your blocks in parallel without extra disk I/O for an already large file.

Let's create your mock input file:

$ echo -e "seq_1\tchr1\t12
seq_1\tchr2\t34
seq_1\tchr3\t57
seq_3\tchr1\t34
seq_3\tchr1\t26
seq_3\tchr4\t47
seq_4\tchr9\t78
seq_5\tchr8\t90
seq_5\tchr7\t77" > input.txt

Now let's insert a record separator at the start of each block, where a new block begins whenever the first column changes its value. We'll use awk to do this, with "----" as the record separator since we don't expect to find it within our file:

$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt
----
seq_1    chr1    12
seq_1    chr2    34
seq_1    chr3    57
----
seq_3    chr1    34
seq_3    chr1    26
seq_3    chr4    47
----
seq_4    chr9    78
----
seq_5    chr8    90
seq_5    chr7    77

OK, we can now use GNU Parallel to process our blocks, using ---- as the record start. The separator line has to be removed from each block before it is processed; we'll use --remove-rec-sep to achieve this:

$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt \
  | parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --remove-rec-sep cat
seq_1    chr1    12
seq_1    chr2    34
seq_1    chr3    57
seq_3    chr1    34
seq_3    chr1    26
seq_3    chr4    47
seq_4    chr9    78
seq_5    chr8    90
seq_5    chr7    77

Now we can put in the my_cmd you want to run on each block, using something like:

$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt \
  | parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --remove-rec-sep my_cmd

To test that GNU Parallel is processing each block separately, use awk to prefix each line with the number of lines encountered by each job:

$ awk '$1 != previous{print "----"}{previous=$1}1' input.txt \
  | parallel --gnu --keep-order --spreadstdin -N 1 --recstart '----\n' --remove-rec-sep 'awk "{print FNR,\$0}"'
1 seq_1    chr1    12
2 seq_1    chr2    34
3 seq_1    chr3    57
1 seq_3    chr1    34
2 seq_3    chr1    26
3 seq_3    chr4    47
1 seq_4    chr9    78
1 seq_5    chr8    90
2 seq_5    chr7    77

Some advice on using pipelines within GNU Parallel:

  • A GNU Parallel job that runs a pipeline can use more than 1 core, because each command in the pipeline is a separate process (for example, 1 core for sed and 1 core for my_cmd). Reduce the number of jobs run in parallel to account for this.
  • Using pipelines with single and double quotes can be a nightmare - consider moving the pipeline into a single function which you call from GNU Parallel.
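
As a rough sketch of that second point, the per-block pipeline can live in a bash function that is exported so the shells started by GNU Parallel can see it. The names `process_block` and `mark_blocks` and the body of `process_block` (numbering the lines of a block) are made up for illustration. `export -f` is a bash feature, so this assumes bash is the shell GNU Parallel runs jobs with; `--pipe` is the current name of `--spreadstdin`.

```shell
#!/usr/bin/env bash
# Hypothetical per-block worker: reads one block on stdin and numbers its lines.
# Replace the body with the real pipeline.
process_block() {
  awk '{ print FNR, $0 }'
}
export -f process_block   # make the function visible to child bash processes

# Insert "----" before each block, as in the answer above.
mark_blocks() {
  awk '$1 != previous { print "----" } { previous = $1 } 1' "$@"
}

# Only attempt the run if a parallel command is installed and input.txt exists.
if command -v parallel >/dev/null 2>&1 && [ -f input.txt ]; then
  mark_blocks input.txt \
    | parallel --keep-order --pipe -N 1 --recstart '----\n' --remove-rec-sep process_block
fi
```

With the quoting confined to the function body, the GNU Parallel command line itself no longer needs nested quotes.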

Instead of sed -e "1d" you can use --rrs (--remove-rec-sep).

Nathan's observation on cores is not quite true. It is true that they will be run as separate processes, but unless they use exactly the same amount of compute time, they will not use a full core each. So the best advice is to measure. E.g. try with -j100% and -j50% and see which is faster.


Thanks, Ole, that's great info!! I've updated my answer to use `--remove-rec-sep` instead of the separate `sed` command. Note that the doc describes `--remove-rec-sep`, `--rrs` and `--removerecsep`, but there is no mention of `--remove-record-separators`.


This is a great explanation! I think it's a pattern that would be nice to encapsulate somehow (though I guess copy-pasting that awk isn't too bad).

9.8 years ago

Use awk with output redirection to write each group to its own file, then invoke a command on each file:

awk '{ print $0 > ($1 ".myext") }' input.txt && for F in *.myext; do mycmd "$F"; done

or use xargs

awk '{ print $0 > ($1 ".myext") }' input.txt && printf '%s\n' *.myext | xargs -n 1 mycmd

or use GNU parallel

awk '{ print $0 > ($1 ".myext") }' input.txt && parallel mycmd ::: *.myext

Oops, typo: changed $1."myext" to $1".myext"

9.8 years ago
iraun 6.2k

Using Perl, you can read the file line by line with a while loop and process the lines later:

1) Open the file --> perl "open" function
2) Read the lines of the opened file one by one using: while (<YOUROPENEDFILE>) {}
3) Save the fields of each line in an array: @F = split /\t/;  (assuming that columns are separated by tabs)
4) Get the first element of the array (sequence name/ID): my $seq = $F[0]
5) Go to the next line of the file and compare $F[0] of the previous line with the new $F[0]. If they are equal, "do something", since both lines belong to the same group.
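
The steps above might be sketched as follows. This uses awk rather than Perl, to stay with the shell tools used elsewhere in this thread. The function name `summarize_groups` and the "do something" (count the lines and sum the third column of each group) are placeholders chosen for illustration, and the input is assumed to be already sorted on the first field.

```shell
summarize_groups() {
  awk '
    # Print the summary for the group that just ended (skipped before the first group)
    function emit() { if (n > 0) print key, n, sum }
    # First field differs from the previous line: close the old group, start a new one
    $1 != key { emit(); key = $1; n = 0; sum = 0 }
    # Same group as the previous line: keep accumulating
    { n++; sum += $3 }
    # The last group is not followed by a key change, so emit it at end of input
    END { emit() }
  ' "$@"
}
```

Run as `summarize_groups input.txt` on the nine-line mock file from the GNU Parallel answer, this prints `seq_1 3 103`, `seq_3 3 107`, `seq_4 1 78` and `seq_5 2 167`, one group per line.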


I think you should try to write a little Perl script, step by step, checking each step. If you get stuck, you can then ask a more specific question instead of "make my script". Also, I would suggest Stack Overflow as the best place to ask questions about programming.

Cheers,

9.8 years ago
Marc Perry ▴ 50

One solution is to build a data structure as you iterate over the incoming lines of data. Then, after you finish reading the file you iterate over the data structure and perform the processing action on the stored lines.
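
A minimal sketch of this first approach, assuming the whole file fits in memory. It is written with awk associative arrays; the function name `collect_groups` and the output format (group name followed by its third-column values) are illustrative choices, not something prescribed by this answer.

```shell
collect_groups() {
  awk '
    # Remember the order in which groups first appear
    !($1 in values) { order[++ngroups] = $1 }
    # Append the third field to the list stored for this group
    {
      if ($1 in values) values[$1] = values[$1] "," $3
      else values[$1] = $3
    }
    END {
      # After reading everything, walk the stored structure group by group
      for (i = 1; i <= ngroups; i++) print order[i] ": " values[order[i]]
    }
  ' "$@"
}
```

Because everything is kept until the `END` block, this version does not even need sorted input, at the cost of holding the data in RAM.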

You mention the file is large. This approach may fail if the data structure uses up all your RAM (or the program may start thrashing, swapping to the hard disk and slowing to a crawl). In that case, since the file is sorted, you can store the incoming lines temporarily and use a conditional test to trigger processing of the current batch; at that point you print the results, flush the cache, and start accumulating the next batch of lines. This is similar to treating each cluster as a separate, multi-line record, as if you had interpolated a line containing a record separator symbol at the end of each block (I often use '%%' on a line by itself as an RS after reading about it in Dave Cross' excellent book, "Data Munging with Perl" (Manning, 2000)).
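
The batch-and-flush variant could look like the sketch below, written as a plain bash loop so that only one group is held in memory at a time. `process_batch` is a hypothetical stand-in for the real processing step (here it just reports the group name and its line count), `process_sorted_file` is likewise an invented name, and the input is assumed to be sorted on the first field.

```shell
#!/usr/bin/env bash
# Hypothetical processing step: receives the group name, then the group's lines.
process_batch() {
  local name=$1
  shift
  echo "$name: $# line(s)"
}

process_sorted_file() {
  local line key current="" batch=()
  while IFS= read -r line || [ -n "$line" ]; do
    key=${line%%[[:space:]]*}                    # first whitespace-delimited field
    if [ "$key" != "$current" ] && [ "${#batch[@]}" -gt 0 ]; then
      process_batch "$current" "${batch[@]}"     # key changed: process the batch
      batch=()                                   # flush the cache
    fi
    current=$key
    batch+=("$line")
  done < "${1:-/dev/stdin}"
  # The final batch is not followed by a key change, so process it here
  if [ "${#batch[@]}" -gt 0 ]; then
    process_batch "$current" "${batch[@]}"
  fi
}
```

A pure bash loop is slow on very large files; the same control flow carries over directly to awk or Perl if speed matters.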

9.8 years ago
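# List the distinct group names (first field); the file is already sorted
awk '{print $1}' your_file | uniq > my_groups
# Run the command once per group name and collect the output
while IFS= read -r group
do
 your_command "$group" >> new_file
done < my_groups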
