GNU Parallel Block Issues
6.5 years ago
salamayg ▴ 20

Hello,

I am using GNU parallel to speed up my BLAST jobs. I have seen the example outlined in the following post (Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them) and used the command:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -db db.fa -query - > results

I am noticing that sequences are missing from the BLAST output (about 30 of 5000). When I run parallel and examine the blocks it generates, it seems to lose some FASTA records each time it creates a new block, as if the block were not breaking at the correct place. Does anyone have any idea why this is happening? Any help is appreciated.

Thank you.

blast parallel GNU parallel block • 2.6k views
6.5 years ago
ole.tange ★ 4.0k

To see if parallel is to blame, try using 'cat' instead of blastp:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe cat > results

If you get all the sequences, then GNU Parallel is not losing them.
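One way to check whether all the sequences came through is to count the header lines before and after. The following is only a minimal sketch under stated assumptions: the sample records, the file names in.fa and out.fa, and the small --block 1k value are made up for illustration, and the script falls back to a plain copy if GNU Parallel is not installed.

```shell
# Build a small FASTA file with 200 made-up records (illustrative data only).
tmp=$(mktemp -d)
i=1
while [ "$i" -le 200 ]; do
    printf '>r%d\nACGTACGTACGTACGTACGT\n' "$i"
    i=$((i + 1))
done > "$tmp/in.fa"

# Pass the file through GNU Parallel with cat as the command; -k keeps the block order.
# If GNU Parallel is not installed, fall back to a plain copy so the sketch still runs.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel -k --block 1k --recstart '>' --pipe cat < "$tmp/in.fa" > "$tmp/out.fa"
else
    cat "$tmp/in.fa" > "$tmp/out.fa"
fi

# Count FASTA headers (lines starting with '>') before and after.
orig_n=$(grep -c '^>' "$tmp/in.fa")
out_n=$(grep -c '^>' "$tmp/out.fa")
echo "records in: $orig_n, records out: $out_n"

if [ "$orig_n" -eq "$out_n" ]; then
    echo "no records lost"
else
    echo "record count changed"
fi
rm -rf "$tmp"
```

Counting headers rather than lines avoids being misled by sequences that wrap over several lines.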


Yes, when I do that, sequences are missing. If I run wc on the original input and on the parallel output, the count goes from 10000 to 9588. The parallel output starts with:

>CTTCACTAGCT

>r9

sequence

whereas in the original file, it started as:

>r1

sequence

>r2

sequence

etc.

I think there are other instances of missing sequences (wherever a new block is made), but they are hard to find without going through 10000 lines manually. Do you have any ideas how I could troubleshoot this? Parallel is extremely useful to me, but with this issue I cannot use it.
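One way to find the affected records without reading the file by hand is to compare the header lines of the two files. This is a rough sketch, not a definitive method: the files orig.fa and out.fa and their contents are made-up stand-ins, and it assumes every header in the original file is unique.

```shell
export LC_ALL=C   # byte-wise ordering so that sort and comm agree
tmp=$(mktemp -d)

# Made-up stand-ins for the original file and the parallel output.
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n' > "$tmp/orig.fa"
printf '>r1\nACGT\n>CE_1\nTTAA\n>r4\nCCGG\n' > "$tmp/out.fa"

# Extract and sort the header lines of each file.
grep '^>' "$tmp/orig.fa" | sort > "$tmp/orig.headers"
grep '^>' "$tmp/out.fa" | sort > "$tmp/out.headers"

# Headers present only in the original: records that were lost or altered.
missing=$(comm -23 "$tmp/orig.headers" "$tmp/out.headers")
# Headers present only in the output: truncated or otherwise changed headers.
unexpected=$(comm -13 "$tmp/orig.headers" "$tmp/out.headers")

echo "missing from output:"
echo "$missing"
echo "unexpected in output:"
echo "$unexpected"
rm -rf "$tmp"
```

The two lists together show where each break happened: a truncated header in the second list usually sits next to the records named in the first.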

edit: I have found another instance where there is a skip in the read numbers and where the header is altered.

>r187.1 |SOURCES={GI=330827700,fw,273802-273903}|ERRORS={27:C,30:C,32:T,62:A,96:A,99:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
CGGGATCGTGCGGGTGGCTCTGTGCATCCTCGTTGGTTTGAGCGGGGGATGAGTCTGCCGTCAGTGCAGTGGGCCAGAGCAACACCCCGCCAAGCAACAAG

>CE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACAGGTCGTGGTTGTGCAGCCCGCCAGCAGCAATTCGAGAGTCATGGGACGCCCCACTATGATGGACGCTCCCACCACGCCAGCATGCAGACCGTGCATCT

>r195.2 |SOURCES={GI=330827700,bw,2584285-2584386}|ERRORS={36:G,40:A,42:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACCCGCAGCCGACCAAATTGCAGCTGCTGGGGAACGGTGCAGATGGCTACCGGTTGGGTTGATGCTTGCGGCTGCTGGCGAAACTGCACCCGAAGTCGGGC


If that is true, you have found a bug. Can you make an example available for download? Quoting it here is unfortunately not enough as \n may be quoted wrongly.


I also tried piping through parallel to cat on a different FASTA file with simpler headers (to see whether the headers of the original file were the problem), but it still did the same thing.

Also, the exact command I use is:

cat test.fna | parallel -k --gnu --block 100k --recstart '>' --pipe cat > results


It gives exactly the same on 3 of my systems:

$ md5sum test.fna results
cc27ec20250c65fdbcc0e23fa132eb83  test.fna
cc27ec20250c65fdbcc0e23fa132eb83  results

So what is hitting you is something on your local system. This changes the bug from a simple fix to harder debugging, and that should not be done on Biostars.org. Post to bug-parallel@gnu.org and follow "REPORTING BUGS" in 'man parallel'.
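On a system where the checksums do differ, cmp can show where the two files first diverge, which may be useful detail for a bug report. A minimal sketch with made-up files; the contents below are illustrative only and the exact wording of the cmp message varies between implementations.

```shell
tmp=$(mktemp -d)

# Made-up input and a hypothetical damaged output.
printf '>r1\nACGT\n>r2\nGGCC\n' > "$tmp/test.fna"
printf '>r1\nACGT\n>CC\n' > "$tmp/results"

# cmp -s is silent and only sets the exit status: 0 if identical, 1 if different.
if cmp -s "$tmp/test.fna" "$tmp/results"; then
    status=identical
else
    status=different
    # Without -s, cmp reports the byte and line number of the first difference.
    cmp "$tmp/test.fna" "$tmp/results" || true
fi
echo "files are $status"
rm -rf "$tmp"
```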


Okay, thank you very much.


Did you ever find a resolution to this issue? I have experienced the same issue with GNU parallel 20160422.