BBMap's repair.sh swaps sequences between input files
1
0
Entering edit mode
14 months ago
Ram 43k

This is a bit of an XY problem but please go with it just for now.

I have 10X scRNAseq I1, I2, R1 and R2 files. From these, I have extracted a subset of R2 reads into a subset.R2 file. I've also used filterbyname.sh to extract the same named subset from the I1 to a subset.I1 file. However, the subset.R2 and subset.I1 are in different read-name order, so to fix that, I used repair.sh like so:

/utils/bbmap/repair.sh \
  in1=subset.R2.fastq.gz \
  in2=subset.I1.fastq.gz \
  out1=subset.R2.repaired.fastq.gz \
  out2=subset.I1.repaired.fastq.gz \
  outs=R2I1.singletons.fastq.gz \
  repair

Here are the first 10 reads from my subset.R2 and subset.I1 input and output files:

for f in subset.R2.fastq.gz subset.R2.repaired.fastq.gz subset.I1.fastq.gz subset.I1.repaired.fastq.gz
do
  echo $f
  bioawk -c fastx 'NR<11{print $name, $comment, $seq} NR==11{exit 0}' $f
done
subset.R2.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GATTGACCTTAAGTTCATTGACACCACCTCCAAGTTTGGCCATGGCCGCTTCCAGACCATGGAGGAGAAGAAAGCATTCATGGGACCACT
A00431:359:H7CYNDSX3:4:1101:1416:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     ACAAGATCGGCAAGCCCCACACTGTCCCTTGCAAGGTGACAGGCCGCTGCGGCTCTGTGCTGGTACGCCTCATCCCTGCACCCAGGGGCA
A00431:359:H7CYNDSX3:4:1101:1181:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GGTGGCTCACACCTGTATTCCCAGCTCTTTGGGAGGCTGAGGCAAGAGGATCACTTAAAGTCAGGAGTTCAAAACCAGCCTGGGCAACAT
A00431:359:H7CYNDSX3:4:1101:2446:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGAGATGGAGAAACTAAAGATGCAAAACACAGAGGAATACAGGCCAGGCACGGTGGCTCACGCCTGTAATCCTAACACTTCGGGAGGCC
A00431:359:H7CYNDSX3:4:1101:2899:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTGGTATCAACGCAGAGTACATGGGGGTAGCGGTGGCTTAAGCCGCGCGGAGCAGCGCAACCTGGGTCGCTCCCTGCTTCGCCGCCGCCT
A00431:359:H7CYNDSX3:4:1101:2862:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GAGTAGTCGCATTGATGATCTGGAAAAGAATATCGCGGACCTCATGACACAGGCTGGGGTGGAAGAACTGGAAAGTGAAAACAAGATACC
A00431:359:H7CYNDSX3:4:1101:3224:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGCAGTGGTATCAACGCAGAGTACATGGGATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGACCCGGCCGGGGGGCGGGCGCCGG
A00431:359:H7CYNDSX3:4:1101:3170:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     CCAGGATGCTATAAAATCACCACGATCTTTAGCCATGCACAAACGGTAGTTTTGTGTGTTGGCTGCTCCACTGTCCTCTGCCAGCCTACA
A00431:359:H7CYNDSX3:4:1101:3278:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AATGGTGCTCACCATGCTTCCAGCTAACAGGTCTAGAAAACCAGCTTGCGAATAACAGTCCCCGTGGCCATCCCTGTGAGGGTGACGTTA
A00431:359:H7CYNDSX3:4:1101:3748:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     TTTCATTAGACTCCAGTGGTCTACCTTGCACTTTGAGTGAAACTTTTTCCCATGAATAATTTTGTGAAATCATGCATTTGGCACATGGAA
subset.R2.repaired.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1181:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1416:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2446:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2862:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2899:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3170:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3224:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3278:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3748:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
subset.I1.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1181:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1199:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1416:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2446:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2862:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2899:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3170:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3224:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3278:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3748:1000   1:N:0:CCCAGCTTCT+GTTTGGTGTC     CCCAGCTTCT
subset.I1.repaired.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GATTGACCTTAAGTTCATTGACACCACCTCCAAGTTTGGCCATGGCCGCTTCCAGACCATGGAGGAGAAGAAAGCATTCATGGGACCACT
A00431:359:H7CYNDSX3:4:1101:1181:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GGTGGCTCACACCTGTATTCCCAGCTCTTTGGGAGGCTGAGGCAAGAGGATCACTTAAAGTCAGGAGTTCAAAACCAGCCTGGGCAACAT
A00431:359:H7CYNDSX3:4:1101:1416:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     ACAAGATCGGCAAGCCCCACACTGTCCCTTGCAAGGTGACAGGCCGCTGCGGCTCTGTGCTGGTACGCCTCATCCCTGCACCCAGGGGCA
A00431:359:H7CYNDSX3:4:1101:2446:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGAGATGGAGAAACTAAAGATGCAAAACACAGAGGAATACAGGCCAGGCACGGTGGCTCACGCCTGTAATCCTAACACTTCGGGAGGCC
A00431:359:H7CYNDSX3:4:1101:2862:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GAGTAGTCGCATTGATGATCTGGAAAAGAATATCGCGGACCTCATGACACAGGCTGGGGTGGAAGAACTGGAAAGTGAAAACAAGATACC
A00431:359:H7CYNDSX3:4:1101:2899:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTGGTATCAACGCAGAGTACATGGGGGTAGCGGTGGCTTAAGCCGCGCGGAGCAGCGCAACCTGGGTCGCTCCCTGCTTCGCCGCCGCCT
A00431:359:H7CYNDSX3:4:1101:3170:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     CCAGGATGCTATAAAATCACCACGATCTTTAGCCATGCACAAACGGTAGTTTTGTGTGTTGGCTGCTCCACTGTCCTCTGCCAGCCTACA
A00431:359:H7CYNDSX3:4:1101:3224:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGCAGTGGTATCAACGCAGAGTACATGGGATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGACCCGGCCGGGGGGCGGGCGCCGG
A00431:359:H7CYNDSX3:4:1101:3278:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AATGGTGCTCACCATGCTTCCAGCTAACAGGTCTAGAAAACCAGCTTGCGAATAACAGTCCCCGTGGCCATCCCTGTGAGGGTGACGTTA
A00431:359:H7CYNDSX3:4:1101:3748:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     TTTCATTAGACTCCAGTGGTCTACCTTGCACTTTGAGTGAAACTTTTTCCCATGAATAATTTTGTGAAATCATGCATTTGGCACATGGAA

As you can see, the sequences have been swapped. Why did repair.sh do this? Did it look at the 1: and 2: in the $comment field, compare it to the 1 and 2 in the in1/in2 and decide to "fix" that by writing the "right" 1: to the out1 and 2: to the out2 files? If so, then it will mean that if I swap R1 and R2 by mistake in the input, the output will "fix" my output so my out2 will correspond to my in1? That's just a tool doing too much, no?

repair.sh bbmap • 1.1k views
ADD COMMENT
0
Entering edit mode

Based on GenoMax's recommendation (offline), I swapped my files so in1 was my subset.I1 and in2 was my subset.R2 and the output looks right now. Next up, checking what happens when I give in2=subset.R2 and in1=subset.I2 with the option allowidenticalnames (ain=t)

ADD REPLY
0
Entering edit mode

It looks like repair.sh cannot handle it if both input files have 2:. See my trial run on subset.R2 and subset.I2:

/utils/bbmap/repair.sh \
  in2=subset.R2.fastq.gz \
  in1=subset.I2.fastq.gz \
  out2=subset.R2.repaired.fastq.gz \
  out1=subset.I2.repaired.fastq.gz \
  repair \
  ain=t

Checking the first 10 reads of input and output files:

for f in subset.R2.fastq.gz subset.R2.repaired.fastq.gz suset.I2.fastq.gz subset.I2.repaired.fastq.gz
do
  echo $f
  bioawk -c fastx 'NR<11{print $name, $comment, $seq} NR==11{exit 0}' $f
done

subset.R2.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GATTGACCTTAAGTTCATTGACACCACCTCCAAGTTTGGCCATGGCCGCTTCCAGACCATGGAGGAGAAGAAAGCATTCATGGGACCACT
A00431:359:H7CYNDSX3:4:1101:1416:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     ACAAGATCGGCAAGCCCCACACTGTCCCTTGCAAGGTGACAGGCCGCTGCGGCTCTGTGCTGGTACGCCTCATCCCTGCACCCAGGGGCA
A00431:359:H7CYNDSX3:4:1101:1181:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GGTGGCTCACACCTGTATTCCCAGCTCTTTGGGAGGCTGAGGCAAGAGGATCACTTAAAGTCAGGAGTTCAAAACCAGCCTGGGCAACAT
A00431:359:H7CYNDSX3:4:1101:2446:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGAGATGGAGAAACTAAAGATGCAAAACACAGAGGAATACAGGCCAGGCACGGTGGCTCACGCCTGTAATCCTAACACTTCGGGAGGCC
A00431:359:H7CYNDSX3:4:1101:2899:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTGGTATCAACGCAGAGTACATGGGGGTAGCGGTGGCTTAAGCCGCGCGGAGCAGCGCAACCTGGGTCGCTCCCTGCTTCGCCGCCGCCT
A00431:359:H7CYNDSX3:4:1101:2862:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GAGTAGTCGCATTGATGATCTGGAAAAGAATATCGCGGACCTCATGACACAGGCTGGGGTGGAAGAACTGGAAAGTGAAAACAAGATACC
A00431:359:H7CYNDSX3:4:1101:3224:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGCAGTGGTATCAACGCAGAGTACATGGGATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGACCCGGCCGGGGGGCGGGCGCCGG
A00431:359:H7CYNDSX3:4:1101:3170:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     CCAGGATGCTATAAAATCACCACGATCTTTAGCCATGCACAAACGGTAGTTTTGTGTGTTGGCTGCTCCACTGTCCTCTGCCAGCCTACA
A00431:359:H7CYNDSX3:4:1101:3278:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AATGGTGCTCACCATGCTTCCAGCTAACAGGTCTAGAAAACCAGCTTGCGAATAACAGTCCCCGTGGCCATCCCTGTGAGGGTGACGTTA
A00431:359:H7CYNDSX3:4:1101:3748:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     TTTCATTAGACTCCAGTGGTCTACCTTGCACTTTGAGTGAAACTTTTTCCCATGAATAATTTTGTGAAATCATGCATTTGGCACATGGAA
subset.R2.repaired.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GATTGACCTTAAGTTCATTGACACCACCTCCAAGTTTGGCCATGGCCGCTTCCAGACCATGGAGGAGAAGAAAGCATTCATGGGACCACT
A00431:359:H7CYNDSX3:4:1101:1416:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     ACAAGATCGGCAAGCCCCACACTGTCCCTTGCAAGGTGACAGGCCGCTGCGGCTCTGTGCTGGTACGCCTCATCCCTGCACCCAGGGGCA
A00431:359:H7CYNDSX3:4:1101:1181:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:2446:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:2899:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTGGTATCAACGCAGAGTACATGGGGGTAGCGGTGGCTTAAGCCGCGCGGAGCAGCGCAACCTGGGTCGCTCCCTGCTTCGCCGCCGCCT
A00431:359:H7CYNDSX3:4:1101:2862:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3224:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGCAGTGGTATCAACGCAGAGTACATGGGATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGACCCGGCCGGGGGGCGGGCGCCGG
A00431:359:H7CYNDSX3:4:1101:3170:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3278:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3748:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
subset.I2.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1181:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:1199:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:1416:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:2446:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:2862:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:2899:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3170:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3224:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3278:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3748:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
subset.I2.repaired.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:1416:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:1181:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GGTGGCTCACACCTGTATTCCCAGCTCTTTGGGAGGCTGAGGCAAGAGGATCACTTAAAGTCAGGAGTTCAAAACCAGCCTGGGCAACAT
A00431:359:H7CYNDSX3:4:1101:2446:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AAGAGATGGAGAAACTAAAGATGCAAAACACAGAGGAATACAGGCCAGGCACGGTGGCTCACGCCTGTAATCCTAACACTTCGGGAGGCC
A00431:359:H7CYNDSX3:4:1101:2899:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:2862:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GAGTAGTCGCATTGATGATCTGGAAAAGAATATCGCGGACCTCATGACACAGGCTGGGGTGGAAGAACTGGAAAGTGAAAACAAGATACC
A00431:359:H7CYNDSX3:4:1101:3224:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     GTTTGGTGTC
A00431:359:H7CYNDSX3:4:1101:3170:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     CCAGGATGCTATAAAATCACCACGATCTTTAGCCATGCACAAACGGTAGTTTTGTGTGTTGGCTGCTCCACTGTCCTCTGCCAGCCTACA
A00431:359:H7CYNDSX3:4:1101:3278:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     AATGGTGCTCACCATGCTTCCAGCTAACAGGTCTAGAAAACCAGCTTGCGAATAACAGTCCCCGTGGCCATCCCTGTGAGGGTGACGTTA
A00431:359:H7CYNDSX3:4:1101:3748:1000   2:N:0:CCCAGCTTCT+GTTTGGTGTC     TTTCATTAGACTCCAGTGGTCTACCTTGCACTTTGAGTGAAACTTTTTCCCATGAATAATTTTGTGAAATCATGCATTTGGCACATGGAA

As you can see, some I2 reads are in the R2 output and some R2 reads are in the I2 output.

ADD REPLY
0
Entering edit mode

Please check your input files and verify that they are not mixed.

ADD REPLY
0
Entering edit mode

I've printed the first 10 reads from both input and output files above - the input is not mixed at all. You can see that from the matching IDs moved between input and output files.

ADD REPLY
2
Entering edit mode
14 months ago
GenoMax 141k

Based on some testing done with @Ram offline, I was able to replicate the issue he sees above. While the pairing works properly repair process shuffles the sequences randomly between the two output files leading to what he observes above.


Solution:

Work around for this problem is to change the (2:N:) in the header of one of the files to (1:N:) to do the repair. This brings the sequences back in compliance temporarily. Once the repair is done then the headers can be changed back to the original (2:N:) designation.

@Ram is testing this now. But we expect it to work.

ADD COMMENT
0
Entering edit mode

Yes, this works. To add broader context, this is the overall approach to re-order a subset of the I1,I2, and R1 files based on the order of reads in the R2 subset:

  1. repair using in2=R2, in1=R1, out2=R2.repaired, out1=R1.repaired
  2. repair using in2=R2, in1=I1, out2=R2.repaired, out1=I1.repaired (I tested without R2.repaired from step-1 being overwritten, the files are identical)
  3. Create a temp FASTQ file I2.2_to_1.temp replacing all 2: to 1: (sed '1~4s/ 2:/ 1:/')
  4. repair using in2=R2, in1=I2.2_to_1.temp, out2=R2.repaired, out1=I2.pre_1_to_2.repaired
  5. Change all :1 to : 2 in the temp output file (sed '1~4s/ 1:/ 2:/') to create I2.repaired
  6. Delete temp input (I2.2_to_1.temp) and output (I2.pre_1_to_2.repaired) files

This will work if all read files have reads with the exact same $names but in different orders.

Of course, a more straightforward/simpler approach would be to independently sort each FASTQ file, but sorting takes a really long time and repair is relatively much faster.

ADD REPLY

Login before adding your answer.

Traffic: 1999 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6