This is a bit of an XY problem, but please go with it for now.
I have 10X scRNA-seq I1, I2, R1 and R2 FASTQ files. From these, I extracted a subset of R2 reads into a subset.R2 file, and used filterbyname.sh to extract the reads with the same names from I1 into a subset.I1 file. However, subset.R2 and subset.I1 are in different read-name order, so to fix that I ran repair.sh like so:
/utils/bbmap/repair.sh \
in1=subset.R2.fastq.gz \
in2=subset.I1.fastq.gz \
out1=subset.R2.repaired.fastq.gz \
out2=subset.I1.repaired.fastq.gz \
outs=R2I1.singletons.fastq.gz \
repair
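For reference, whether two files list their reads in the same order can be checked directly, both before and after a step like this. The following is only a minimal sketch, assuming gzip-compressed FASTQ with exactly four lines per record; names_in_sync is a hypothetical helper name, not part of BBMap:

```shell
# names_in_sync (hypothetical helper): compare the read names of two
# gzip-compressed FASTQ files position by position and report how many differ.
# Assumes four lines per record; the name is the first whitespace-separated
# field of each header line.
names_in_sync() (
    names_a=$(mktemp)
    names_b=$(mktemp)
    gzip -dc "$1" | awk 'NR % 4 == 1 { print $1 }' > "$names_a"
    gzip -dc "$2" | awk 'NR % 4 == 1 { print $1 }' > "$names_b"
    # paste pads the shorter file with empty fields, so a length
    # difference also shows up as mismatching positions
    paste "$names_a" "$names_b" |
        awk -F '\t' '$1 != $2 { bad++ } END { print bad + 0, "of", NR, "positions differ" }'
    rm -f "$names_a" "$names_b"
)
```

Calling it as names_in_sync subset.R2.fastq.gz subset.I1.fastq.gz prints a single summary line; 0 differing positions means the two files are in the same read-name order.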
Here are the first 10 reads from my subset.R2 and subset.I1 input and output files:
for f in subset.R2.fastq.gz subset.R2.repaired.fastq.gz subset.I1.fastq.gz subset.I1.repaired.fastq.gz
do
    echo "$f"
    bioawk -c fastx 'NR<11{print $name, $comment, $seq} NR==11{exit 0}' "$f"
done
subset.R2.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GATTGACCTTAAGTTCATTGACACCACCTCCAAGTTTGGCCATGGCCGCTTCCAGACCATGGAGGAGAAGAAAGCATTCATGGGACCACT
A00431:359:H7CYNDSX3:4:1101:1416:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC ACAAGATCGGCAAGCCCCACACTGTCCCTTGCAAGGTGACAGGCCGCTGCGGCTCTGTGCTGGTACGCCTCATCCCTGCACCCAGGGGCA
A00431:359:H7CYNDSX3:4:1101:1181:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GGTGGCTCACACCTGTATTCCCAGCTCTTTGGGAGGCTGAGGCAAGAGGATCACTTAAAGTCAGGAGTTCAAAACCAGCCTGGGCAACAT
A00431:359:H7CYNDSX3:4:1101:2446:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC AAGAGATGGAGAAACTAAAGATGCAAAACACAGAGGAATACAGGCCAGGCACGGTGGCTCACGCCTGTAATCCTAACACTTCGGGAGGCC
A00431:359:H7CYNDSX3:4:1101:2899:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GTGGTATCAACGCAGAGTACATGGGGGTAGCGGTGGCTTAAGCCGCGCGGAGCAGCGCAACCTGGGTCGCTCCCTGCTTCGCCGCCGCCT
A00431:359:H7CYNDSX3:4:1101:2862:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GAGTAGTCGCATTGATGATCTGGAAAAGAATATCGCGGACCTCATGACACAGGCTGGGGTGGAAGAACTGGAAAGTGAAAACAAGATACC
A00431:359:H7CYNDSX3:4:1101:3224:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC AAGCAGTGGTATCAACGCAGAGTACATGGGATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGACCCGGCCGGGGGGCGGGCGCCGG
A00431:359:H7CYNDSX3:4:1101:3170:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC CCAGGATGCTATAAAATCACCACGATCTTTAGCCATGCACAAACGGTAGTTTTGTGTGTTGGCTGCTCCACTGTCCTCTGCCAGCCTACA
A00431:359:H7CYNDSX3:4:1101:3278:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC AATGGTGCTCACCATGCTTCCAGCTAACAGGTCTAGAAAACCAGCTTGCGAATAACAGTCCCCGTGGCCATCCCTGTGAGGGTGACGTTA
A00431:359:H7CYNDSX3:4:1101:3748:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC TTTCATTAGACTCCAGTGGTCTACCTTGCACTTTGAGTGAAACTTTTTCCCATGAATAATTTTGTGAAATCATGCATTTGGCACATGGAA
subset.R2.repaired.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1181:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1416:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2446:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2862:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2899:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3170:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3224:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3278:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3748:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
subset.I1.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1181:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1199:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:1416:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2446:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2862:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:2899:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3170:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3224:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3278:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
A00431:359:H7CYNDSX3:4:1101:3748:1000 1:N:0:CCCAGCTTCT+GTTTGGTGTC CCCAGCTTCT
subset.I1.repaired.fastq.gz
A00431:359:H7CYNDSX3:4:1101:1199:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GATTGACCTTAAGTTCATTGACACCACCTCCAAGTTTGGCCATGGCCGCTTCCAGACCATGGAGGAGAAGAAAGCATTCATGGGACCACT
A00431:359:H7CYNDSX3:4:1101:1181:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GGTGGCTCACACCTGTATTCCCAGCTCTTTGGGAGGCTGAGGCAAGAGGATCACTTAAAGTCAGGAGTTCAAAACCAGCCTGGGCAACAT
A00431:359:H7CYNDSX3:4:1101:1416:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC ACAAGATCGGCAAGCCCCACACTGTCCCTTGCAAGGTGACAGGCCGCTGCGGCTCTGTGCTGGTACGCCTCATCCCTGCACCCAGGGGCA
A00431:359:H7CYNDSX3:4:1101:2446:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC AAGAGATGGAGAAACTAAAGATGCAAAACACAGAGGAATACAGGCCAGGCACGGTGGCTCACGCCTGTAATCCTAACACTTCGGGAGGCC
A00431:359:H7CYNDSX3:4:1101:2862:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GAGTAGTCGCATTGATGATCTGGAAAAGAATATCGCGGACCTCATGACACAGGCTGGGGTGGAAGAACTGGAAAGTGAAAACAAGATACC
A00431:359:H7CYNDSX3:4:1101:2899:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC GTGGTATCAACGCAGAGTACATGGGGGTAGCGGTGGCTTAAGCCGCGCGGAGCAGCGCAACCTGGGTCGCTCCCTGCTTCGCCGCCGCCT
A00431:359:H7CYNDSX3:4:1101:3170:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC CCAGGATGCTATAAAATCACCACGATCTTTAGCCATGCACAAACGGTAGTTTTGTGTGTTGGCTGCTCCACTGTCCTCTGCCAGCCTACA
A00431:359:H7CYNDSX3:4:1101:3224:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC AAGCAGTGGTATCAACGCAGAGTACATGGGATCAGATCAAAACCAACCCGGTCAGCCCCTCTCCGGACCCGGCCGGGGGGCGGGCGCCGG
A00431:359:H7CYNDSX3:4:1101:3278:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC AATGGTGCTCACCATGCTTCCAGCTAACAGGTCTAGAAAACCAGCTTGCGAATAACAGTCCCCGTGGCCATCCCTGTGAGGGTGACGTTA
A00431:359:H7CYNDSX3:4:1101:3748:1000 2:N:0:CCCAGCTTCT+GTTTGGTGTC TTTCATTAGACTCCAGTGGTCTACCTTGCACTTTGAGTGAAACTTTTTCCCATGAATAATTTTGTGAAATCATGCATTTGGCACATGGAA
As you can see, the sequences have been swapped. Why did repair.sh do this? Did it look at the 1: and 2: in the $comment field, compare them to the 1 and 2 in in1/in2, and decide to "fix" that by writing the "right" 1: reads to out1 and the 2: reads to out2? If so, does that mean that if I swap R1 and R2 by mistake in the input, repair.sh will "fix" the output so that my out2 corresponds to my in1? That's just a tool doing too much, no?
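One way to test that hypothesis on whole files, rather than on the first 10 reads, would be to tally the read-number flag in each input and output file. This is a rough sketch using plain awk, assuming gzip-compressed FASTQ with four lines per record and Illumina-style headers where the comment starts with 1: or 2:; count_read_flags is a hypothetical helper name:

```shell
# count_read_flags (hypothetical helper): tally the read-number flag
# (the "1" or "2" that starts the comment field of an Illumina-style header)
# over every record of a gzip-compressed, four-line-per-record FASTQ file.
count_read_flags() {
    gzip -dc "$1" | awk '
        NR % 4 == 1 {                  # header lines only
            split($2, c, ":")          # comment looks like 2:N:0:INDEX
            n[c[1]]++
        }
        END { for (k in n) print k, n[k] }
    ' | sort
}
```

Each output line is a flag followed by the number of reads carrying it. If repair.sh really routes reads by that flag, every output file should show a single flag regardless of which input was given as in1 or in2.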
Based on GenoMax's recommendation (offline), I swapped my files so in1 was my subset.I1 and in2 was my subset.R2, and the output looks right now. Next up, checking what happens when I give in2=subset.R2 and in1=subset.I2 with the option allowidenticalnames (ain=t).
It looks like repair.sh cannot handle it if both input files have 2:. See my trial run on subset.R2 and subset.I2:
Checking the first 10 reads of input and output files:
As you can see, some I2 reads are in the R2 output and some R2 reads are in the I2 output.
Please check your input files and verify that they are not mixed.
I've printed the first 10 reads from both input and output files above, and the input is not mixed at all. You can see that from the matching IDs that moved between the input and output files.
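To back that up beyond the first 10 reads, sequence lengths can be tallied per file. In the listings above the index reads are 10 bases long and the R2 reads are 90, so a file holding both lengths would be mixed. This is only a sketch, assuming gzip-compressed FASTQ with four lines per record, and assuming the I2 reads are as short as the I1 reads; seq_length_tally is a hypothetical helper name:

```shell
# seq_length_tally (hypothetical helper): count how many reads of each
# sequence length a gzip-compressed, four-line-per-record FASTQ file holds.
seq_length_tally() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { n[length($0)]++ }    # sequence lines only
        END { for (len in n) print len, n[len] }
    ' | sort -n
}
```

Each output line is a sequence length followed by the number of reads of that length. An unmixed input file should report a single length, and any second length in a repaired output counts the reads that landed in the wrong file.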