My paired-end .bcl files are slightly corrupted, so that after aligning, R2.fastq is slightly smaller than R1.fastq. Using grep I've found that none of the R2 records are randomly truncated; instead, whole records are missing. Unfortunately, these records are missing at random positions throughout the file rather than in one large chunk. I would like to use SeqIO to remove the records that are missing in R2.fastq from R1.fastq and I.fastq.
Is there a way to find the intersection of the record headers in SeqIO iterators of R1, R2, and I?
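One way to get that intersection is sketched below. Note that this is not SeqIO: it reads plain four-line FASTQ records with the standard library only, which is an assumption about the files (uncompressed, no wrapped sequence lines). It also assumes the read ID is everything in the header before the first space, as in Illumina headers like `@A01073:21:... 1:N:0:...`. The function names `read_fastq`, `shared_ids`, and `write_shared` are made up for this illustration.

```python
from itertools import islice


def read_fastq(path):
    """Yield (read_id, record_text) for each four-line FASTQ record."""
    with open(path) as handle:
        while True:
            lines = list(islice(handle, 4))
            if not lines:
                break
            if len(lines) < 4:
                raise ValueError(f"truncated record at end of {path}")
            # Assumption: the ID is the header up to the first space, minus '@'.
            read_id = lines[0][1:].partition(" ")[0].strip()
            yield read_id, "".join(lines)


def shared_ids(paths):
    """Return the set of read IDs present in every file in `paths`."""
    ids = None
    for path in paths:
        current = {read_id for read_id, _ in read_fastq(path)}
        ids = current if ids is None else ids & current
    return ids or set()


def write_shared(in_path, out_path, keep):
    """Copy records whose ID is in `keep`, preserving input order."""
    kept = 0
    with open(out_path, "w") as out:
        for read_id, text in read_fastq(in_path):
            if read_id in keep:
                out.write(text)
                kept += 1
    return kept
```

Typical use would be `keep = shared_ids(["R1.fastq", "R2.fastq", "I.fastq"])` followed by one `write_shared` call per input file. All IDs are held in memory at once, so very large runs may need a different approach.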
BBMap repair.sh works beautifully, thank you.
After running it on 50 samples, repair.sh worked for 15, but I continue to get a FASTQ header mismatch between R1 and R2 for the other 35. I spot-checked a line where alignment threw an error, and the headers were mismatched. I re-ran repair.sh and it returned 100% concordance between R1, R2, and I1 for all samples that continue to fail alignment.
Here's an example of a mismatched line after BBMap repair.sh:
R1
@A01073:21:HT22MDMXX:2:2369:0:4015602 1:N:0:GTCTCTCG
R2
@A01073:21:HT22MDMXX:2:2369:0:4015903 2:N:0:GTCTCTCG
I1
@A01073:21:HT22MDMXX:2:2369:0:4015602 1:N:0:GTCTCTCG
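A mismatch like this can be located without running the aligner by walking the files in lockstep and comparing read IDs record by record. The following is a minimal sketch under the same assumptions as a plain four-line, uncompressed FASTQ with the read ID before the first space; `read_ids` and `first_mismatch` are hypothetical helper names, not part of BBMap or SeqIO.

```python
from itertools import islice, zip_longest


def read_ids(path):
    """Yield the read ID from every fourth line (the header lines)."""
    with open(path) as handle:
        for header in islice(handle, 0, None, 4):
            # Assumption: the ID is the header up to the first space, minus '@'.
            yield header[1:].partition(" ")[0].strip()


def first_mismatch(paths):
    """Return (record_number, ids) for the first record whose IDs differ.

    Record numbers start at 1. A file that ends early contributes None,
    so a length difference is also reported. Returns None if all files agree.
    """
    iterators = [read_ids(path) for path in paths]
    for number, ids in enumerate(zip_longest(*iterators), start=1):
        if len(set(ids)) != 1:
            return number, ids
    return None
```

Calling `first_mismatch(["R1.fastq", "R2.fastq", "I1.fastq"])` shows which file is out of step at the first bad record, which helps tell a missing read apart from a reordered one.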
repair.sh is only meant to be used for paired-end files. If there is a concordance problem with index read files, then I suggest that you repair the R1/R2 files first. Take the read headers from the repaired files and fish out the matching reads from the index file using filterbyname.sh. Then you should have all three files in sync.