Question

paired-end fastq corruption cleanup: intersect of record headers in SeqIO.parse() iterators

0

2.9 years ago

kyle • 0

My paired-end .bcl files are slightly corrupted, so that after aligning, R2.fastq is slightly smaller than R1.fastq. Using grep, I've found that none of the R2 records are truncated; instead, R2 is only missing whole records. Unfortunately, these records are missing at random throughout the file rather than in one large chunk. I would like to use SeqIO to remove the records that are missing in R2.fastq from R1.fastq and I.fastq.

Is there a way to find the intersection of record headers across the SeqIO iterators for R1, R2, and I?

ngs seqio sequencing alignment • 1.1k views

ADD COMMENT • link updated 2.9 years ago by GenoMax 141k • written 2.9 years ago by kyle • 0
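The header intersection asked about above can be sketched in plain Python. This is a minimal illustration, not a SeqIO-based solution: the helper names `read_ids` and `shared_ids` are hypothetical, and the sketch assumes uncompressed, unwrapped 4-line FASTQ records whose read ID is the header text before the first space (as in Illumina-style headers). It holds every ID in memory, which may be impractical for very large files.

```python
from itertools import islice


def read_ids(path):
    """Yield the read ID of each record (header up to the first space, without '@')."""
    with open(path) as handle:
        # Assumption: 4 lines per record, so every 4th line is a header.
        for header in islice(handle, 0, None, 4):
            if header.strip():  # ignore a trailing blank line
                yield header[1:].split()[0]


def shared_ids(*paths):
    """Return the set of read IDs that occur in every one of the given files."""
    common = set(read_ids(paths[0]))
    for path in paths[1:]:
        common &= set(read_ids(path))
    return common


# Hypothetical usage:
# keep = shared_ids("R1.fastq", "R2.fastq", "I.fastq")
```

The resulting set can then be used to decide which records to keep in each file.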

score 1 · Answer 1 · 2021-05-25

1

2.9 years ago

GenoMax 141k

after aligning, R2.fastq is slightly smaller than R1.fastq.

Not sure what you mean by that. If your R1/R2 files are no longer in sync, then use repair.sh from the BBMap suite to bring them back in sync and move singletons into a new file.

ADD COMMENT • link 2.9 years ago by GenoMax 141k

0

BBMap repair.sh works beautifully, thank you.

ADD REPLY • link 2.9 years ago by kyle • 0

0

After running on 50 samples, repair.sh worked for 15, but I continue to get a FASTQ header mismatch between R1 and R2 for the other 35. I spot-checked a line where alignment threw an error, and the headers were mismatched. I re-ran repair.sh and it reported 100% concordance between R1, R2, and I1 for all samples that continue to fail alignment.

Here's an example of a mismatched line after bbmap/repair.sh:

R1

@A01073:21:HT22MDMXX:2:2369:0:4015602 1:N:0:GTCTCTCG

R2

@A01073:21:HT22MDMXX:2:2369:0:4015903 2:N:0:GTCTCTCG

I1

@A01073:21:HT22MDMXX:2:2369:0:4015602 1:N:0:GTCTCTCG

ADD REPLY • link 2.9 years ago by kyle • 0
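A spot check like the one above can be automated by walking the files in lockstep and reporting the first record whose read IDs disagree. The following is a rough sketch under the same assumptions as the headers shown (ID is the text before the first space; uncompressed, unwrapped 4-line records). The function names `read_ids` and `first_mismatch` are hypothetical.

```python
from itertools import islice, zip_longest


def read_ids(path):
    """Yield the read ID of each record (header up to the first space, without '@')."""
    with open(path) as handle:
        for header in islice(handle, 0, None, 4):
            if header.strip():
                yield header[1:].split()[0]


def first_mismatch(*paths):
    """Return (record_number, ids) for the first out-of-sync record, or None if all agree.

    A file that ends early contributes None for its ID, which also counts as a mismatch.
    """
    iterators = [read_ids(path) for path in paths]
    for number, ids in enumerate(zip_longest(*iterators), start=1):
        if len(set(ids)) != 1:
            return number, ids
    return None


# Hypothetical usage:
# print(first_mismatch("R1.fastq", "R2.fastq", "I1.fastq"))
```

Because only one record per file is examined at a time, this check needs very little memory, unlike a set-based comparison.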

1

repair.sh is only meant to be used for paired-end files. If there is a concordance problem with the index read files, then I suggest that you repair the R1/R2 files first. Then take the read headers from the repaired files and use filterbyname.sh to fish the matching reads out of the index file. You should then have all three files in sync.

ADD REPLY • link 2.9 years ago by GenoMax 141k
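The name-based filtering step described above can be illustrated in plain Python. This is only a sketch of the idea, not a reimplementation of filterbyname.sh: the function name `filter_fastq` is hypothetical, and it assumes uncompressed, unwrapped 4-line FASTQ records with the read ID before the first space in the header. `keep` would be the set of read IDs collected from the repaired R1 (or R2) file.

```python
def filter_fastq(in_path, out_path, keep):
    """Copy records whose read ID is in `keep` to out_path; return how many were written."""
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            # Assumption: each record is exactly 4 lines (header, sequence, '+', qualities).
            record = [src.readline() for _ in range(4)]
            if not record[0].strip():  # end of file or trailing blank line
                break
            read_id = record[0][1:].split()[0]
            if read_id in keep:
                dst.writelines(record)
                written += 1
    return written


# Hypothetical usage:
# filter_fastq("I1.fastq", "I1.filtered.fastq", keep)
```

Records are written in the order they appear in the input file, so the output is only in sync with the repaired R1/R2 files if those files kept the same relative read order. Whether that holds for a given repair run is not confirmed here and is worth checking before alignment.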