Question

Merge paired reads for error-correction

0

Entering edit mode

4.3 years ago

Mick ▴ 30

Hi, i have two fastq files for each of my samples which each contain about 1M reads (one file for forward and one for reverse reads). I have the demultiplexing and cutting of adapters done, so the reads are only genomic sequences. The reads are quite short (~100bp) and cover the same genomic sequence (ultra deep sequencing). Forward and Reverse read overlap for the full 100bp. I would like to merge the forward and reverse reads into one consensus sequence, such that I am left with only one fastq file. Reads should have Ns at positions where forward and reverse read did not match.

I know there are a lot of programs like bbmerge.sh, pear etc. but none of them seem to be able to do exactly what I want, which is only keep matching bases and have the rest as Ns.

Maybe someone knows a tool that does this, if not I'll probably align both files seperately with bwa and then go through the reads with a python script and do the matching myself.

Hope my explanation was clear enough, thank you in advance.

merge paired-reads • 912 views

ADD COMMENT • link 4.3 years ago by Mick ▴ 30

0

Entering edit mode

which is only keep matching bases and have the rest as Ns.

Since this is a very specific (and not common) request you may need to (or find someone who can) code this yourself. If you know that the reads overlap for sure use the original read files rather than going through the alignments.

That said, I see the following note in bbmerge.sh guide:

If the bases differ and the scores are equal, the base will be replaced with N.

If you want to put a feature request in via BBMap source forge ticket page, Brian Bushnell may be able to change bbmerge code to accommodate your request i.e. replace with N's if the bases are different irrespective of score.

ADD REPLY • link 4.3 years ago by GenoMax 141k

0

Entering edit mode

Thank you very much for the quick reply. I will write Brian on sourceforge and see if it can be done. I will also start working on my own script. I'm not a programmer but I think I can put something together. Why do you think it would be better to use the fastq files directly and not the mapped reads. I think the mapped reads would be easier to handle in the rare event, where one of the reads has an insert or deletion, that is not present in the reverse read. One could just compare the cigar strings to one another and go from there.

If i use the fastq-files directly should i use the needleman-wunsch algorithm to match the strings? And what do i do in the case of inserts or deletions that only occur in one read?

ADD REPLY • link 4.3 years ago by Mick ▴ 30