Question

Combining Forward and Reverse Reads

5 weeks ago

SineWave • 0

Hello Biostar community,

I'm currently working on aligning paired-end FASTQ files from a MIG-seq experiment to a reference genome. After filtering with fastp and removing chloroplast DNA using Bowtie2 with a chloroplast genome as the reference, I have forward and reverse read files that no longer match exactly in length (number of reads).

I encountered an issue when using the following Bowtie2 command:

   bowtie2 -x <index_prefix> -1 <forward_fastq> -2 <reverse_fastq> -S <output_sam>

The process terminated with an error message: "Error, fewer reads in file specified with -1 than in file specified with -2."

It therefore seems that the best course of action is to combine the forward and reverse reads first, and then align the combined FASTQ file to the index. My question is: what is the best way to combine these reads?

I've noticed that some researchers concatenate the forward and reverse reads before alignment. However, this approach seems somewhat crude. If concatenation is indeed a valid step, should I consider taking the reverse complement of the reverse read before concatenating? This way, all reads would be in the same direction, which seems to be an expectation of Bowtie2.

As I'm relatively new to bioinformatics, I apologize if my question is not well-informed or lacks clarity. Any guidance or suggestions would be greatly appreciated.

Thank you in advance for your help.

Best regards,

Alexander

genome alignment concatenation sequencing • 385 views

score 1 · Answer 1 · 2024-03-21

You should not concatenate forward and reverse reads. There is usually a gap between the two reads of a pair, and that gap is informative for mapping: large deviations from the expected gap can be used for structural variant (SV) detection, for example. If your reads instead overlap, concatenating them creates a spurious repeated element within every read, and your mapping will suffer.

What I suspect has happened is that, when removing reads that mapped to the chloroplast, you did not remove whole read pairs, and so broke the order of the reads. Remember, every read in the R1 file is paired with the read at the same position in the R2 file. If you remove more reads from R1 than from R2 (or the other way round), the association between forward and reverse reads is lost from that point on.
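To confirm that this is what happened, you can compare the two files directly. The following is only a minimal sketch: it assumes uncompressed FASTQ files with exactly four lines per record, and the function names are made up for illustration.

```shell
# Sketch: check whether two paired FASTQ files are still in sync.
# Assumes uncompressed FASTQ with exactly 4 lines per record.
# Function names are hypothetical, chosen for this example.

# Print the number of records in a FASTQ file (line count / 4).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Print one read ID per record: the first field of the header line,
# without the leading "@" and without a trailing /1 or /2.
read_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Return 0 if both files list the same read IDs in the same order.
pairs_in_sync() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    read_ids "$1" > "$tmp1"
    read_ids "$2" > "$tmp2"
    cmp -s "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}
```

For example, `count_reads R1.fastq` and `count_reads R2.fastq` should print the same number, and `pairs_in_sync R1.fastq R2.fastq && echo "in sync"` should report that the pairs line up. If either check fails, the filtering step dropped single reads instead of whole pairs.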

I think a better solution is to add the chloroplast sequence to your reference genome and map your adapter- and quality-trimmed reads against the combined reference. Done this way, only reads whose primary mapping location is the chloroplast genome will be removed, and it will likely resolve your error as well.
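As a rough sketch of that approach: the function names, file names and index prefix below are hypothetical, and the second function assumes `bowtie2-build` and `bowtie2` are on your PATH. The mapping command itself is the one from the question, just pointed at the combined index.

```shell
# Sketch: one Bowtie2 index built from nuclear reference + chloroplast.
# All function and file names here are hypothetical.

# Concatenate FASTA files into one combined reference.
# Usage: make_combined_reference combined.fa genome.fa chloroplast.fa
make_combined_reference() {
    out=$1
    shift
    # "awk 1" prints every line with a trailing newline, so a file that
    # lacks a final newline cannot glue its last sequence line to the
    # next file's header.
    awk 1 "$@" > "$out"
}

# Index the combined reference, then map the trimmed read pairs to it.
# Usage: map_to_combined combined.fa index_prefix R1.fastq R2.fastq out.sam
map_to_combined() {
    combined_fa=$1
    index_prefix=$2
    r1=$3
    r2=$4
    out_sam=$5
    bowtie2-build "$combined_fa" "$index_prefix" &&
        bowtie2 -x "$index_prefix" -1 "$r1" -2 "$r2" -S "$out_sam"
}
```

The inputs to `map_to_combined` should be the trimmed fastp output, not the files left over from the earlier chloroplast-removal step, since those are the ones that are out of sync. After mapping, alignments to the chloroplast sequence can be dropped by reference name.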

score 0 · Answer 2 · 2024-03-21

GenoMax 141k

You could also use bbduk.sh (GUIDE) from the BBMap suite in filter mode with the chloroplast reference, or use bbsplit.sh against the "reference + chloroplast" genome, to filter (or, with bbsplit, bin) the reads that map to the chloroplast.

ADD COMMENT • link 5 weeks ago by GenoMax 141k