Question

Paired reads positions in FASTQ files

7.1 years ago

riccardo ▴ 90

Hello, I have a question about paired-end sequencing. When you have the FASTQ files for read 1 and read 2 from a paired-end run, is it correct to assume that if read X is at position 1 of the R1 file, its mate is at the same position in the R2 file? If this is not true, you would need to check the names of millions of sequences, which would be very time consuming. And if even one read is missing or out of order in R1 or R2, the reads would be paired incorrectly. Could this happen? Do you know whether aligners check the read names when they align paired reads, or do they just rely on the position of the reads? Thank you.

Best

sequencing • 5.6k views

ADD COMMENT • link updated 7.1 years ago by GenoMax 144k • written 7.1 years ago by riccardo ▴ 90

score 1 · Answer 1 · 2017-06-23

That is a correct assumption, and tools generally rely on read order rather than matching the names. That is also why tools that filter for quality give you 4 files: 2 paired and 2 unpaired. That said, I've actually had problems with proper pairing in the past, so it is always a good idea to check the first sequences (basically comparing the sequence names, which also tells you a lot about the sequencing run).

score 1 · Answer 2 · 2017-06-23

7.1 years ago

GenoMax 144k

The order of reads in the R1/R2 files can get out of sync if you scan/trim the two files independently. That is why it is recommended to use a paired-end aware scan/trim program on the pair of files together.

If you happen to have files that have gone out of sync, you can use repair.sh from the BBMap suite to restore the pairing of reads (for a command-line example, see: C: Calculating number of reads for paired end reads?). You could also use this as a diagnostic tool instead of comparing read names manually. If the files are in sync, no changes will be made.
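
To make concrete what re-pairing by name means, here is a minimal in-memory sketch in Python. It is not how repair.sh is implemented; the function names (`pair_key`, `fastq_records`, `repair_pairs`) are made up for illustration, and it assumes well-formed 4-line FASTQ records whose shared read name is the first whitespace-delimited token of the header.

```python
def pair_key(header):
    """Return the part of a FASTQ header that both mates share.

    Assumption: the shared name is the first whitespace-delimited token,
    optionally followed by an old-style /1 or /2 suffix.
    """
    fields = header.split()
    name = fields[0] if fields else ""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def fastq_records(lines):
    """Yield (pair_key, record) for each complete 4-line FASTQ record."""
    for record in zip(*[iter(lines)] * 4):
        yield pair_key(record[0]), record


def repair_pairs(r1_lines, r2_lines):
    """Re-pair reads by name, keeping the order of R1.

    Returns (pairs, singletons). All of R2 is held in memory, so this is
    only an illustration for small inputs.
    """
    r2_by_name = dict(fastq_records(r2_lines))
    pairs, singletons = [], []
    for name, rec1 in fastq_records(r1_lines):
        rec2 = r2_by_name.pop(name, None)
        if rec2 is None:
            singletons.append(rec1)          # mate missing from R2
        else:
            pairs.append((rec1, rec2))
    singletons.extend(r2_by_name.values())   # R2 reads with no mate in R1
    return pairs, singletons
```

Note that when both files contain the same reads in a different order, every read finds its mate and the singletons list stays empty, even though the input was out of sync.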

ADD COMMENT • link 7.1 years ago by GenoMax 144k

Hi, can this also happen with the original files that the sequencer gives you as output? Thanks

ADD REPLY • link 7.1 years ago by riccardo ▴ 90

If no trimming has been done on the data, then the files should be in sync. If in doubt, run repair.sh to be sure. If the data is in sync, nothing should appear in the singletons file.

ADD REPLY • link 7.1 years ago by GenoMax 144k

Hi, I don't think it necessarily follows that R1 and R2 are in sync if nothing appears in the singletons file. I have good-quality sequencing data that required no trimming, and the reads at the start of the files were in sync. However, a few reads in the middle of the files were out of sync. You can't always tell at face value whether R1 and R2 are in sync until you hit an error during the alignment step, such as:

[mem_sam_pe] paired reads have different names: "A00804:41:HNJ53DSXX:2:1165:1362:17660", "A00804:41:HNJ53DSXX:2:1145:26630:1047"
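
A mismatch in the middle of the files, like the one in this error message, can be located before alignment by streaming both files and comparing names record by record. The following is a minimal Python sketch under stated assumptions: well-formed 4-line FASTQ records, and a read name taken as the first whitespace-delimited token of the header. The function names are hypothetical.

```python
from itertools import zip_longest


def read_name(header):
    """Return the read name: first token of the header, without '@' or /1, /2."""
    fields = header.split()
    name = fields[0][1:] if fields else ""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def names(lines):
    """Yield the read name of every 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:          # header lines are lines 0, 4, 8, ...
            yield read_name(line)


def first_mismatch(r1_lines, r2_lines):
    """Return (record_number, name1, name2) for the first differing pair.

    Record numbers start at 1. If one file is shorter, the missing name is
    reported as None. Returns None when every record matches.
    """
    paired = zip_longest(names(r1_lines), names(r2_lines))
    for number, (name1, name2) in enumerate(paired, start=1):
        if name1 != name2:
            return number, name1, name2
    return None


# Hypothetical usage (file names are placeholders):
# import gzip
# with gzip.open("sample_R1.fastq.gz", "rt") as r1, \
#         gzip.open("sample_R2.fastq.gz", "rt") as r2:
#     print(first_mismatch(r1, r2))
```

Because only one header per file is held at a time, memory use stays flat regardless of file size.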

A more general question that comes to mind, and that I haven't found an answer to: is this a sequencing defect, or did something go haywire during demultiplexing? I ask because I have this issue for all the samples that were run on a single flow cell. Quite strange though!

ADD REPLY • link 4.0 years ago by rohitsatyam102 ▴ 900

I don't think it necessarily follows that R1 and R2 are in sync if nothing appears in the singletons file

That should not happen if you are using the repair.sh tool. If your files are not in sync, it should flag them.

If you have reads where the relevant part of the identifiers (e.g. 1:Y:18:ATCACG) has been stripped from the FASTQ headers, it would be difficult for any program to tell whether the reads are out of sync.
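
To show which part of a header each piece of information lives in, here is a small Python sketch that splits a header into the read name and the comment fields. It assumes the Illumina CASAVA 1.8+ layout (`@name read:filtered:control:index`); the `IlluminaHeader` type and `parse_header` function are hypothetical names, and the header in the usage note is assembled from fragments quoted in this thread purely for illustration.

```python
from typing import NamedTuple, Optional


class IlluminaHeader(NamedTuple):
    name: str                   # instrument:run:flowcell:lane:tile:x:y
    read: Optional[str]         # "1" or "2"
    is_filtered: Optional[str]  # "Y" or "N"
    control: Optional[str]      # control number
    index_seq: Optional[str]    # index (barcode) sequence


def parse_header(line):
    """Split a FASTQ header into the read name and the comment fields.

    Assumes the layout '@name read:filtered:control:index'. If the comment
    is absent or has a different shape, the comment fields are None.
    """
    fields = line.strip().split(maxsplit=1)
    if not fields or not fields[0].startswith("@"):
        raise ValueError(f"not a FASTQ header: {line!r}")
    name = fields[0][1:]
    parts = fields[1].split(":", 3) if len(fields) == 2 else []
    if len(parts) != 4:
        return IlluminaHeader(name, None, None, None, None)
    return IlluminaHeader(name, *parts)
```

For example, `parse_header("@A00804:41:HNJ53DSXX:2:1165:1362:17660 1:Y:18:ATCACG")` gives a `name` of `A00804:41:HNJ53DSXX:2:1165:1362:17660` and a `read` of `"1"`. Under this layout, the two mates of a pair have the same `name` and differ in `read`.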

is this a sequencing defect, or did something go haywire during demultiplexing?

Are you referring to the original read files? Was no manipulation done to them after they came off the sequencer/demultiplexing, before you started these alignments?

ADD REPLY • link 4.0 years ago by GenoMax 144k

Precisely. I have files where no manipulation was done post demultiplexing. When I align them to the reference, I get the error "paired reads have different names". I used the repair.sh script to reorder the files. My singleton files (for all 4 samples) are empty. However, post repair.sh the error disappears.

Since I didn't perform any preprocessing on the FASTQ files and went straight to alignment, I suspect something fishy happened during demultiplexing. But I don't have any evidence or explanation for why it would happen during demultiplexing.

ADD REPLY • link 4.0 years ago by rohitsatyam102 ▴ 900

However, post repair.sh the error disappears.

So repair.sh does work as intended. Errors like this are very rare in the output of bcl2fastq. One speculation I have is that these files were written to a file system that was not performant and may not have kept up with the processes writing the output files.

But your point is well taken. In this specific instance, singleton files will be empty, after repair.sh does its job.

ADD REPLY • link 4.0 years ago by GenoMax 144k