Question

Automatically detecting paired end reads.

0

Entering edit mode

10 months ago

sam.himes92 • 0

I am creating a pipeline that will run some analysis on fastq.gz files. As part of my pipeline I need to check to see if the fastq files provided are paired end or single end reads. If they are paired end reads I run fast-join on them, otherwise they just get fed directly into my pipeline.

My question is what is the best way of auto-detecting paired end fastq files?

So far, this is the method that I've been using.

Start by looking at all the file names. In my case they look like this.

run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L001_R1_001_subset.fastq
run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L002_R1_001_subset.fastq
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R1_001_subset.fastq
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R2_001_subset.fastq

In this example set of names. Only sample 8 (the last two files) should be paired. This is indicated by the file names matching except for the R1 and R2 section. My script compares all the file names. If there is only 1 difference between the two file names AND that one difference is between the R1 and R2 section those files are designated as paired end and will be fed into fast-join. So the sample 1 files (the first two files) wouldn't be paired because while there is only one difference between the file names, the difference isn't in the R section.

This method works well enough for me now. But I don't know if it would work well with other fastq file name formats. I'm curious if there is a more standard approach. Any feedback/advice would be greatly appreciated.

pipeline automation paired-end single-end fastq • 866 views

ADD COMMENT • link updated 10 months ago by Ram 44k • written 10 months ago by sam.himes92 • 0

1

Entering edit mode

It is unusual to have a mixed dataset like this. Does this actually indicate SE and PE mixed dataset?

ADD REPLY • link 10 months ago by GenoMax 142k

0

Entering edit mode

You're right. Most of the time the data will either be all SE or PE. But there are some edge cases where this isn't the case. For example, the DNA samples were PE and the RNA samples were SE. I just want my pipeline to robust enough to handle any such case with little to no requirement of action on the end of the user.

ADD REPLY • link 10 months ago by sam.himes92 • 0

score 1 · Accepted Answer · 2023-06-28

1

Entering edit mode

10 months ago

Ram 44k

I'd pick all the R1 fastqs and check for an existence of corresponding R2s:

for f in *_R1_*
do
  if [ -f ${f/_R1_/_R2_} ]; then
    ## Paired end logic here
  else
    ## Single end logic here
  fi
done

ADD COMMENT • link 10 months ago by Ram 44k

0

Entering edit mode

That is essentially what I'm doing. I just worry that other fastq file name formats may not work with this approach. Is doing what you described a common practice when it comes to auto-detecting paired reads?

ADD REPLY • link 10 months ago by sam.himes92 • 0

0

Entering edit mode

It is common convention to name reads _R1_001 and _R2_001 (in current gen short read sequencing tech at least). I think there used to be further divisions earlier - that's what the last three digits are for, but now that doesn't happen. You should be fine with this unless you're dealing with special datasets such as scATAC-seq (where the paired reads are in _R1_ and _R3_ files).

ADD REPLY • link 10 months ago by Ram 44k

0

Entering edit mode

Great! That helps me to rest easy. Thank you!

ADD REPLY • link 10 months ago by sam.himes92 • 0

0

Entering edit mode

Please accept my answer to mark the post as solved.

ADD REPLY • link 10 months ago by Ram 44k