Question

Forum: How relevant are read numbers in FastQ headers for paired reads?

1

24 months ago

Matthias Zepper 4.5k

Hello everybody,

May I ask for your opinion on the importance of having the read number correctly embedded in the FastQ headers?

For example:

@A00689:468:H2CKFDSX3:1:1101:25437:1016/1 and @A00689:468:H2CKFDSX3:1:1101:25437:1016/2

or

@A00689:468:H2CKFDSX3:1:1101:25437:1016 1:N:0 and @A00689:468:H2CKFDSX3:1:1101:25437:1016 2:N:0
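To make the two notations concrete, here is a minimal sketch of how a script could pull the read number out of either header style. The function name `parse_header` and the regular expressions are illustrative assumptions based only on the example headers above, not the parser of any particular tool.

```python
import re

# Newer style: "@<id> <read>:<filtered>:<control>[:<index>]"
_SPACE_STYLE = re.compile(
    r"^@(?P<id>\S+) (?P<read>\d):(?P<filtered>[YN]):(?P<control>\d+)(?::(?P<index>\S*))?$"
)
# Older style: "@<id>/<read>"
_SLASH_STYLE = re.compile(r"^@(?P<id>\S+)/(?P<read>\d)$")


def parse_header(header):
    """Return (read_id, read_number) for either header style (hypothetical helper)."""
    header = header.rstrip("\n")
    for pattern in (_SPACE_STYLE, _SLASH_STYLE):
        match = pattern.match(header)
        if match:
            return match.group("id"), int(match.group("read"))
    raise ValueError(f"unrecognised FastQ header: {header!r}")
```

Both styles yield the same read ID, so a read number of 3 only matters to code that actually inspects the second element of the returned tuple.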

Are there tools that rely on those patterns, or do you take a look at them when receiving new FastQs from the sequencing facility?

I am asking because we have started to use kits that have a UMI embedded in the sequencing adapter, which is read before the second read and output into a separate FastQ file. Because the UMI is read before the second read, bcl-convert embeds read number 2 in the headers of the UMI reads and 3 in the headers of the mate reads.

Therefore, we are wondering how big an issue this will be, e.g. whether it will cause downstream tools to malfunction or confuse bioinformaticians. Should the read number in your opinion be changed back to e.g. /1 and /2 or would /1 and /3 be fine as well?
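If changing the read number back turned out to be necessary, it could be done with a small header rewrite. The sketch below is only an illustration: `renumber_read` and its `mapping` argument are hypothetical names, and the patterns assume headers that look like the two examples in this post.

```python
import re

# "@<id> <read>:<filtered>:<control>..." (space-separated style)
_SPACE_STYLE = re.compile(r"^(?P<prefix>@\S+ )(?P<read>\d)(?P<suffix>:[YN]:\d+.*)$")
# "@<id>/<read>" (slash style); the suffix group is intentionally empty
_SLASH_STYLE = re.compile(r"^(?P<prefix>@\S+/)(?P<read>\d)(?P<suffix>)$")


def renumber_read(header, mapping):
    """Rewrite the read number of one header line, e.g. mapping={3: 2}.

    Headers that match neither style are returned unchanged.
    """
    header = header.rstrip("\n")
    for pattern in (_SPACE_STYLE, _SLASH_STYLE):
        match = pattern.match(header)
        if match:
            old = int(match.group("read"))
            new = mapping.get(old, old)  # numbers not in the mapping stay as they are
            return match.group("prefix") + str(new) + match.group("suffix")
    return header
```

Applied to every fourth line of the mate file with `mapping={3: 2}`, this would turn `3:N:0` into `2:N:0` and leave sequence and quality lines untouched.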

Sharing your opinion on this would be greatly appreciated! Thanks a lot

Matthias

header fastq number read • 1.5k views

score 2 · Accepted Answer · 2022-04-27

2

24 months ago

GenoMax 141k

> Are there tools that rely on those patterns, or do you take a look at them when receiving new FastQs from the sequencing facility?

Tools look at the FastQ headers to ensure that paired-end reads are next to each other when files are sorted (e.g. BAM); samtools and featureCounts are examples. Tools that can mark duplicate reads (optical/PCR), e.g. clumpify.sh from the BBMap suite, use the coordinates in those headers.

> Should the read number in your opinion be changed back to e.g. /1 and /2 or would /1 and /3 be fine as well?

That is the old style of Illumina identifiers and not actively used now. It is more a matter of software being aware of the reads containing UMI. Since UMIs can be in different places, dedicated tools like fgbio and UMI-tools handle them. UMIs will only be output in runs that actually use them.

0

Thank you very much for your insightful and quick response!

> Tools look at the FastQ headers to ensure that paired-end reads are next to each other when files are sorted (e.g. BAM); samtools and featureCounts are examples. Tools that can mark duplicate reads (optical/PCR), e.g. clumpify.sh from the BBMap suite, use the coordinates in those headers.

I guess I should just run a few tests with those tools. Thanks for pointing out which ones might be affected. But don't they rather rely on the lane:tile:x_pos:y_pos part of the read ID matching to verify pairs? In that case, might it be acceptable for the read number of the mate to be 3:N:0 instead of 2:N:0?
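The idea of verifying pairs by the ID part alone can be sketched as follows. This is an assumption about how such a check might look, not a description of what samtools, featureCounts or clumpify.sh actually do; `fragment_id` and `is_pair` are hypothetical names.

```python
import re


def fragment_id(header):
    """Reduce a header to the instrument:run:flowcell:lane:tile:x:y part.

    Drops the leading '@', everything after the first whitespace
    (e.g. '3:N:0') and a trailing '/<digit>' suffix.
    """
    name = header.strip().lstrip("@").split()[0]
    return re.sub(r"/\d$", "", name)


def is_pair(header_a, header_b):
    """True if both headers describe the same cluster, whatever their read numbers."""
    return fragment_id(header_a) == fragment_id(header_b)
```

Under this kind of comparison, a mate labelled `3:N:0` or `/3` pairs with its `1:N:0` or `/1` partner just as well as one labelled 2. Whether a given tool compares headers this way is exactly what the tests mentioned above would have to show.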

> That is the old style of Illumina identifiers and not actively used now.

I was only aware that there are different notations (sometimes even using an underscore), but didn't know which one is the current standard. Thanks!

> It is more a matter of software being aware of the reads containing UMI.

We discussed whether we should deliver the files with the UMIs already embedded, but eventually felt that delivering three FastQ files would be more flexible. It would still be possible to embed the UMIs afterwards as required by the tool of choice, whereas users not interested in using UMIs in their analysis could just ignore the third file.
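Embedding the UMIs later from the separate file could look roughly like the sketch below. It assumes the space-separated header style, that the read file and the UMI file list the same clusters in the same order, and that the downstream tool expects the UMI appended to the read ID with an underscore. The separator and the names `fastq_records` and `embed_umi` are assumptions; check the documentation of the tool you intend to use.

```python
import itertools


def fastq_records(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(itertools.islice(stripped, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FastQ record")
        yield record


def embed_umi(read_lines, umi_lines, separator="_"):
    """Yield read records whose ID carries the matching UMI sequence as a suffix."""
    pairs = itertools.zip_longest(fastq_records(read_lines), fastq_records(umi_lines))
    for read, umi in pairs:
        if read is None or umi is None:
            raise ValueError("read and UMI files differ in length")
        read_id, _, comment = read[0].partition(" ")
        umi_id = umi[0].partition(" ")[0]
        if read_id != umi_id:
            raise ValueError(f"ID mismatch: {read_id} vs {umi_id}")
        header = f"{read_id}{separator}{umi[1]}"
        if comment:
            header += " " + comment  # keep the original '1:N:0'-style comment
        yield (header, read[1], read[2], read[3])
```

Both arguments can be any iterables of lines, for example handles from `gzip.open(path, "rt")` for the R1 (or R3) file and the R2 UMI file.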

ADD REPLY • link 24 months ago by Matthias Zepper 4.5k

0

Do the files with UMI reads get the name I1, or do they get the name R2? If so, most software would probably key off those file names rather than the FastQ header. You are surely not running UMIs for every run you are doing?

ADD REPLY • link 24 months ago by GenoMax 141k

0

To know for sure I would need to ask, since I am unfortunately one of those bioinformaticians for whom sequencing data comes into existence as FastQ ;-). To the best of my knowledge, though, the indexes used for demultiplexing are read separately from the UMI.

When I get the files, they are called, for example, Sample_L001_R1_001.fastq.gz, Sample_L001_R2_001.fastq.gz and Sample_L001_R3_001.fastq.gz, where R2 contains the UMI.

Indeed, we do not read UMIs for every run, but since we are using the IDT adapters, they are often in there anyway. Therefore, the decision was to start sequencing them when appropriate, basically for all quantitative experiments. Apparently, this is now possible since Illumina upgraded their kits to contain enough reagents to run the regular cycle number plus UMIs. Before that, I was told, we often had to snatch a few cycles away from the reads, if we wanted to have UMIs, too.

ADD REPLY • link 24 months ago by Matthias Zepper 4.5k

1

IDT has a nice tech note available that details how xGen Prism data needs to be processed. I assume this is the kit you are referring to. It does require somewhat non-standard processing.