Question

Can someone please help me understand why my fastq file looks like this?


19 days ago

DNAngel ▴ 260

I am trying to run the QIIME DADA2 pipeline on a set of samples from a previously published study. I downloaded their raw fastq sequence data, but the run keeps failing at the trimming step (I am using 0 truncation and 0 trimming because the visualization shows Phred 30 across the entire read for all reads).

Here's what a fastq file looks like. I don't understand why there are duplicate reads, or why many lines contain "???" instead of a sequence.

@SRR20667927.1 1 length=251
TACGGAGGGTGCTAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTAATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAATACTGGTTAGCTAGAGTCTTTTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGGCAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCACACA +SRR20667927.1 1 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.1 1 length=251
NCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTCTGTCCAGGGGGCCGCCTTCGCCAACGGTATTCCTCCACATCTCTACGCATTTCACCGCTACCCATGGAATTCTACCCCCCTCTACAAGACTCTAGCTAACCAGTCTGGAATGCCATTCCCACGTTAAGCCCGGGGATTTCACATCCAACTTAATTAACCGCCTGCGTGCCCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTGCG +SRR20667927.1 1 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.2 2 length=251
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTTATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAAGACTGGTTAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCAAACA +SRR20667927.2 2 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.2 2 length=251
NCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTCTCTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACATGGAATTCTACCCCCCTCTACAAGACTCTAGCTAACCAGTCTTGAATGCCATTCCCAGGTTAAGCCCGGGGATTTCACATCCAACTTAATTAACCGCCTGCGTGCCCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTCCG +SRR20667927.2 2 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.3 3 length=251
GATAAACTTGCCTTGAGAAGTGAAACTGAGTTCTAAGAAAGAAGTACGTGCGTAGATGGAAGATTAAAAATAATCGACGTACAAGATGGAAAAAAGGAGAGATGTTTTAATTCGATCCGTAAGCACCGTTACGGTCGTATTAAGATTCCAGGCTTTTTGACTTCACTGCAACTCGCCGTAAATACGTATCAGCTGTGACGAATGGGAGCGTGTTTATTACGACACTAACAGCTTCACCAATCAATGATTAG +SRR20667927.3 3 length=251 ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

The error I get in QIIME is the following:

R version 4.3.3 (2024-02-29)
Loading required package: Rcpp
DADA2: 1.30.0 / Rcpp: 1.0.13 / RcppParallel: 5.1.9
2) Filtering ........................................................
3) Learning Error Rates
265188844 total bases in 1056776 reads from 14 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

Can someone please help? I have been stuck on this for days, and I have never had this issue before. I know the data is paired, but it looks like it was uploaded to NCBI as already merged R1 and R2 sequences, so I have one fastq file per sample. I imported the data with:

--input-format SingleEndFastqManifestPhred33V2

I don't think I need to separate these reads because I have never had to do that before with paired-end fastq data that I have received as a single file.


updated 18 days ago by GenoMax 154k • written 19 days ago by DNAngel ▴ 260

score 1 · Answer 1 · 2025-11-13

The individual reads look fine, as they are in fastq format (LINK). I am not sure why you have duplicates, though.

@SRR20667927.1 1 length=251
TACGGAGGGTGCTAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTAATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAATACTGGTTAGCTAGAGTCTTTTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGGCAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCACACA 
+SRR20667927.1 1 length=251
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

The four lines of a fastq record should each be on a separate line. Check that this is how they appear in the input file. If you moved the files between a Windows machine and Unix/macOS, the line endings may have been altered. The dos2unix utility can generally take care of that.
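A quick way to check both points (four lines per record, and no Windows line endings) is a small script. The following is a minimal sketch, not a validated tool: the function name `check_fastq_layout` is made up for illustration, and it assumes the whole file fits in memory.

```python
def check_fastq_layout(data: bytes) -> dict:
    """Report line-ending style and whether records follow the 4-line layout."""
    has_crlf = b"\r\n" in data
    lines = data.replace(b"\r\n", b"\n").split(b"\n")
    # Drop the empty element left behind by the final newline.
    if lines and lines[-1] == b"":
        lines.pop()
    bad_records = []
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        ok = (
            len(record) == 4
            and record[0].startswith(b"@")        # header line
            and record[2].startswith(b"+")        # separator line
            and len(record[1]) == len(record[3])  # sequence vs. quality length
        )
        if not ok:
            bad_records.append(i // 4 + 1)  # 1-based record number
    return {
        "crlf": has_crlf,
        "records": (len(lines) + 3) // 4,
        "bad_records": bad_records,
    }


# A well-formed record, and the same record with Windows line endings.
good = b"@r1\nACGT\n+\n????\n"
print(check_fastq_layout(good))
print(check_fastq_layout(good.replace(b"\n", b"\r\n")))
```

For a real file, pass in the bytes read in binary mode (or via `gzip.open` for a `.fastq.gz`). If `crlf` comes back true, dos2unix is the fix. If `bad_records` is non-empty, the records are not laid out as four lines each.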

the visualization shows phred 30 across the entire read for all reads

Your data is in SRA Lite format, which replaces the original per-base quality scores with a single value per read. From the NCBI documentation:

SRA Lite files will be the same for each base within a given read (quality = 30 or 3, depending on whether the Read Filter flag is set to pass or reject). Data in the SRA Normalized Format will continue to have a .sra file extension, while the SRA Lite files have a .sralite file extension.
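This also explains the "???" lines in the question: they are quality strings, not sequences. In Phred+33 encoding, "?" is ASCII 63, and 63 - 33 = 30. A minimal sketch of the decoding follows; the helper names are made up for illustration, and the SRA Lite check is only a heuristic based on the description quoted above.

```python
def phred33_scores(quality_line: str) -> list[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_line]


def looks_like_sra_lite(quality_lines) -> bool:
    """Heuristic: every read carries one constant score, and it is 30 or 3."""
    for line in quality_lines:
        scores = set(phred33_scores(line))
        if len(scores) != 1 or not scores <= {30, 3}:
            return False
    return True


print(phred33_scores("????"))               # [30, 30, 30, 30]
print(looks_like_sra_lite(["????", "$$$$"]))  # True
print(looks_like_sra_lite(["?I5#"]))          # False
```

Flat scores like these carry no per-base information, which may also be relevant to the "Error rates could not be estimated" message, since DADA2 learns its error model from the quality scores.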

I know the data is paired but it looks like they had uploaded it to NCBI as already merged R1 and R2 sequences

That is not correct. It looks like you dumped the data without splitting the reads into separate files, so the forward and reverse reads ended up as consecutive records with the same name. The SRA record shows two separate reads per spot: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR20667927&display=reads
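If re-dumping or re-downloading is not convenient, the existing file can in principle be split after the fact. The sketch below assumes the layout shown in the question: well-formed four-line records, with each forward read immediately followed by its reverse mate under the same read name. The function names are illustrative, not part of any library.

```python
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as 4-tuples of lines with line endings stripped."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\r\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def iter_pairs(lines):
    """Yield (forward, reverse) records from adjacent same-named records."""
    records = read_records(lines)
    for first in records:
        second = next(records, None)
        # Compare only the read name, e.g. "@SRR20667927.1".
        if second is None or first[0].split()[0] != second[0].split()[0]:
            raise ValueError(f"no mate found for {first[0]}")
        yield first, second


example = [
    "@SRR20667927.1 1 length=4\n", "TACG\n", "+SRR20667927.1 1 length=4\n", "????\n",
    "@SRR20667927.1 1 length=4\n", "NCTG\n", "+SRR20667927.1 1 length=4\n", "????\n",
]
for forward, reverse in iter_pairs(example):
    print(forward[1], reverse[1])  # TACG NCTG
```

To use it on a real file, iterate over an open text handle and write `"\n".join(forward) + "\n"` to an R1 file and the same for `reverse` to an R2 file. The function raises an error as soon as a read has no adjacent mate, rather than silently producing mismatched files.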

Just get the fastq reads directly from EBI:

https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/027/SRR20667927/SRR20667927_1.fastq.gz
https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/027/SRR20667927/SRR20667927_2.fastq.gz
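With separate _1 and _2 files, the import would presumably switch from the single-end manifest format to the paired-end one (PairedEndFastqManifestPhred33V2), whose manifest is a tab-separated file with the columns sample-id, forward-absolute-filepath and reverse-absolute-filepath. Here is a minimal sketch for generating such a manifest, assuming all downloads sit in one directory and follow the EBI naming shown above; `build_manifest` and the directory name are hypothetical.

```python
import csv
import io
from pathlib import Path


def build_manifest(fastq_dir: Path) -> str:
    """Build a paired-end manifest (TSV text) from *_1/_2.fastq.gz files."""
    suffix = "_1.fastq.gz"
    rows = []
    for fwd in sorted(fastq_dir.glob(f"*{suffix}")):
        sample = fwd.name[: -len(suffix)]
        rev = fwd.with_name(f"{sample}_2.fastq.gz")
        if not rev.exists():
            raise FileNotFoundError(f"missing reverse reads for {sample}")
        # The manifest format expects absolute paths.
        rows.append((sample, str(fwd.resolve()), str(rev.resolve())))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ("sample-id", "forward-absolute-filepath", "reverse-absolute-filepath")
    )
    writer.writerows(rows)
    return out.getvalue()


# Example (hypothetical directory name):
# Path("manifest.tsv").write_text(build_manifest(Path("ebi_fastq")))
```

The run accession is used as the sample ID here only for convenience; substitute the study's real sample names if you have them.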