Can someone please help me understand why my fastq file looks like this?
19 days ago
DNAngel ▴ 260

I am trying to run the QIIME DADA2 pipeline on a bunch of samples from a previously published study. I got their raw fastq sequence data, but it keeps failing at the trimming step (I am using 0 truncation and 0 trimming because the visualization shows Phred 30 across the entire read for all reads).

Here's what a fastq file looks like, and I don't understand why there are duplicate reads, many with "???" instead of a sequence.

@SRR20667927.1 1 length=251
TACGGAGGGTGCTAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTAATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAATACTGGTTAGCTAGAGTCTTTTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGGCAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCACACA +SRR20667927.1 1 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.1 1 length=251
NCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTCTGTCCAGGGGGCCGCCTTCGCCAACGGTATTCCTCCACATCTCTACGCATTTCACCGCTACCCATGGAATTCTACCCCCCTCTACAAGACTCTAGCTAACCAGTCTGGAATGCCATTCCCACGTTAAGCCCGGGGATTTCACATCCAACTTAATTAACCGCCTGCGTGCCCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTGCG +SRR20667927.1 1 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.2 2 length=251
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTTATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAAGACTGGTTAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCAAACA +SRR20667927.2 2 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.2 2 length=251
NCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTCTCTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACATGGAATTCTACCCCCCTCTACAAGACTCTAGCTAACCAGTCTTGAATGCCATTCCCAGGTTAAGCCCGGGGATTTCACATCCAACTTAATTAACCGCCTGCGTGCCCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTCCG +SRR20667927.2 2 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.3 3 length=251
GATAAACTTGCCTTGAGAAGTGAAACTGAGTTCTAAGAAAGAAGTACGTGCGTAGATGGAAGATTAAAAATAATCGACGTACAAGATGGAAAAAAGGAGAGATGTTTTAATTCGATCCGTAAGCACCGTTACGGTCGTATTAAGATTCCAGGCTTTTTGACTTCACTGCAACTCGCCGTAAATACGTATCAGCTGTGACGAATGGGAGCGTGTTTATTACGACACTAACAGCTTCACCAATCAATGATTAG +SRR20667927.3 3 length=251 ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

The error I get in QIIME is the following:

R version 4.3.3 (2024-02-29)
Loading required package: Rcpp
DADA2: 1.30.0 / Rcpp: 1.0.13 / RcppParallel: 5.1.9
2) Filtering ........................................................
3) Learning Error Rates
265188844 total bases in 1056776 reads from 14 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

Please, can someone help? I have been stuck on this for days and have never had issues before. I know the data is paired, but it looks like it was uploaded to NCBI as already-merged R1 and R2 sequences, so for each sample I have one fastq file. I have used the

--input-format SingleEndFastqManifestPhred33V2

I don't think I need to separate these reads, because I have never had to do that before with paired-end fastq data that I received as a single file.

19 days ago
GenoMax 154k

Individual reads look fine as they are in fastq format (LINK). Not sure why you have duplicates, though.

@SRR20667927.1 1 length=251
TACGGAGGGTGCTAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTAATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAATACTGGTTAGCTAGAGTCTTTTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGGCAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCACACA 
+SRR20667927.1 1 length=251
??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? 

The four lines of a fastq record should be on separate lines. You should check that this is how they appear in the input file. If you moved the files between a Windows machine and Unix/macOS, you may have mangled the line endings; generally, the dos2unix utility can take care of that.
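To check both things at once, here is a minimal sketch (the function name and the exact problem messages are my own, not from any library) that scans raw fastq lines for Windows line endings and for records that do not follow the four-line layout:

```python
def check_fastq_lines(lines):
    """Return a list of problems found in raw fastq lines (a sketch:
    assumes the file is meant to be plain 4-line-per-record fastq)."""
    problems = []
    # 1) Windows (\r\n) line endings -- the usual dos2unix fix applies.
    for i, line in enumerate(lines, start=1):
        if line.endswith("\r\n"):
            problems.append(f"line {i}: Windows line ending (run dos2unix)")
    # 2) Each record should be exactly four lines: @header, sequence, +, quality.
    records = [l.rstrip("\r\n") for l in lines if l.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, plus, qual = records[i:i + 4]
        rec_no = i // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {rec_no}: header does not start with '@'")
        elif not plus.startswith("+"):
            problems.append(f"record {rec_no}: third line does not start with '+'")
        elif len(seq) != len(qual):
            problems.append(f"record {rec_no}: sequence/quality length mismatch")
    return problems
```

Running it over the file shown in the question would flag the run-together records immediately, since the sequence and quality end up on the same physical line.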


the visualization shows phred 30 across the entire read for all reads

You have data in SRA Lite format, which does the following:

The quality scores in SRA Lite files will be the same for each base within a given read (quality = 30 or 3, depending on whether the Read Filter flag is set to pass or reject). Data in the SRA Normalized Format will continue to have a .sra file extension, while the SRA Lite files have a .sralite file extension.
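This also explains the runs of "?" in your file: under the standard Phred+33 encoding, "?" (ASCII 63) decodes to quality 30, which is exactly the flat Phred 30 your visualization shows. A one-liner to confirm:

```python
def phred33_to_quality(char):
    """Decode a single Phred+33 quality character to its numeric score."""
    return ord(char) - 33

# SRA Lite "pass" reads use '?' for every base; a "reject" read would use
# quality 3, which is '$' under Phred+33.
print(phred33_to_quality("?"))  # 30
print(phred33_to_quality("$"))  # 3
```

So the "???" lines are not missing sequences; they are the (flattened) quality strings.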


I know the data is paired but it looks like they had uploaded it to NCBI as already merged R1 and R2 sequences

That is not correct. It looks like you dumped the data out without splitting the reads into separate files. The SRA record shows two separate reads here --> https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR20667927&display=reads

Just get the fastq reads directly from EBI:

https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/027/SRR20667927/SRR20667927_1.fastq.gz
https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/027/SRR20667927/SRR20667927_2.fastq.gz
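If you have a whole list of run accessions, the EBI path can be built programmatically. This sketch follows ENA's documented FTP layout (first six characters of the accession, plus a zero-padded subdirectory from the trailing digits for accessions longer than nine characters); verify the generated URLs against the browser before bulk use:

```python
def ena_fastq_urls(acc, paired=True):
    """Build ENA/EBI fastq download URLs from a run accession.
    A sketch of ENA's FTP path layout; check edge cases before trusting it."""
    base = "https://ftp.sra.ebi.ac.uk/vol1/fastq"
    prefix = acc[:6]              # e.g. "SRR206"
    if len(acc) == 9:
        subdir = ""               # nine-character accessions have no extra level
    else:
        # trailing digits beyond nine characters, zero-padded to three: "27" -> "027"
        subdir = "/" + acc[9:].zfill(3)
    stem = f"{base}/{prefix}{subdir}/{acc}"
    if paired:
        return [f"{stem}/{acc}_1.fastq.gz", f"{stem}/{acc}_2.fastq.gz"]
    return [f"{stem}/{acc}.fastq.gz"]
```

For `SRR20667927` this reproduces the two URLs above.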

How did you find out how to get the fastq reads directly like that? I was looking and looking, and the only information I had was to either use the SRA Toolkit, which I did not need because there are so few sequences, or manually download the sequences from here (as an example for one sequence): https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR20667907&display=metadata.

So when going to the Download tab and hitting download Fastq, what is the difference? Sorry, it's my first time, so I was following instructions from here to get fastq sequences given only a BioProject ID: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/

And I followed instructions from "Download sequence data from the Run Browser" after I had obtained the accession ID list from the BioProject.


I guess that even if I have a list of accession IDs for the samples I want, I still don't understand how to get the proper fastq files given only a BioProject ID. I thought what I did earlier was correct, but it looks like it is not.


How did you find out how to get the fastq reads directly like that?

You can use https://sra-explorer.info/ to get direct fastq download links for data from EBI (for most accessions). Here are directions on how to use the tool: sra-explorer : find SRA and FastQ download URLs in a couple of clicks

Looks like downloading the data from https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR20667907&display=download results in the read duplicates. I assume R1 and R2 follow each other in that single file.
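If re-downloading from EBI is not an option, and the single file really does hold R1 and R2 as alternating records (an assumption — check the paired headers like `SRR20667927.1` appearing twice before trusting this), a minimal de-interleaving sketch looks like:

```python
def deinterleave(lines):
    """Split alternating four-line fastq records into (r1_lines, r2_lines).
    Assumes strict R1/R2 interleaving -- verify against the headers first."""
    r1, r2 = [], []
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    for n, rec in enumerate(records):
        (r1 if n % 2 == 0 else r2).extend(rec)
    return r1, r2
```

That said, downloading the already-split `_1.fastq.gz` / `_2.fastq.gz` files from EBI is the safer route, since it avoids guessing at the dump layout.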

