I am trying to run the QIIME DADA2 pipeline on a set of samples from a previously published study. I downloaded their raw FASTQ sequence data, but the run keeps failing at the trimming step (I am using 0 truncation and 0 trimming because the visualization shows Phred 30 across the entire read for all reads).
Here's what a FASTQ file looks like. I don't understand why there are duplicate reads, many with "???" instead of a sequence.
@SRR20667927.1 1 length=251
TACGGAGGGTGCTAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTAATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAATACTGGTTAGCTAGAGTCTTTTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGGCAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCACACA +SRR20667927.1 1 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.1 1 length=251
NCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTCTGTCCAGGGGGCCGCCTTCGCCAACGGTATTCCTCCACATCTCTACGCATTTCACCGCTACCCATGGAATTCTACCCCCCTCTACAAGACTCTAGCTAACCAGTCTGGAATGCCATTCCCACGTTAAGCCCGGGGATTTCACATCCAACTTAATTAACCGCCTGCGTGCCCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTGCG +SRR20667927.1 1 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.2 2 length=251
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGGGCACGCAGGCGGTTTATTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAATGGCATTCAAGACTGGTTAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAGAGACTGACGCTCATGTGCGAAAGCGTGGGGAGCAAACA +SRR20667927.2 2 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.2 2 length=251
NCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTCTCTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACATGGAATTCTACCCCCCTCTACAAGACTCTAGCTAACCAGTCTTGAATGCCATTCCCAGGTTAAGCCCGGGGATTTCACATCCAACTTAATTAACCGCCTGCGTGCCCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTCCG +SRR20667927.2 2 length=251 ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? @SRR20667927.3 3 length=251
GATAAACTTGCCTTGAGAAGTGAAACTGAGTTCTAAGAAAGAAGTACGTGCGTAGATGGAAGATTAAAAATAATCGACGTACAAGATGGAAAAAAGGAGAGATGTTTTAATTCGATCCGTAAGCACCGTTACGGTCGTATTAAGATTCCAGGCTTTTTGACTTCACTGCAACTCGCCGTAAATACGTATCAGCTGTGACGAATGGGAGCGTGTTTATTACGACACTAACAGCTTCACCAATCAATGATTAG +SRR20667927.3 3 length=251 ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
The error I get in QIIME is the following:
R version 4.3.3 (2024-02-29)
Loading required package: Rcpp
DADA2: 1.30.0 / Rcpp: 1.0.13 / RcppParallel: 5.1.9
2) Filtering ........................................................
3) Learning Error Rates
265188844 total bases in 1056776 reads from 14 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.
Can someone please help? I have been stuck on this for days, and I have never had this issue before. I know the data is paired-end, but it looks like the authors uploaded it to NCBI with R1 and R2 already merged, so I have one FASTQ file per sample. I imported the data with
--input-format SingleEndFastqManifestPhred33V2
I don't think I need to separate these reads because I have never had to do that before with paired-end fastq data that I have received as a single file.
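One thing that can be checked directly is the "???" line. In a Phred+33 FASTQ file the character `?` (ASCII 63) decodes to a score of 30, which matches the flat Phred 30 in the quality plot. The sketch below assumes the file on disk has the usual four lines per record; the function names are made up for illustration. If every base in the file has the same score, the qualities are likely placeholders rather than the sequencer's real values, which may be why DADA2 cannot learn error rates, although that is not confirmed for this dataset.

```python
from collections import Counter


def phred33_scores(quality_line):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_line.strip()]


def quality_is_constant(fastq_lines):
    """Check whether every base in a 4-line-per-record FASTQ has the same score.

    Returns (is_constant, counts), where counts maps each score to the
    number of bases carrying it.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # the fourth line of each record is the quality string
            counts.update(phred33_scores(line))
    return len(counts) == 1, counts
```

It can be run on an open file handle, for example `quality_is_constant(open("sample.fastq"))`, where `sample.fastq` is a placeholder name.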
How did you find out how to get the FASTQ reads directly like that? I looked and looked, and the only option I found was SRA Toolkit, which I did not think I needed: there are so few samples that I could download them manually from the Run Browser (as an example for one run): https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR20667907&display=metadata.
So what is different about going to the Download tab and hitting "Download Fastq"? Sorry, it's my first time, so I was following the instructions here for getting FASTQ sequences given only a BioProject ID: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/
And I followed instructions from "Download sequence data from the Run Browser" after I had obtained the accession ID list from the BioProject.
I guess what I still don't understand is how to get the proper FASTQ files when all I have is a BioProject ID and the list of accession IDs for the samples I want. I thought what I did earlier was correct, but it looks like it was not.
You can use https://sra-explorer.info/ to get direct FASTQ download links for data from EBI (for most accessions). Directions for using the tool are in the post "sra-explorer: find SRA and FastQ download URLs in a couple of clicks".
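If you would rather script the lookup than click through the site, the following is a rough sketch that asks the ENA Portal API for the same kind of links. It assumes the `filereport` endpoint with the `read_run` result type and the `fastq_ftp` field, which returns semicolon-separated paths without a scheme; the function names are my own, and this is not necessarily how sra-explorer works internally.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the file report query URL for a run or project accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ download URLs."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    links = {}
    for row in rows:
        # fastq_ftp holds zero or more paths separated by semicolons.
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        links[row["run_accession"]] = ["ftp://" + p for p in paths]
    return links


def fetch_fastq_links(accession):
    """Query ENA (network access required) and return the parsed links."""
    with urlopen(filereport_url(accession)) as response:
        return parse_filereport(response.read().decode("utf-8"))
```

Calling `fetch_fastq_links("SRR20667907")` should return a dictionary keyed by run accession. For paired-end runs I would expect two files per run, but I have not checked that for this study.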
It looks like downloading the data from https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR20667907&display=download is what produces the duplicated reads. I assume R1 and R2 follow each other (interleaved) in that single file.
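If that assumption holds, the file could be split back into separate R1 and R2 files. Here is a minimal sketch, assuming four lines per record and that each forward read is immediately followed by its mate with the same read ID (as in the excerpt, where `SRR20667927.1` appears twice in a row). The function names and file paths are placeholders.

```python
def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ stream."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # an empty header line means end of file
            return
        yield tuple(record)


def read_id(header):
    """Return the read ID, e.g. 'SRR20667927.1' from '@SRR20667927.1 1 length=251'."""
    return header[1:].split()[0]


def split_interleaved(in_path, r1_path, r2_path):
    """Write alternating records to R1 and R2 files; return the number of pairs."""
    pairs = 0
    with open(in_path) as src, open(r1_path, "w") as r1, open(r2_path, "w") as r2:
        records = read_fastq(src)
        for first in records:
            second = next(records, None)
            # Stop loudly if the interleaving assumption does not hold.
            if second is None or read_id(first[0]) != read_id(second[0]):
                raise ValueError("no matching mate for read " + read_id(first[0]))
            r1.write("\n".join(first) + "\n")
            r2.write("\n".join(second) + "\n")
            pairs += 1
    return pairs
```

The two output files could then be imported as paired-end data (for example with `PairedEndFastqManifestPhred33V2`) instead of single-end. Note that splitting does not restore real quality scores if the download only carries placeholder values.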