I want to download PacBio RSII data with accession number SRR6037732 for further analysis. In NCBI, there are both SRA archive files and original format data available.
How can I identify which file contains subreads without PacBio adapter contamination?
Under the original format section, there is a file listed as type pacbio_native, available at this AWS link: https://sra-pub-src-1.s3.amazonaws.com/SRR6037732/D1_filtered_subreads.fastq.gz.1. Since the file is named "filtered_subreads", can I assume it is free from PacBio adapter contamination?
A similar file is available on the ENA, named SRR6037732_subreads.fastq.gz. Is this file identical to the D1_filtered_subreads.fastq.gz.1 file from NCBI?
Both places should have checksums available, so you could compare them to see whether the two files are the same.
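If the published checksums are hard to find, or you want to check your own downloads, the comparison can be done locally. This is only a minimal sketch: the function names `md5_of_file` and `same_content` are made up for illustration, and MD5 is an assumption about which checksum the archives publish. Two gzip files can hold identical data and still differ byte for byte (the gzip header can store a timestamp and file name), so the sketch can also hash the decompressed stream.

```python
import gzip
import hashlib


def md5_of_file(path, decompress=False, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use low.

    With decompress=True the gzip stream is decompressed first, so the
    digest describes the FASTQ content rather than the compressed bytes.
    """
    opener = gzip.open if decompress else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b, decompress=False):
    """True if both files have the same MD5 (raw or decompressed)."""
    return md5_of_file(path_a, decompress) == md5_of_file(path_b, decompress)
```

For example, `same_content("D1_filtered_subreads.fastq.gz.1", "SRR6037732_subreads.fastq.gz", decompress=True)` would compare the two downloads by content. A mismatch does not prove the reads differ: an archive may have rewritten the read names, in which case the sequences can still be the same.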
Thanks for pointing me to check their checksums.
How can I ensure that the subread file is free of PacBio adapter contamination, apart from checking with FastQC? Although the FastQC results show no obvious adapter contamination, I’m concerned there might be residual adapters that were not detected, which could potentially affect the assembly process. Since the data from the PacBio RSII system undergoes primary read preprocessing onboard the instrument, does that mean the output is guaranteed to be free of adapter contamination?
While the file name seems to indicate the data is filtered, you can use lima to confirm that. You will need to know which library prep method was used for your data. You can also use one of the workflows PacBio provides, if it fits your needs: https://github.com/PacificBiosciences
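As a quick independent sanity check, you could also count how many reads contain a piece of the adapter. The sketch below is a rough screen under several assumptions, not a substitute for a dedicated tool: the adapter sequence is not given in this thread and has to be supplied by you for your library prep, the function names are hypothetical, the FASTQ is assumed to use four lines per record, and matching is exact on short seeds, so adapters carrying sequencing errors (common in RSII reads) can be missed.

```python
import gzip

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def iter_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record.

    Gzip is detected from the magic bytes rather than the file name,
    because names such as 'reads.fastq.gz.1' do not end in '.gz'.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip().upper()


def count_adapter_hits(path, adapter, seed_length=12):
    """Count reads containing any exact seed of the adapter (either strand).

    `adapter` is supplied by the caller; `seed_length` is an arbitrary
    illustrative default. Returns (reads_with_hit, total_reads).
    """
    adapter = adapter.upper()
    seed_length = min(seed_length, len(adapter))
    seeds = set()
    for strand in (adapter, reverse_complement(adapter)):
        for start in range(len(strand) - seed_length + 1):
            seeds.add(strand[start:start + seed_length])
    hits = total = 0
    for seq in iter_fastq_sequences(path):
        total += 1
        if any(seed in seq for seed in seeds):
            hits += 1
    return hits, total
```

Zero hits from a screen like this lowers the odds of intact adapters being present, but it cannot guarantee the reads are clean.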