Question

Understanding NCBI vs ENA data

4 months ago

wes ▴ 90

I want to download the PacBio RSII data with accession number SRR6037732 for further analysis. On NCBI, both the SRA archive files and the original-format data are available.

How can I identify which file contains subreads without PacBio adapter contamination?

Under the original format section, there is a file listed as type pacbio_native, available at this AWS link: https://sra-pub-src-1.s3.amazonaws.com/SRR6037732/D1_filtered_subreads.fastq.gz.1. Since the file is named "filtered_subreads", can I assume it is free from PacBio adapter contamination?

A similar file is available on the ENA, named SRR6037732_subreads.fastq.gz. Is this file identical to the D1_filtered_subreads.fastq.gz.1 file from NCBI?

ENA NCBI SRA • 784 views

updated 4 months ago by GenoMax 154k • written 4 months ago by wes ▴ 90

Both archives should have checksums available for their files. You could compare them to see whether the two files are the same.
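If you have already downloaded both files, the comparison can also be done locally. Below is a minimal Python sketch, assuming both files are on disk. The function names are made up for illustration. Two gzip files can hold identical reads and still differ byte for byte (different compression level or header timestamp), so the sketch hashes the decompressed content as well as the raw file.

```python
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hex MD5 of the file exactly as stored on disk (compressed bytes)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large FASTQ files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_gzip_content(path, chunk_size=1 << 20):
    """Hex MD5 of the decompressed content of a gzip file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file` on each download lets you check it against the checksum the archive publishes. If those differ between archives, comparing `md5_of_gzip_content` for the two files tells you whether the underlying FASTQ text is nevertheless identical.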

4 months ago by i.sudbery 22k

Thanks for suggesting that I compare their checksums.

4 months ago by wes ▴ 90

How can I ensure that the subread file is free of PacBio adapter contamination, apart from checking with FastQC? Although the FastQC results show no obvious adapter contamination, I’m concerned there might be residual adapters that were not detected, which could potentially affect the assembly process. Since the data from the PacBio RSII system undergoes primary read preprocessing onboard the instrument, does that mean the output is guaranteed to be free of adapter contamination?

4 months ago by wes ▴ 90

The file name suggests the data is filtered, but you can use lima (LINK) to confirm that. You will need to know which library prep method was used for your data. Alternatively, you can use one of the workflows PacBio provides, if one fits your needs: https://github.com/PacificBiosciences
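As a rough cross-check alongside a dedicated tool (not a replacement for one), you could count reads that contain short exact k-mers from the adapter. This is only a sketch under assumptions: the function names are hypothetical, the adapter sequence is something you supply yourself from the documentation for your library prep, and the input is assumed to be a gzipped FASTQ with four lines per record. RSII subreads have a fairly high raw error rate, so the sketch matches short k-mers rather than the full adapter. Chance matches are possible in long reads, so treat the count as a hint rather than proof.

```python
import gzip


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def adapter_kmers(adapter, k):
    """All k-mers from the adapter and from its reverse complement."""
    kmers = set()
    adapter = adapter.upper()
    for seq in (adapter, reverse_complement(adapter)):
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def count_reads_with_adapter_kmers(fastq_gz_path, adapter, k=12):
    """Return (total_reads, reads_containing_at_least_one_adapter_kmer).

    `adapter` is user-supplied; no adapter sequence is assumed here.
    """
    kmers = adapter_kmers(adapter, k)
    total = 0
    flagged = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line FASTQ record the sequence is the second line.
            if line_number % 4 != 1:
                continue
            total += 1
            read = line.strip().upper()
            if any(read[i:i + k] in kmers for i in range(len(read) - k + 1)):
                flagged += 1
    return total, flagged
```

If a noticeable fraction of reads is flagged, it would be worth inspecting those reads more closely with a proper adapter-aware tool before assembly.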

4 months ago by GenoMax 154k