Forum: How do you organize sequence data (FASTQs)?
6.1 years ago
Daniel E Cook ▴ 280

I'm in a lab that has done a lot of sequencing. We've sequenced many samples over time, on different platforms (Illumina, PacBio), with different read lengths (50-150 bp), with different library preparation methods (tagmentation, sonication), at different sequencing centers, and for different types of sequencing (RNA-Seq, WGS, small RNA, etc.).

At this point we have terabytes of sequencing data and it is becoming unwieldy. I have several questions about how to manage sequence data and how to use it.

  1. How do you organize your sequencing data?
  2. What file structure do you use?
  3. How do you retain metadata regarding the sequencing data?
  4. How do you back up your sequence data?
  5. How do FASTQs get fed into your bioinformatic pipelines?

It would be great to get a discussion going in this regard. I haven't seen too many questions on this subject (please direct me if I am mistaken!), but I imagine it is a problem many have to deal with.

Thanks!

RNA-Seq sequencing • 2.5k views

This discussion can be broadly split into two categories.

  1. Individual labs (like @Daniel's)
  2. Core facilities

Requirements for these two are going to be very different. So it may be best to include in your answer which category you are referring to.


How do you back up your sequence data?

Consider using NCBI SRA as a backup, in addition to other options (tape, disks, cloud, etc.), and let NCBI take care of it. Even unpublished data can be uploaded under an embargo until the publication is out.


See also this old post: How Do You Manage Your Files & Directories For Your Projects ? (8 years ago ?!)

6.1 years ago

How do you organize your sequencing data?

Renaming the files works best for me. For example:

Raw file: ABC_XXX_R1.fq.gz
Renamed : Project_001_Sample_001_Organism_abc_R1.fq.gz

The name could be shortened, with a metadata or README.txt file accompanying the reads in the same folder.
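A minimal sketch of that renaming scheme in Python. The function name `renamed_fastq` and the three-digit zero padding are assumptions read off the single example above, not something the answer specifies.

```python
def renamed_fastq(project: int, sample: int, organism: str, read: int) -> str:
    """Build a name such as Project_001_Sample_001_Organism_abc_R1.fq.gz."""
    if read not in (1, 2):
        raise ValueError("read must be 1 or 2 for paired-end data")
    # Zero-pad the numeric IDs to three digits (assumed from the example name).
    return (
        f"Project_{project:03d}_Sample_{sample:03d}"
        f"_Organism_{organism}_R{read}.fq.gz"
    )


print(renamed_fastq(1, 1, "abc", 1))
```

Generating names from one function, instead of typing them by hand, keeps the pattern consistent across projects.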

What file structure do you use?

A very basic:

├───project_01_Apr_2018
│       Project_001_Sample_001_Organism_abc_R1.fq.gz
│       Project_001_Sample_001_Organism_abc_R2.fq.gz
│       ReadMe.txt
│
└───project_02_May_2018
        Project_002_Sample_002_Organism_acc_R1.fq.gz
        Project_002_Sample_002_Organism_acc_R2.fq.gz
        ReadMe.txt

How do you retain metadata regarding the sequencing data?

By storing information such as:

  • vendor
  • date of data generation
  • organism
  • application (WGS, 16S)
  • platform
  • chemistry
  • data size
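One way to keep those fields together is a small key/value ReadMe.txt per project folder. This is only a sketch: the field names mirror the list above, while `render_readme`, `write_readme`, and every value in `example` are hypothetical placeholders.

```python
from pathlib import Path

# Field names taken from the list above; the exact spelling is an assumption.
FIELDS = [
    "vendor", "date", "organism", "application",
    "platform", "chemistry", "data_size",
]


def render_readme(meta: dict) -> str:
    """Render one 'field: value' line per metadata field, in a fixed order."""
    missing = [field for field in FIELDS if field not in meta]
    if missing:
        raise KeyError(f"missing metadata fields: {missing}")
    return "".join(f"{field}: {meta[field]}\n" for field in FIELDS)


def write_readme(folder: Path, meta: dict) -> Path:
    """Write ReadMe.txt next to the FASTQ files in the project folder."""
    target = folder / "ReadMe.txt"
    target.write_text(render_readme(meta))
    return target


# Placeholder values for illustration only.
example = {
    "vendor": "ExampleVendor",
    "date": "2018-04-01",
    "organism": "abc",
    "application": "WGS",
    "platform": "Illumina",
    "chemistry": "unknown",
    "data_size": "unknown",
}
text = render_readme(example)
```

Refusing to write a ReadMe with missing fields is a deliberate choice here: gaps in metadata are much harder to fill in years later.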

How do you back up your sequence data?

An external hard disk. Project management software works best.

How do FASTQs get fed into your bioinformatic pipelines?

  • preprocessing the data
  • genome assembly
  • this really depends on the case
6.1 years ago

How do you retain metadata regarding the sequencing data?

Don't store FASTQ; use uBAM (unaligned BAM) and store whatever you want in the BAM header and read groups.

* FASTQ must die! Long live SAM/BAM! *

https://blastedbio.blogspot.fr/2011/10/fastq-must-die-long-live-sambam.html
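To illustrate what "store it in the read groups" can look like, here is a sketch that builds an `@RG` header line in plain Python. The tags used (ID, SM, LB, PL, CN, DT) are defined in the SAM specification; the helper name `read_group_line` and the example values are hypothetical, and the answer does not say which tool is used to create the uBAM.

```python
def read_group_line(rg_id, sample, library, platform, center=None, date=None):
    """Build a tab-separated SAM @RG header line from run metadata."""
    fields = [
        ("ID", rg_id),      # read group identifier
        ("SM", sample),     # sample name
        ("LB", library),    # library
        ("PL", platform),   # platform, e.g. ILLUMINA or PACBIO
        ("CN", center),     # sequencing center (optional)
        ("DT", date),       # run date, ISO 8601 (optional)
    ]
    parts = ["@RG"] + [f"{tag}:{value}" for tag, value in fields if value is not None]
    return "\t".join(parts)


print(read_group_line("rg1", "sample1", "lib1", "ILLUMINA", center="ExampleCenter"))
```

A line like this would sit in the header of the unaligned BAM, so platform, library and center travel with the reads.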

6.1 years ago
Eric Lim ★ 2.1k

We don't really store FASTQ anymore. Reads, aligned or unaligned, are stored as sorted BAM. If for some rare reason we need the FASTQ again, we just convert the BAM back.

How do you organize your sequencing data?

We store all genomics-related data in AWS S3, grouped by a unique ID for each project, e.g. projects/genomics/GE0001/sample1/sample1.sorted.bam, where GE stands for Genomic Projects.
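As a sketch, the key layout could be generated like this. The function name `bam_key` and the four-digit padding of the project number are assumptions inferred from the one example path, and no AWS call is made.

```python
def bam_key(project_number: int, sample: str) -> str:
    """Build an object key such as projects/genomics/GE0001/sample1/sample1.sorted.bam."""
    # "GE" prefix plus a zero-padded number (padding width assumed from GE0001).
    project_id = f"GE{project_number:04d}"
    return f"projects/genomics/{project_id}/{sample}/{sample}.sorted.bam"


print(bam_key(1, "sample1"))
```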

What file structure do you use?

We primarily use snakemake with wildcards to build DAGs, so it's natural to simply append the extension to indicate what the file is for. For instance, sample1.star.coding.altevents.a5.disease.txt indicates the data is processed via STAR, filtered for coding regions, and it contains alternative 5'ss events with disease annotations.
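A side benefit of that naming convention is that the processing history can be read back from the file name. A small sketch, assuming sample names never contain a dot; `parse_output_name` is a hypothetical helper, not part of snakemake.

```python
def parse_output_name(filename: str) -> dict:
    """Split 'sample.step1.step2.ext' into sample, processing steps and extension."""
    parts = filename.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least 'sample.ext', got {filename!r}")
    # First part is the sample, last part the file extension,
    # everything in between is one appended processing step each.
    sample, *steps, extension = parts
    return {"sample": sample, "steps": steps, "extension": extension}


info = parse_output_name("sample1.star.coding.altevents.a5.disease.txt")
print(info["sample"], info["steps"], info["extension"])
```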

How do you retain metadata regarding the sequencing data?

For the most part, plain old YAML, located in each genomic project directory.

How do you back up your sequence data?

We use S3 versioning. For internal data, we also keep copies of the FASTQ or BAM files on a local hard drive.

How do FASTQs get fed into your bioinformatic pipelines?

We run snakemake in an AWS Batch environment. Data are permanently stored in S3, staged in to temporary storage during execution of the workflow, and staged out to S3 when done.
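The stage-in / run / stage-out pattern can be sketched without any AWS dependency. Here a local directory stands in for the S3 bucket, and `run_staged` and its `step` callback are hypothetical names; the actual setup presumably relies on snakemake and AWS Batch rather than code like this.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(permanent: Path, input_name: str, output_name: str, step) -> Path:
    """Copy an input to scratch space, run `step` there, copy the result back."""
    with tempfile.TemporaryDirectory() as scratch:
        scratch_dir = Path(scratch)
        # Stage in: permanent storage -> temporary working directory.
        staged_in = scratch_dir / input_name
        shutil.copy2(permanent / input_name, staged_in)
        # Run the workflow step entirely on temporary storage.
        staged_out = scratch_dir / output_name
        step(staged_in, staged_out)
        # Stage out: copy the result back before the scratch space is deleted.
        destination = permanent / output_name
        shutil.copy2(staged_out, destination)
    return destination
```

Called as `run_staged(store, "in.bam", "out.txt", my_step)`, where `my_step(src, dst)` reads `src` and writes `dst`, the permanent store only ever sees finished files; partial outputs disappear with the scratch directory.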
