Question: Data files naming schemes
16 months ago, Darked89 (Barcelona, Spain) wrote:

I am getting piles of FASTQ files with names that are either generic (R12345.r1.fq) or plainly confusing (170811p1pt.r1.fq).

While storing the md5, project name, etc. helps a bit, I feel that without a massive-scale rename I will not be able to make sense of the results, or even get the results in the first place.

Do you require that such files carry labels (dna, rna, net), followed where needed by, say, wgs, exo, tg1, etc.? The wet lab people are multiplexing samples and dumping sequencing folders with rather spartan Excel metadata. There is no LIMS and no consistent naming scheme.

I am renaming everything to stay sane, keeping CSV files with old_name, new_name, flowcell, machine_id, number of reads, and run_date.
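The rename-plus-CSV bookkeeping described above could be sketched roughly as follows. This is only a minimal illustration, not the poster's actual script: the function names `md5sum` and `rename_with_manifest`, the three-column manifest layout, and the example target name are all assumptions made for the example. The remaining columns (flowcell, machine_id, number of reads, run_date) would have to come from the sequencing metadata.

```python
import csv
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rename_with_manifest(directory, mapping, manifest_path):
    """Rename files in `directory` per `mapping` (old_name -> new_name).

    Appends one row per renamed file to the CSV at `manifest_path`
    (hypothetical columns: old_name, new_name, md5).
    Raises instead of overwriting an existing target file.
    """
    directory = Path(directory)
    manifest_path = Path(manifest_path)
    write_header = not manifest_path.exists()
    with open(manifest_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["old_name", "new_name", "md5"])
        for old_name, new_name in mapping.items():
            source = directory / old_name
            target = directory / new_name
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            checksum = md5sum(source)  # computed before the rename
            source.rename(target)
            writer.writerow([old_name, new_name, checksum])
```

A call might look like `rename_with_manifest("run_folder", {"R12345.r1.fq": "projA_s1_dna_wgs.r1.fq"}, "manifest.csv")`, where the folder and the new name are made up. Recording the checksum next to both names means a file can still be traced if it is renamed again later.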

I would be grateful for suggestions on how to improve this. Moving from CSV to a DB with a frontend is the obvious next step.

sequencing • 301 views
modified 16 months ago by Pierre Lindenbaum • written 16 months ago by Darked89

A naming scheme that would work universally is difficult to implement. If you deal with tens of thousands of samples for a large consortium project then short of a LIMS/DB nothing will work.

One of the issues we deal with in a core facility is people naming their samples Sample_101, Sample_201, etc. While that makes perfect sense to them (a code, if you will), it obviously causes issues on the core's end. A unique identifier that is generated automatically (and does not need to be human readable) is one way of avoiding this issue. The names can also be translated on the fly: store the file under any name you want, and your users will see the name they are familiar with on the front end. This only works if they access the results you produce indirectly (via a portal, for example).
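As a rough illustration of the opaque-identifier idea, here is a minimal in-memory sketch. The class `SampleRegistry` and its methods are hypothetical names invented for this example. A real setup would keep the mapping in a database or LIMS rather than in a Python dict.

```python
import uuid


class SampleRegistry:
    """Maps opaque, auto-generated IDs to the names users submitted."""

    def __init__(self):
        self._display_names = {}

    def register(self, submitted_name):
        """Store a sample under a fresh opaque ID and return that ID."""
        sample_id = uuid.uuid4().hex
        self._display_names[sample_id] = submitted_name
        return sample_id

    def display_name(self, sample_id):
        """Translate an opaque ID back to the user's own name."""
        return self._display_names[sample_id]


registry = SampleRegistry()
# Two users submit the same name; the stored IDs still differ.
id_a = registry.register("Sample_101")
id_b = registry.register("Sample_101")
```

Files would be stored under the opaque ID, and `display_name` would be called only when results are shown to the user.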

If more people than just you need to access or work on the data, then implementing a proper tracking system will pay dividends in the long term, even after you leave.

modified 16 months ago • written 16 months ago by GenoMax
16 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

For FASTQ, you can store everything as an unmapped BAM (uBAM) file. There you can put a description of the samples, the date, the project, etc. in the SAM header. See https://gatkforums.broadinstitute.org/gatk/discussion/5990/what-is-ubam-and-why-is-it-better-than-fastq-for-storing-unmapped-sequence-data
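To make the header idea concrete, here is a small sketch that formats sample metadata as a SAM `@RG` (read group) header line. The tags used (ID, SM, LB, PL, DT, DS) are defined in the SAM specification. The helper name `make_rg_line` and all the values are invented for illustration.

```python
def make_rg_line(fields):
    """Format a dict of read-group tags as one SAM @RG header line."""
    if "ID" not in fields:
        raise ValueError("the SAM specification requires an ID tag in @RG")
    # ID goes first; the remaining tags follow in alphabetical order.
    ordered = ["ID"] + sorted(tag for tag in fields if tag != "ID")
    return "@RG\t" + "\t".join(f"{tag}:{fields[tag]}" for tag in ordered)


# Hypothetical metadata for one sample.
rg_line = make_rg_line({
    "ID": "run42.lane1",
    "SM": "sample_A",
    "LB": "lib1",
    "PL": "ILLUMINA",
    "DT": "2017-08-11",
    "DS": "projectX RNA-seq",
})
```

The free-text DS (description) tag is one place for project notes, while SM and DT carry the sample name and run date.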

written 16 months ago by Pierre Lindenbaum

I have tried that route but got stuck at the downstream data processing. Unless one implements the entire Broad pipeline, I still have to parse the uBAM's read group (RG) info myself, for example. Also, being on a slow network, I tend to shrink the data with clumpify from BBMap and pigz/pbzip2, and I need to check whether uBAMs are of comparable size. Last but not least: renaming, if done right, permits a brain-dead mv -i pattern.fq.gz destination/

That takes less mental energy than moving the files named in this or that file list or RG group. Vanilla users can do it and check that things are going OK.
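For reference, the RG parsing mentioned above does not need much code if the header is available as text (for instance from `samtools view -H file.bam`). The following is only a sketch with a hypothetical function name, `parse_read_groups`. It covers the `@RG` lines and nothing else in the header.

```python
def parse_read_groups(header_text):
    """Return one dict of tag -> value per @RG line in SAM header text."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue
        tags = {}
        for field in line.split("\t")[1:]:
            # Split on the first colon only, so values such as
            # timestamps keep their own colons.
            tag, _, value = field.partition(":")
            tags[tag] = value
        read_groups.append(tags)
    return read_groups
```

Each returned dict can then be written to the same kind of CSV or DB used for the renamed FASTQ files.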

By the way, what is the proper way to use bwa/STAR with uBAMs as input?

written 16 months ago by Darked89