Question: Software to extract run information from FASTQ headers?
James Ashmore (3.0k), UK/Edinburgh/MRC Centre for Regenerative Medicine, wrote 18 months ago:

Given a directory containing multiple FASTQ files, I would like to retrieve the run information from each file (e.g. flowcell, lane number, index, etc.) and export the data to a CSV file. Before I write a script to do this myself, is anyone aware of software that already does this?

fastq • 746 views
modified 18 months ago by Pierre Lindenbaum (129k) • written 18 months ago by James Ashmore (3.0k)

Writing a one-liner that does this should take less time than it did to write your post. This is homework, isn't it?

Edit: considering your rep, maybe it isn't.

modified 18 months ago • written 18 months ago by 5heikki (8.9k)

This isn't homework. There seem to be a surprising number of edge cases depending on which version of CASAVA was used to convert from BCL to FASTQ format.

written 18 months ago by James Ashmore (3.0k)

How is the required information encoded in the FASTQ headers, and is it encoded there at all? I am not sure there is a standard format for FASTQ headers. The information might be available as metadata from your sequencing provider, or it might be encoded in the file names. If you could provide an example of your filenames and headers, someone might be able to help you with a quick sed|grep|awk script.

written 18 months ago by Michael Dondrup (47k)

While there isn't really a standard, read names from Illumina machines have a more or less common format, as long as the provider hasn't changed them. I guess it would be useful to extract things like the machine model and ID and the flow cell ID, and to match those up to databases of what the strings mean (i.e. identify file one as coming from a HiSeq 2500 and file two as coming from a NovaSeq, etc.).

Sounds like it might be quite useful, but I've not seen a tool that does it before.

written 18 months ago by i.sudbery (8.4k)

In my case, I want to build read group information as described in the GATK documentation, automating the extraction of run information and creating a read group for each run/sample.
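
As a rough illustration of that goal, the sketch below assembles an `@RG` header line from run fields that have already been extracted. It assumes the convention commonly attributed to the GATK read-group documentation (`ID` as flowcell.lane, `PU` as flowcell.lane.barcode, `PL` as `ILLUMINA`); the function name `read_group` and the fallback of using the sample name as the library (`LB`) are assumptions made for this example, not part of any tool.

```python
def read_group(sample, flowcell, lane, index="", library=None, platform="ILLUMINA"):
    """Build a tab-separated @RG line from run information.

    Assumed convention: ID = flowcell.lane, PU = flowcell.lane.index.
    Falling back to the sample name for LB is an assumption for this sketch;
    pass the real library name when it is known.
    """
    rg_id = f"{flowcell}.{lane}"
    # Only append the barcode to the platform unit when one is available.
    unit = f"{rg_id}.{index}" if index else rg_id
    fields = [
        ("ID", rg_id),
        ("PU", unit),
        ("SM", sample),
        ("PL", platform),
        ("LB", library or sample),
    ]
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in fields)
```

The returned string contains real tab characters; aligners that take the read group on the command line may expect the tabs written as a literal `\t` instead.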

written 18 months ago by James Ashmore (3.0k)

I think Illumina has a standardized format

So basically it's something like this (maybe not exactly; I don't know whether one file can contain more than one lane, tile, index, or whatever):

find /some/place -maxdepth 1 -name "*.fq" | xargs -I {} awk -v n="{}" 'BEGIN{FS=":";OFS=","}NR==1{print n,$2,$3..}' {}
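
For anyone who prefers a script to a one-liner, here is a minimal Python sketch of the same idea: read the first header of each FASTQ file in a directory and write the run fields to a CSV file. It assumes two common Illumina read-name layouts, `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index` (CASAVA 1.8 and later) and `@instrument:lane:tile:x:y#index/read` (earlier pipelines); headers rewritten by a provider will not match and are left blank. The names `parse_header`, `first_header`, and `summarise` are made up for this example, and, like the awk command, it only looks at the first read of each file.

```python
import csv
import gzip
import re
from pathlib import Path

# Assumed CASAVA 1.8+ layout: @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
NEW_STYLE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):(?P<lane>\d+):"
    r"\d+:\d+:\d+(?:\s+(?P<read>\d):[YN]:\d+:(?P<index>\S*))?"
)
# Assumed pre-1.8 layout: @instrument:lane:tile:x:y#index/read
OLD_STYLE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<lane>\d+):\d+:\d+:\d+"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<read>\d))?"
)
FIELDS = ["file", "instrument", "run", "flowcell", "lane", "read", "index"]


def parse_header(header):
    """Return the run fields of one read name, or None if no layout matches."""
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(header)
        if match:
            found = match.groupdict()
            # Fields absent from a layout (e.g. flowcell in old headers) become "".
            return {key: found.get(key) or "" for key in FIELDS if key != "file"}
    return None


def first_header(path):
    """Read the first line of a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.readline().rstrip("\n")


def summarise(directory, out_csv):
    """Write one CSV row per *.fq / *.fastq (optionally .gz) file in directory."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for path in sorted(Path(directory).iterdir()):
            if not re.search(r"\.(fq|fastq)(\.gz)?$", path.name):
                continue
            row = parse_header(first_header(path)) or {}
            # Unrecognised headers still get a row, with empty run fields.
            writer.writerow({"file": path.name, **row})
```

Calling `summarise("/some/place", "runs.csv")` would then produce one row per file; checking every read rather than the first would be needed if a file can mix lanes or indexes.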

modified 18 months ago • written 18 months ago by 5heikki (8.9k)
Pierre Lindenbaum (129k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 18 months ago:

I wrote illuminadir for my lab: http://lindenb.github.io/jvarkit/IlluminaDirectory.html. The output is a JSON or XML file:

$ find . -type f -name "*.fastq.gz" | java -jar dist/illuminadir.jar -j
{
    "samples": [
        {
            "files": [
                {
                    "forward": {
                        "file-size": 82852,
                        "md5filename": "3854678b2381caa92758a980eb0d180d",
                        "path": "./src/test/resources/SAMPLE1_GATGAATC_L002_R1_001.fastq.gz",
                        "side": 1
                    },
                    "id": "p1",
                    "index": "GATGAATC",
                    "lane": "2",
                    "md5pair": "6fb0b6d5dec57436724e8efcb736aed5",
                    "reverse": {
                        "file-size": 82625,
                        "md5filename": "5450dea531fb2d807402d6bcc59cf15d",
                        "path": "./src/test/resources/SAMPLE1_GATGAATC_L002_R2_001.fastq.gz",
                        "side": 2
                    },
                    "split": "1"
                }
            ],
            "sample": "SAMPLE1"
        }
    ],
    "undetermined": []
}
modified 18 months ago • written 18 months ago by Pierre Lindenbaum (129k)

Thanks Pierre, I should be able to convert from JSON to CSV on the fly.
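
One way that conversion might look is sketched below. It relies only on the keys visible in the JSON example above (`samples`, `files`, `sample`, `id`, `index`, `lane`, `split`, and the `path` of `forward` and `reverse`); the function names and the column order are choices made for this example, and entries under `undetermined` are ignored because their structure is not shown.

```python
import csv
import json

COLUMNS = ["sample", "id", "index", "lane", "split", "forward", "reverse"]


def rows_from_illuminadir(report):
    """Yield one flat dict per FASTQ pair in a parsed illuminadir JSON report."""
    for sample in report.get("samples", []):
        for pair in sample.get("files", []):
            yield {
                "sample": sample.get("sample", ""),
                "id": pair.get("id", ""),
                "index": pair.get("index", ""),
                "lane": pair.get("lane", ""),
                "split": pair.get("split", ""),
                # Missing mates (e.g. single-end data) are written as empty cells.
                "forward": pair.get("forward", {}).get("path", ""),
                "reverse": pair.get("reverse", {}).get("path", ""),
            }


def json_to_csv(source, target):
    """Read JSON from the file-like `source` and write CSV to `target`."""
    writer = csv.DictWriter(target, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows_from_illuminadir(json.load(source)))
```

Passing `sys.stdin` and `sys.stdout` as `source` and `target` would let the script sit at the end of the pipe shown in the answer.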

written 18 months ago by James Ashmore (3.0k)