Automation And Post-Processing Of CASAVA 1.8 Output
12.8 years ago

Moving from previous versions of Illumina's CASAVA pipeline to CASAVA 1.8, there are a number of changes in terms of input, output and file structure.

While some of these new 'features' are meant to make life easier for post-pipeline analysis (like the new project / sample directory structure), other less-heralded changes look like they were done, in my opinion, to make life easier for Illumina and not Illumina's customers.

My question is for those groups that have moved up to CASAVA 1.8 (or are looking at doing so): what kind of post-processing do you do to the data files generated by this pipeline? Have you found any 'gotchas' in terms of changes new to 1.8? Are you automating the running of this pipeline and its post-run functionality?

From a previous BioStar question, and from what I've heard, most people don't just use the stock CASAVA pipeline for all their analysis. But I have not seen much about how people are wrangling the pipeline's awkward and ever-changing folder structure and output formats.

Specifically, from what I've seen, there are a few new changes in 1.8 that demand some post-run attention:

  • The fastq files for a specific lane / index are broken up based on file size and so must be concatenated back together.
    • As the directory structure for how these fastq files are organized is now specified by the user, care needs to be taken to ensure you are concatenating the right files, in the right order (see the concatenation sketch after this list).
  • The reads that don't pass filter are still present in the unaligned fastq files.
    • As I understand it, this was not the case previously - and so post-run analysis now needs to be aware of and deal with these bad reads, or they need to be removed before post-run analysis begins.
    • The CASAVA documentation suggests using zcat and grep to exclude filtered reads. Here's the bash script they suggest:
    cd /path/to/project/sample
    mkdir filtered
    for fastq in *.fastq.gz ; do zcat $fastq | grep -A 4 '^@.* [^:]*:N:[^:]*:' > filtered/$fastq ; done
  • But this will produce invalid output (at least with our version of grep). Some problems with this code include (a corrected sketch follows this list):

    • -A prints that many trailing lines after each match, so you actually want -A 3: three trailing lines in addition to the matched header line.
    • -A also inserts a "--" separator between groups of matches wherever lines were skipped (i.e., around filtered-out reads), so those separator lines need to be removed as well.
    • The output is redirected to a file named [something].fastq.gz, but because the stream was just decompressed with zcat, it is no longer gzip-compressed.
    • This causes tools that infer the format from the file name (like FastQC) to fail.
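One way to address all three problems (a sketch only, assuming GNU grep and the CASAVA 1.8 header format; not an official fix) would be:

    cd /path/to/project/sample
    mkdir -p filtered
    for fastq in *.fastq.gz ; do
        # -A 3 keeps the three lines after each matching header (four lines total);
        # grep -v strips the "--" group separators (safe in practice: a quality line
        # of exactly "--" would require a two-base read);
        # gzip makes the file contents match the .gz name again.
        zcat "$fastq" \
          | grep -A 3 '^@.* [^:]*:N:[^:]*:' \
          | grep -v '^--$' \
          | gzip > "filtered/$fastq"
    done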

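For the concatenation point in the first bullet: gzip streams can simply be concatenated, and the numeric chunk suffixes are zero-padded, so a shell glob keeps them in order. A minimal sketch, with hypothetical sample, barcode and lane names:

    # Reassemble the size-split chunks for one sample / lane / read: the _001, _002, ...
    # suffixes are zero-padded, so the glob expands in order, and concatenated gzip
    # streams are themselves a valid gzip file.
    cat Sample_ABC/ABC_ACAGTG_L003_R1_*.fastq.gz > ABC_ACAGTG_L003_R1.fastq.gz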
I highlight this change to show the errors that can be introduced with this kind of post-processing, which is one of the reasons I'm asking if others are running into these kinds of issues.

So what are people doing about this? Are proprietary LIMS systems already on the ball, with all these kinks worked out? Are these changes too trivial to be mentioned publicly? Is no one else moving to CASAVA 1.8 yet?

Thanks
Jim

illumina pipeline next-gen sequencing

+1, I was just about to ask a similar question ;-)


As an aside, here is a broad overview of the current automated pipeline we have with 1.8:

1) Create config.txt and SampleSheet.csv from LIMS data
2) Convert raw unaligned reads to FASTQ format and demultiplex lanes
3) Perform alignment to the reference genome with ELAND
4) Aggregate and rename unaligned reads
5) Remove reads that do not pass filter
6) Analyze unaligned reads with FastQC
7) Aggregate and rename export files
8) Distribute data to project directories
9) Distribute stats and quality-control analysis to permanent storage

12.8 years ago

Well, we actually only use CASAVA for the FASTQ generation, and we used to use it to align to phiX to get the summary file. (We spike 1% phiX into all the lanes.)

Now we use SAV to check the phiX part since HCS does this alignment on the fly.

So what's left is only the demultiplexing. Doing this directly from the BCLs is wonderful for us.

Also, you can configure the demux step not to break up the files, which we did. The only thing that's left is dealing with reads that don't pass filter.

We used our own 'implementation' (modified grep command :-) ) to do this.

Overall, other than minor tweaks to our LIMS to look for files in different places, CASAVA 1.8 simplified our pipeline.

Also, since it starts directly from BCLs, it's quite a bit faster than the old bcl -> qseq -> demux -> align/fastq -> gzip pipeline.

Now it's just:

bcl -> demuxed, compressed fastqs

Wonderful.


Good to hear. I'm not saying that some of the changes aren't good or useful (especially the use of BCLs directly!).

Regarding not splitting up the fastq files: are you referring to the --fastq-cluster-count parameter?

I was concerned about messing with that too much, because of the danger of it 'silently failing' when misconfigured. It might be less of a problem if you aren't using ELAND for alignment.

I too wrote an 'implementation' (zcat + ruby) to filter reads.

What do you have this set to?

Thanks again.


"hear" not "here".

Has anyone had experience using ELAND in conjunction with external tools?


Yes, that's the parameter I used.

If you're not using ELAND, it's not a problem to set this high. We set it to the max signed int, 2 billion or so.
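For reference, that value is passed when configuring the demultiplexing step. Roughly (a sketch from memory of the 1.8 documentation; paths are placeholders and exact option names may vary between point releases):

    # Sketch only: configure demultiplexing straight from the BCLs, with a huge
    # --fastq-cluster-count so each sample / lane / read ends up in a single file.
    configureBclToFastq.pl \
        --input-dir /path/to/runfolder/Data/Intensities/BaseCalls \
        --output-dir /path/to/Unaligned \
        --sample-sheet /path/to/SampleSheet.csv \
        --fastq-cluster-count 2000000000
    cd /path/to/Unaligned && make -j 8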

Well, now ELAND can generate BAMs, so I guess any of the normal downstream tools would work, unless ELAND doesn't set some things in the BAM (mapping quality and such).

I've never used ELAND's output downstream myself.


I kind of mixed two answers into one post. I hope it's clear enough.


Good to know, thank you. The ELAND alignment isn't used much downstream, but people apparently still want it done as part of the pipeline as a baseline alignment check. Also, Illumina apparently asks for its output sometimes.

We haven't tried the SAM/BAM conversion yet, but it will be interesting to see whether having the files in BAM format makes the alignment more useful to anyone.

Thanks again

12.7 years ago
Nick Loman

For what it's worth, I wrote a script to remove reads that fail the filter from pairs of reads coming out of CASAVA 1.8. It's in Python and not super-fast, but it may be useful for you. The script is available on GitHub. I use a large value for --fastq-cluster-count, e.g. 1000000000, so that you get a single file per read of the pair.
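The script is in Python; for anyone who would rather stay in the shell, a rough sketch of the same pair-aware idea (file names here are hypothetical) might look like this:

    # Sketch only: linearise each gzipped FASTQ to one tab-separated record per read,
    # join R1 and R2 side by side, keep pairs where BOTH headers carry the :N:
    # (pass-filter) flag, then split back out into two gzipped files.
    paste <(zcat lane1_R1.fastq.gz | paste - - - -) \
          <(zcat lane1_R2.fastq.gz | paste - - - -) \
      | awk -F'\t' '$1 ~ / [^:]*:N:[^:]*:/ && $5 ~ / [^:]*:N:[^:]*:/' > kept_pairs.tsv
    cut -f1-4 kept_pairs.tsv | tr '\t' '\n' | gzip > lane1_R1.filtered.fastq.gz
    cut -f5-8 kept_pairs.tsv | tr '\t' '\n' | gzip > lane1_R2.filtered.fastq.gz
    rm kept_pairs.tsv

Keeping only pairs where both mates pass filter avoids ending up with asymmetric R1/R2 files.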


Thanks. Our version of the filter is also up on GitHub. We use zcat to decompress the fastq file on the fly, so the filter need only worry about rejecting reads. It's great to see how one might handle decompression and filtering at the same time. Do you use ELAND for alignment?


No, I don't use ELAND. We mainly do bacterial work, so the data goes into an assembly pipeline next. My script deals with pairs because I was worried that one file would have a filtered read where the other didn't, leaving you with an asymmetric set of files, but that doesn't seem to happen in practice.

12.7 years ago
Swbarnes2

Am I the only person who uses all the reads, filtered and unfiltered? Because I started out working on microbial projects, where alignment time is negligible, I wanted to be safe and leave them in. Isn't whether a read maps or not the truer test of its quality? In my work, I'm mostly looking for SNPs, and would rather have some false positives than miss a real one. Is filtering out the non-passing reads standard for other kinds of projects (RNA-seq, exome capture)?


I'm new to the field, but it is my understanding that Illumina did not previously include filtered reads in the sequence files. We took a vote, and few of our analysts really cared about these low-quality reads. So we are filtering them out for now, to maintain how things worked before. In the future, we might switch to leaving them in if there is ever a reason to do so.


I agree. If there were some compelling reason, like really needing the extra coverage, I might include the filtered reads, but usually we have more coverage than we need for our samples, so I'd rather keep just the high-quality ones.

12.1 years ago
Tolyar

The correct command is:

mkdir -p filtered
for fastq in *.fastq.gz ; do zcat $fastq | sed -n -e '/^@.* [^:]*:N:[^:]/ {N;N;N;p}' | gzip -5 -c > filtered/$fastq ; done

It removes the filtered reads and re-compresses the output.

Alternatively, you can install CASAVA 1.8.2, where this problem is fixed.
