Automation And Post Processing Casava 1.8 Output
10.9 years ago

Moving from previous versions of Illumina's CASAVA pipeline to CASAVA 1.8, there are a number of changes in terms of input, output and file structure.

While some of these new 'features' are meant to make life easier for post-pipeline analysis (like the new project / sample directory structure), other less-heralded changes look like they were done, in my opinion, to make life easier for Illumina and not Illumina's customers.

My question is for those groups that have moved up to CASAVA 1.8 (or are looking at doing so) what kind of post processing do you do to the data files generated by this pipeline? Have you found any 'gotchas' in terms of changes new to 1.8? Are you automating the running of this pipeline and post-run functionality?

From a previous BioStar question, and from what I've heard, most people don't use the stock CASAVA pipeline for all of their analysis. But I have not seen much about how people are wrangling the pipeline's awkward and ever-changing folder structure and output formats.

Specifically, from what I've seen, there are a few new changes in 1.8 that demand some post-run attention:

• The fastq files for a specific lane / index are broken up based on file size and so must be concatenated back together.
• As the directory structure for how these fastq files are organized is now provided by the user, care needs to be taken to ensure you are concatenating the right files (in the right order?).
• The reads that don't pass filter are still present in the unaligned fastq files.
• As I understand it, this was not the case previously - and so post-run analysis now needs to be aware of and deal with these bad reads, or they need to be removed before post-run analysis begins.
• The CASAVA documentation suggests using zcat and grep to exclude filtered reads. Here's the bash script they suggest:
    cd /path/to/project/sample
    mkdir filtered
    for fastq in *.fastq.gz ; do zcat $fastq | grep -A 4 '^@.* [^:]*:N:[^:]*:' > filtered/$fastq ; done

• But this will produce invalid output (at least with our version of grep). Some problems with this code include:

• -A prints that many additional trailing lines after each match, so you want 3 trailing lines in addition to the matched header line (i.e. -A 3, not -A 4)
• -A also inserts a "--" separator line between non-contiguous groups of matches, so you need to remove those
• You are redirecting this to a file named [something].fastq.gz - but because you just expanded it with zcat it is no longer a .gz file!
• This causes tools that use the file name to determine operations (like fastqc) to fail.

I highlight this change to show the errors that can be introduced with this kind of post-processing, which is one of the reasons I'm asking if others are running into these kinds of issues.
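For what it's worth, one way to address all three issues while sticking with the documentation's grep approach might look like the sketch below. The toy input is made up so the example is self-contained, and it assumes GNU grep and gzip:

```shell
# Toy input so the sketch is self-contained: one read that passes filter
# (the :N: flag in the header) and one that does not (:Y:).
cd "$(mktemp -d)"
printf '@SIM:1:FC:1:1:1:1 1:N:0:ATCG\nACGT\n+\nIIII\n' >  sample.fastq
printf '@SIM:1:FC:1:1:1:2 1:Y:0:ATCG\nTTTT\n+\nJJJJ\n' >> sample.fastq
gzip sample.fastq

mkdir filtered
for fastq in *.fastq.gz ; do
    # -A 3 (not 4): three trailing lines plus the matched header = one record.
    # grep -v '^--$' strips the group separators grep inserts between
    # non-contiguous matches (in theory this could clobber a 2-base quality
    # line of exactly "--", so a sed/awk solution is arguably safer).
    # Re-gzip so the .gz extension on the output file is honest again.
    zcat "$fastq" \
        | grep -A 3 '^@.* [^:]*:N:[^:]*:' \
        | grep -v '^--$' \
        | gzip > "filtered/$fastq"
done
```

That keeps tools like fastqc happy, since the output really is gzipped.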

So what are people doing about this? Are proprietary LIMS systems already on the ball and all these kinks are worked out? Are these changes too trivial to be mentioned publicly? Is no one else moving to CASAVA 1.8 yet?

Thanks
Jim

illumina pipeline next-gen sequencing

+1, I was just about to ask a similar question ;-)


As an aside, here is a broad overview of the current automated pipeline we have with 1.8:

1) Create config.txt and SampleSheet.csv from LIMS data
2) Convert raw unaligned reads to fastq format and demultiplex lanes
3) Perform alignment to reference genome with ELAND
4) Aggregate and rename unaligned reads
5) Remove reads that do not pass filter
6) Analyze unaligned reads using fastqc
7) Aggregate and rename export files
8) Distribute data to project directories
9) Distribute stats and quality control analysis to permanent storage
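Step 4 is worth spelling out, since gzip members can simply be concatenated with cat (the result is a valid multi-member gzip file) and the zero-padded _001/_002 chunk suffixes sort correctly. A sketch, where the file names follow the CASAVA 1.8 naming convention but are otherwise made up:

```shell
cd "$(mktemp -d)"
# Toy chunks standing in for CASAVA's size-split output.
printf '@read1\nAAAA\n+\nIIII\n' | gzip > Sample1_ATCG_L001_R1_001.fastq.gz
printf '@read2\nCCCC\n+\nIIII\n' | gzip > Sample1_ATCG_L001_R1_002.fastq.gz

# For each sample/lane/read prefix, concatenate its chunks in order.
# The shell expands the glob in lexical order, so _001 precedes _002.
for prefix in $(ls *_[0-9][0-9][0-9].fastq.gz | sed 's/_[0-9][0-9][0-9]\.fastq\.gz$//' | sort -u) ; do
    cat "${prefix}"_[0-9][0-9][0-9].fastq.gz > "${prefix}.fastq.gz"
done
```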

10.9 years ago

Well, we actually only use CASAVA for the fastq generation, and we used to use it to align to phiX to get the summary file (we spiked 1% phiX into all the lanes).

Now we use SAV to check the phiX part since HCS does this alignment on the fly.

So what's left is only the demultiplexing. Doing this directly from the bcls is wonderful for us.

Also, you can configure the demux step to not break up the files, which we did. The only thing left to handle is whether reads passed filter or not.

We used our own 'implementation' (modified grep command :-) ) to do this.

Overall other than minor tweaks to our lims to look at files in different places, CASAVA 1.8 simplified our pipeline.

Also, since it starts directly from bcls, it's quite a bit faster than the bcl->qseq->demux->align/fastq->gzip pipeline.

now it's

bcl->demuxed, compressed fastqs

wonderful.


Good to hear. I'm not saying that some of the changes aren't good or useful (especially the use of bcls directly!).

For not splitting up the fastq files - are you referring to the --fastq-cluster-count parameter?

I was concerned about messing with that too much, because of the danger of 'silently failing' when mis-configured. Might be less of a problem if you aren't using Eland for alignment.

I too wrote an 'implementation' (zcat + ruby) to filter reads.

What do you have this set to?

Thanks again.


"hear" not "here".

Anyone have experiences where ELAND is used in conjunction with external tools?


Yes that's the parameter I used.

If you're not using ELAND, it's not a problem to set this high. We set it to the max signed int, 2 billion or so.

Well, now ELAND can generate BAMs, so I guess any of the normal tools would work, unless they don't set some things in the BAM (mapping quality and such).

I've never used ELAND's output downstream.
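For reference, the knobs discussed above live on the demultiplexing setup step. A rough sketch of the invocation - the paths and run names here are hypothetical, so treat the details as an assumption to check against your CASAVA 1.8 documentation:

```shell
# Hypothetical paths; --fastq-cluster-count raised so each sample/read
# ends up in a single _001 chunk instead of many size-split files.
/opt/CASAVA-1.8/bin/configureBclToFastq.pl \
    --input-dir /runs/RUN_X/Data/Intensities/BaseCalls \
    --output-dir /runs/RUN_X/Unaligned \
    --sample-sheet /runs/RUN_X/SampleSheet.csv \
    --fastq-cluster-count 2000000000

# CASAVA 1.8 writes a Makefile into the output dir; the actual conversion
# is then just a (parallel) make.
make -C /runs/RUN_X/Unaligned -j 8
```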


I kind of mixed two answers into one post. I hope it's clear enough.


Good to know, thank you. The ELAND alignment isn't used much downstream, but people apparently still want it done as part of the pipeline as a baseline alignment check. Also, Illumina sometimes asks for its output.

We haven't tried the SAM/BAM conversion yet, but it will be interesting if having the files in BAM format will make the alignment more useful to anyone.

Thanks again

10.8 years ago
Nick Loman

For what it's worth I wrote a script to remove unfiltered reads from pairs of reads coming out of CASAVA 1.8. It's in Python and not super-fast, but perhaps useful for you. The script is available at Github. I use a large value for --fastq-cluster-count, e.g. 1000000000 so that you get a single file per pair.


Thanks. Our version of the filter is also up on Github. We use zcat to decompress the fastq file on the fly, so the filter need only worry about rejecting reads. It's great to see how one might handle decompression and filtering at the same time. Do you use ELAND for alignment?


No, I don't use ELAND. We mainly do bacterial stuff, so it goes into an assembly pipeline next. My script deals with pairs because I was worried that one file would have a filtered read and the other wouldn't, and you'd end up with an asymmetric set of files, but this doesn't seem to be the case in practice.
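For anyone who wants this pair-aware behaviour without a Python dependency, the same idea can be sketched in shell with paste/awk. This is only a sketch: the file names are hypothetical, and the header test assumes CASAVA 1.8's "1:N:0:index" format:

```shell
cd "$(mktemp -d)"
# Toy paired input: pair 1 passes filter in both files, pair 2 fails in R2.
printf '@SIM:1:FC:1:1:1:1 1:N:0:ATCG\nACGT\n+\nIIII\n@SIM:1:FC:1:1:1:2 1:N:0:ATCG\nGGGG\n+\nIIII\n' | gzip > R1.fastq.gz
printf '@SIM:1:FC:1:1:1:1 2:N:0:ATCG\nTTTT\n+\nIIII\n@SIM:1:FC:1:1:1:2 2:Y:0:ATCG\nCCCC\n+\nIIII\n' | gzip > R2.fastq.gz

# Flatten each 4-line record onto one tab-separated line, join R1 and R2
# side by side, keep only pairs where BOTH headers carry the N flag,
# then unflatten back into two gzipped fastq files.
joined=$(mktemp)
paste <(zcat R1.fastq.gz | paste - - - -) \
      <(zcat R2.fastq.gz | paste - - - -) \
    | awk -F'\t' '$1 ~ / [^:]*:N:/ && $5 ~ / [^:]*:N:/' > "$joined"
cut -f1-4 "$joined" | tr '\t' '\n' | gzip > filtered_R1.fastq.gz
cut -f5-8 "$joined" | tr '\t' '\n' | gzip > filtered_R2.fastq.gz
rm "$joined"
```

Because both files are filtered from the same joined stream, the outputs can never go asymmetric.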

10.8 years ago
Swbarnes2

Am I the only person who uses all the reads, filtered and unfiltered? Because I started out working on microbial projects, where alignment time is negligible, I wanted to be safe and leave them in. Isn't whether a read maps or not the truer test of its quality? In my work, I'm mostly looking for SNPs, and would rather have some false positives than miss a real one. Is filtering out the non-passing reads standard for other kinds of projects (RNA-seq, exome capture)?


I'm new to the field, but it is my understanding that Illumina did not previously include filtered reads in the sequence files. We took a vote, and few of our analysts really cared about these low-quality reads. So we are filtering them for now, to maintain how things worked before. In the future, we might change to leaving them in if there is ever a reason to do so.


I agree. If there were some compelling reason, like really needing extra coverage, I might include the filtered reads, but usually we have more coverage than we need for our samples, so I'd rather have the high-quality ones.

10.2 years ago
Tolyar

The correct command is:

    for fastq in *.fastq.gz ; do zcat $fastq | sed -n -e '/^@.* [^:]*:N:[^:]/ {N;N;N;p}' | gzip -5 -c > filtered/$fastq ; done

It removes the filtered reads and packs the file again.

You can install CASAVA 1.8.2 where this problem is fixed.