Tool: elPrep 4.0.0, a high-performance drop-in replacement tool for GATK4/Picard/SAMtools for processing SAM/BAM files
8 months ago by
Belgium/Leuven/Imec
Charlotte.Herzeel140 wrote:

Dear colleagues,

We are happy to announce the release of elPrep 4.0.0, an open-source, drop-in replacement for GATK4/Picard/SAMtools for preparing SAM/BAM files for variant calling. It produces identical results while greatly improving computational performance. For more details, see the elPrep GitHub repository.

elPrep 4.0.0 introduces multiple new features that allow it to execute the preparation steps defined by the GATK Best Practices for variant calling.

New features include:

  • added base quality score recalibration (BQSR)
  • added optical duplicate marking
  • added metrics (MultiQC compatible)
  • support for SAM File Format version 1.6
  • support for FASTA and VCF files
  • support for elPrep-specific elsites and elfasta formats
  • split/filter/merge (sfm) mode now implemented in Go instead of Python
  • added --log-path option to all tools
  • various API and performance improvements
  • changed license to the GNU Affero General Public License version 3 as published by the Free Software Foundation, with Additional Terms
  • updated demos

Our benchmarks show that elPrep 4.0.0 executes the sort, duplicate-marking, base-recalibration, and apply-BQSR pipeline from the GATK Best Practices up to 12x faster for WES data and up to 7.5x faster for WGS data, while using similar or fewer compute resources than Picard/GATK4.

Example runtime, RAM use, and disk use for 50x WGS Illumina Platinum Genome NA12878 aligned against hg38. elPrep combines the execution of the 4 pipeline steps for efficient parallel execution.

[Figure: benchmark chart of runtime, RAM use, and disk use; image not available.]

We are looking forward to your feedback and suggestions.

Thanks a lot!

Kind regards,

Charlotte Herzeel, Exascience Life Lab, Imec, Belgium

tool bqsr markduplicates sort sam bam • 1.2k views
ADD COMMENTlink modified 6 weeks ago • written 8 months ago by Charlotte.Herzeel140

Hi, this is a great tool! I feel you forgot to mention elPrep's modularity by design. Adding or removing filters in our elPrep call to suit our pipeline needs is a breeze. This allows us to use it in all sorts of NGS pipelines, not just the GATK Best Practices for variant calling. I also really like that, with a bit of work, it is not too difficult to add new filters to suit our needs. Plus, you have always been very responsive to such requests. This is a very efficient and valuable tool for the community. Thanks!

ADD REPLYlink written 8 months ago by Leonor Palmeira3.7k

Thanks! elPrep is indeed designed as a modular plug-in architecture where the implementation of SAM/BAM tools is separated from the engine that parallelises and merges their execution. We have extensive documentation and very much welcome contributions and suggestions for extending elPrep to support different sequencing pipelines!

ADD REPLYlink written 8 months ago by Charlotte.Herzeel140

Thanks for the API documentation link!

ADD REPLYlink written 8 months ago by Leonor Palmeira3.7k

Hi, I tried this some time ago and found that it made significant assumptions about read names. For example, data from the SRA or from non-Illumina sequencers could not be processed. Have these requirements been relaxed in the meantime?

ADD REPLYlink written 8 months ago by colindaven1.6k

Hi, we only make assumptions about the read names (QNAME) for optical duplicate marking, as they have to encode the tile and coordinates. Is this what you mean? If not, could you provide more details, e.g. the error message you get? Thanks!
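
To make the requirement concrete, here is a minimal Python sketch of how tile and x/y coordinates might be recovered from a QNAME. It assumes the convention used by Picard's default read-name parsing, where a colon-separated name has 5 or 7 fields and the last three are tile, x, and y. The function names, the example read names, and the simplified distance check are illustrative assumptions, not elPrep's actual code; the 100-pixel threshold mirrors Picard's documented default.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    tile: int
    x: int
    y: int


def parse_location(qname: str) -> Optional[Location]:
    """Return tile/x/y from a colon-separated read name, or None if absent."""
    fields = qname.split(":")
    # Assumed convention: 5 or 7 fields, the last three being tile, x, y.
    if len(fields) not in (5, 7):
        return None
    try:
        tile, x, y = (int(f) for f in fields[-3:])
    except ValueError:
        return None
    return Location(tile, x, y)


def is_optical_pair(a: str, b: str, max_dist: int = 100) -> bool:
    """Simplified check: same tile and both coordinates within max_dist.

    A real implementation would also compare read group / lane; that is
    omitted here to keep the sketch short.
    """
    la, lb = parse_location(a), parse_location(b)
    if la is None or lb is None:
        return False  # no tile information, so nothing can be decided
    return (la.tile == lb.tile
            and abs(la.x - lb.x) <= max_dist
            and abs(la.y - lb.y) <= max_dist)
```

A name without colon-separated location fields simply yields `None`, which is the situation where optical duplicate marking cannot be applied.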

ADD REPLYlink written 8 months ago by Charlotte.Herzeel140

If one does not run fastq-dump on SRA data with the -F or --origfmt option, one ends up with FASTQ headers in which the standard Illumina headers are replaced by something that looks like this:

@SRR7716298.4 4 length=100
CTGCAATAAGAGCTCGATGTCATTATGTTAAGAAAAAATGGCTCGGAGGTATGGGAACGAAGTGGTATACTACAGAAACGAGACTTCGTAAGTTCAGGTA
+SRR7716298.4 4 length=100
AAAFFJFJJJJAJ<F-F7FFF<-77-7<-7----FF<<77F7AJAJ7JJJJF7AAA<J<-7-<AA-A77F7-AJJ-<A-AJFJJ--<F7AAA-<7A-F77

I believe that is what @colindaven is referring to. Then there are probably headers from other technologies that don't follow the Illumina format.
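
As a rough illustration, the following Python sketch checks whether a FASTQ header still carries tile and coordinate fields. The regular expression is an assumed pattern for seven-field, colon-separated read names, and the second header in the usage lines is a made-up example; neither is taken from elPrep or Picard.

```python
import re

# Assumed pattern: four leading fields, then tile:x:y as integers.
ILLUMINA_NAME = re.compile(r"^[^:\s]+(?::[^:\s]+){3}:\d+:\d+:\d+$")


def has_tile_coordinates(fastq_header: str) -> bool:
    """True if the read name in a FASTQ header ends in tile:x:y fields."""
    parts = fastq_header.lstrip("@").split()
    if not parts:
        return False
    # Only the read name (text before the first whitespace) is inspected.
    return ILLUMINA_NAME.match(parts[0]) is not None


print(has_tile_coordinates("@SRR7716298.4 4 length=100"))            # False
print(has_tile_coordinates("@M01:23:FC1:1:1101:1500:2000 1:N:0:1"))  # True
```

The SRA-rewritten header from the example above fails the check because the name no longer contains any location fields.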

ADD REPLYlink written 8 months ago by genomax69k

Yes, you are right. We have seen the same problem. elPrep currently only supports the Illumina format for optical duplicate marking, which is what GATK4 also supports by default. If you would like us to support other formats, please submit an issue on our GitHub repository so we can discuss this in more detail. Thanks a lot.

ADD REPLYlink modified 8 months ago • written 8 months ago by Charlotte.Herzeel140

(Edited: solved) It looks like elPrep works in all other cases where there is no optical duplicate marking.

There is a lot of data in SRA for which one cannot recover the original read name formatting, even if the reads were originally produced on such an instrument.

ADD REPLYlink modified 8 months ago • written 8 months ago by Istvan Albert ♦♦ 80k

Nonetheless, it's a nice tool, and it's great that people are trying to speed up bioinformatics infrastructure, akin to what is going on in the commercial and semi-commercial world with DRAGEN, MPEG-G and so on. So I will test it a bit on the Illumina X10 and NextSeq data I have and give feedback on any bugs I encounter.

ADD REPLYlink written 8 months ago by colindaven1.6k

Hi,

I am a bit confused about your remark. We tested elPrep a lot, including on data from SRA archives, but apart from optical duplicate marking, we haven’t encountered any issues because of QNAME fields. When elPrep is not able to recover tile information from the QNAME fields, it will skip optical duplicate marking and log a warning. Any other commands in the elPrep call should continue executing without problems.

There are two other places where elPrep code refers to QNAME fields. One is when sorting reads by queryname. The other is for correlating the two ends of a pair during duplicate marking, and for resolving ties when duplicates have the same phred score. To the best of our knowledge, we are in both cases faithfully reproducing the behaviour of Picard and GATK. We think that even for optical duplicate marking, if Picard sees QNAME fields without tile information, it will also not be able to properly mark optical duplicates.
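
The tie-resolution step can be pictured with a short Python sketch. It assumes each duplicate candidate is scored by the sum of its base qualities of at least 15 (the rule documented for Picard's default scoring strategy) and that equal scores are resolved by taking the lexicographically smaller QNAME. The direction of that tie-break, the data layout, and the function names are assumptions made for illustration, not a description of elPrep's code.

```python
from typing import List, Tuple

# Hypothetical minimal read representation: (QNAME, per-base phred qualities).
Read = Tuple[str, List[int]]


def score(quals: List[int], min_qual: int = 15) -> int:
    """Sum of base qualities, ignoring bases below min_qual."""
    return sum(q for q in quals if q >= min_qual)


def pick_representative(duplicates: List[Read]) -> str:
    """Return the QNAME of the read kept as the non-duplicate.

    Highest score wins; on equal scores the smaller QNAME is chosen
    (assumed tie-break direction).
    """
    best = min(duplicates, key=lambda r: (-score(r[1]), r[0]))
    return best[0]
```

Whatever the exact rule, using the QNAME as a secondary key makes the outcome deterministic and independent of the order in which reads are processed in parallel.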

We are primarily software engineers, so it is certainly possible that we are missing something. If you could clarify which issue you are referring to, we would be very happy to try to fix it.

Thanks a lot for your help.

ADD REPLYlink written 8 months ago by Charlotte.Herzeel140

I was simply reacting to your statement above where you say:

"elPrep currently only supports the Illumina format"

That is a much more restrictive statement than the second statement that you make in your reply:

"but apart from optical duplicate marking, we haven’t encountered any issues because of QNAME fields"

If you support an Illumina-specific functionality (among many others), that does not mean that the tool "only supports Illumina format". Frankly, there is not even such a thing as an "Illumina format". It just happens that, for the past few years, the most popular instruments have produced read names formatted in a certain way, but that is not really a format, nor did Illumina instruments always produce it.

If the tool works fine on, say, PacBio data, other than optical duplicate marking (which would not even apply there anyway), then it is all good and it is a fair replacement for samtools.

ADD REPLYlink modified 8 months ago • written 8 months ago by Istvan Albert ♦♦ 80k

The original elPrep paper describes the sorting and duplicate marking implementations.

Is there a paper in preparation describing the BQSR implementation and the new features?

ADD REPLYlink written 8 months ago by Leonor Palmeira3.7k
7 months ago by
Belgium/Leuven/Imec
Charlotte.Herzeel140 wrote:

We are happy to announce that a preprint of our new paper describing elPrep 4 is now available. See https://t.co/u6h12mQPhx

ADD COMMENTlink written 7 months ago by Charlotte.Herzeel140

The final version of our article was just published by PLOS One. See https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0209523

ADD REPLYlink written 5 months ago by Charlotte.Herzeel140

I read the paper. elPrep seems to be a very promising tool for post-alignment sequence data processing. I am going to try it out on WES somatic (paired and tumor-only) data. It is great that the GATK v4 Best Practices steps for post-alignment processing are included in a modular fashion in one single step. I am curious whether you are considering including indel realignment in elPrep. Some variant callers now include indel realignment as part of variant calling, but there are still some popular somatic variant callers that do not and that benefit from indel realignment prior to variant calling. Currently, indel realignment options are very limited. Thanks!

ADD REPLYlink written 5 months ago by roysomak440

Thanks a lot for trying elPrep! We will look into indel realignment, but I can't promise we will implement this soon.

ADD REPLYlink written 5 months ago by Charlotte.Herzeel140

Would you mind telling us which variant callers you are using that need indel realignment? Thanks.

ADD REPLYlink written 4 months ago by Charlotte.Herzeel140
6 weeks ago by
Belgium/Leuven/Imec
Charlotte.Herzeel140 wrote:

Our article where we compare C++, Java, and Go for implementing elPrep has just been published by BMC Bioinformatics. This article describes the advantages and challenges we encountered in these languages when implementing a SAM/BAM tool and motivates why we ended up choosing Go for elPrep.

See: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2903-5

ADD COMMENTlink written 6 weeks ago by Charlotte.Herzeel140