Blog: NGS one-liner to call variants
6 months ago
barslmn ★ 2.1k

Edit 24 Oct 2023: I use Nextflow in my projects. This is just a curiosity and should be treated as pure experimentation. Take a look at the great advice from experienced users below.

This is a tutorial about creating a pipeline for sequence analysis in a single line. It was made with capture/amplicon short-read sequencing of human DNA in mind and tested with the reference exome sequencing data described here. I share the process and the debugging steps I went through while putting it together. The purpose of this exercise is to investigate shell pipes, and it is purely educational.
Source is available at: https://github.com/barslmn/ngsoneliner/
I couldn't fit everything in a single post; the complete version is available at: https://omics.sbs/blog/NGSoneliner/NGSoneliner.html

Pipeline

# fastp --in1 "$R1" --in2 "$R2" --stdout --html $OUTPUT.fastp.html --json $OUTPUT.fastp.json 2>$OUTPUT.$$.fastp.log |
bwa mem -t "$THREADS" -R "@RG\tID:$SAMPLE\tSM:$SAMPLE\tPL:illumina\tLB:lib1\tPU:foo" "$REFERENCE" "$R1" "$R2" 2>"$OUTPUT.$$.bwa.log" |
    # pv -cN bwa -s "$(gzip -l "$R1" "$R2" | awk '{print $2}' | tail -n1)" |
    samtools collate -@ "$THREADS" -O - |
    samtools fixmate -@ "$THREADS" -m - - |
    samtools sort -@ "$THREADS" - 2>"$OUTPUT.$$.samtools.log" |
    # pv -cN samtools_sort |
    ([ "$AMPLICON" = "NO" ] && samtools markdup -@ "$THREADS" - - || cat) |
    # pv -cN samtools_markdup |
    samtools view -C -T "$REFERENCE" - |
    tee "$OUTPUT.cram" |
    bcftools mpileup --threads "$THREADS" -Ou -A -T "$TARGET" -d 10000 -L 10000 -a "FORMAT/AD,FORMAT/DP" -f "$REFERENCE" - 2>>"$OUTPUT.$$.bcftools.log" |
    # pv -cN bcftools_mpileup |
    bcftools call --threads "$THREADS" -Ou --ploidy "$ASSEMBLY" -mv |
    # pv -cN bcftools_call |
    bcftools view -i 'FORMAT/DP>5&&QUAL>5' |
    bcftools norm --threads "$THREADS" -Ou -m-any --check-ref w -f "$REFERENCE" 2>>"$OUTPUT.$$.bcftools.log" |
    bcftools +fill-tags -Ou -- -t all 2>>"$OUTPUT.$$.bcftools.log" |
    bcftools +setGT -Ou -- -t q -n c:'0/1' -i 'VAF>=.1' 2>>"$OUTPUT.$$.bcftools.log" |
    bcftools +setGT -Ov -- -t q -n c:'1/1' -i 'VAF>=.75' 2>>"$OUTPUT.$$.bcftools.log" |
    /home/bar/ensembl-vep/vep --everything --force_overwrite --vcf --pick --format vcf \
        --fork $THREADS \
        --stats_file "$OUTPUT"_summary.html \
        --warning_file "$OUTPUT"_warnings.txt \
        --output_file STDOUT --cache 2>"$OUTPUT.$$.vep.log" |
    # pv -cN vep |
    bcftools +split-vep -c SYMBOL,gnomADg_AF:Float,IMPACT,Existing_variation 2>>"$OUTPUT.$$.bcftools.log" |
    bcftools filter --threads "$THREADS" -Ou -m+ -s 'onTarget' -M "$TARGET" |
    bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowQual' -g3 -G10 -e 'FORMAT/DP<=15 || QUAL<=20' |
    bcftools filter --threads "$THREADS" -Ou -m+ -s 'highFreq' -e 'gnomADg_AF>0.001' |
    bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowIMPACT' -i 'IMPACT~"HIGH" || IMPACT~"MODERATE"' |
    bcftools filter --threads "$THREADS" -Ou -m+ -s 'HOMrare' -e 'GT="1/1" && (gnomADg_AF <= 0.001 || (Existing_variation="." && gnomADg_AF="." && ID="."))' |
    bcftools filter --threads "$THREADS" -Ob -m+ -s 'HETnovel' -e 'GT="0/1" && Existing_variation="." && gnomADg_AF="." && ID="."' |
    tee "$OUTPUT.bcf" |
    bcftools +split-vep -f "$columns" -d -i 'CANONICAL~"YES"' 2>>"$OUTPUT.$$.bcftools.log" |
    awk -v header="$header" 'BEGIN {printf "%s", header} 1' |
    gzip -c >"$OUTPUT.tsv.gz"

Breakdown

  1. Preprocessing fastq

    fastp with the --stdout parameter writes interleaved output. We just use the default operations. Because it produced different results, it is commented out and not used.

  2. Alignment

    bwa can take interleaved input from stdin with the `-p` option; this was used to read fastp's stdout. Instead, we start from bwa and give both fastq files as inputs.

  3. Processing aligned reads

    The samtools collate, fixmate, sort, and markdup subcommands are used; markdup marks duplicate reads (http://www.htslib.org/doc/samtools-markdup.html). The markdup step is guarded by a conditional on a variable named AMPLICON: for amplicon data, duplicate marking is skipped and the alignments are simply passed through with cat.
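The conditional-passthrough idiom can be sketched with standard tools (tr here is a hypothetical stand-in for samtools markdup):

```shell
# A processing step runs only when the flag allows it; otherwise
# cat forwards the stream untouched, exactly as in the markdup guard.
AMPLICON="YES"
skipped=$(printf 'acgt' | { [ "$AMPLICON" = "NO" ] && tr 'a-z' 'A-Z' || cat; })

AMPLICON="NO"
processed=$(printf 'acgt' | { [ "$AMPLICON" = "NO" ] && tr 'a-z' 'A-Z' || cat; })

echo "$skipped"    # acgt - stand-in step skipped for amplicon data
echo "$processed"  # ACGT - stand-in step applied
```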

  4. Writing out the alignment file

    samtools view converts the BAM stream to CRAM. We use tee to write the CRAM file to disk while passing the stream down the pipeline.
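The tee trick can be demonstrated with a toy stream (file names here are placeholders):

```shell
# tee duplicates the stream: one copy goes to disk, the other continues
# down the pipeline, so the next stage starts before the file is finished.
tmpdir=$(mktemp -d)
count=$(printf 'read1\nread2\nread3\n' |
    tee "$tmpdir/alignments.txt" |
    wc -l)
echo "$count"                      # 3 lines reached the next stage
wc -l < "$tmpdir/alignments.txt"   # 3 lines were also saved to disk
```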

  5. Variant calling

    mpileup has the -d 10000 -L 10000 parameters in case of high-depth amplicon sequencing. Beware of the 1.3.4 commands before the call.

  6. Post processing variants

    Check out this section to understand the rationale behind this view command. Variant normalization is performed by bcftools norm. In the fill-tags command, the specific tag we need is the variant allele fraction (VAF). We later use the VAF in the setGT heterozygous and setGT homozygous commands.

    We use -Ov before the VEP command because I had a problem with VEP reading compressed input from stdin: https://github.com/Ensembl/ensembl-vep/issues/1502
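The genotype-assignment logic of the two setGT calls can be sketched in awk, using the same thresholds (0.1 and 0.75) on a column of VAF values:

```shell
# Mimic the two setGT passes: VAF >= 0.1 sets 0/1, then VAF >= 0.75
# overwrites with 1/1; anything below 0.1 keeps a missing genotype.
gts=$(printf '0.05\n0.30\n0.80\n' |
    awk '{ gt = "./."
           if ($1 >= 0.1)  gt = "0/1"
           if ($1 >= 0.75) gt = "1/1"
           print $1, gt }')
echo "$gts"
```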

  7. Annotating variants

    A bare-bones vep command. The --everything flag brings in a lot of annotation.

  8. Filtering variants

    Here we use split-vep to add some of the columns we're interested in into the INFO column before filtering. Filtering expressions depend on what we're trying to achieve; I just put in the ones I like.

    Here no variant is excluded; the tags we give with the -s parameter are appended to the FILTER column. After filtering, we write the BCF file with the tee command.

  9. Writing out a TSV file

    We define the columns we want in the final table in the columns variable and use split-vep to format the output as a table. We add these column names as the first line with awk after formatting them as a header, and then use gzip to keep the file compressed.
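The header-and-compress step can be sketched on a toy table (the column names here are hypothetical stand-ins for $columns):

```shell
# Prepend a header line with awk, then compress, as in the final TSV step.
# printf "%s" treats the header as data, never as a format string.
header='CHROM\tPOS\tSYMBOL\n'   # awk -v expands the \t and \n escapes
tmp=$(mktemp)
printf '1\t12345\tBRCA1\n' |
    awk -v header="$header" 'BEGIN {printf "%s", header} 1' |
    gzip -c > "$tmp.tsv.gz"
gzip -dc "$tmp.tsv.gz"   # two lines: the header, then the data row
```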

Results and highlights

As a result, this report was created with MultiQC. It would be better to make a hard-filtered VCF and create the report from that, but since this test run already has a small number of variants it wasn't required.

  • Test the command by running it multiple times.
  • Count the data in intermediary steps (md5sum, mapped and paired reads, number of variants etc.).
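The first suggestion can be automated by checksumming the output of repeated runs; run_stage below is a hypothetical stand-in for any pipeline stage:

```shell
# Verify a stage is deterministic by hashing the output of two runs,
# as suggested above for debugging the pipe.
run_stage() { printf 'ACGT\nTTGA\nGGCC\n' | sort; }
sum1=$(run_stage | md5sum | cut -d' ' -f1)
sum2=$(run_stage | md5sum | cut -d' ' -f1)
if [ "$sum1" = "$sum2" ]; then
    echo "deterministic"
else
    echo "outputs differ - investigate this stage"
fi
```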

I would be wary of this for a few reasons.

  1. Forcing a 'pipeline' into literally a long pipe is very fragile; if anything goes wrong anywhere, decoding the error messages is a nightmare.
  2. It uses vastly more memory than is actually necessary for the job.
  3. There are a lot of questionable stages that seem highly customized to your specific project. If they were universally applicable they would have been baked into bcftools as defaults that could be used in a single pass.
  4. It doesn't actually run significantly faster, since multithreaded stages already use all available resources, and some pipeline stages can't even start until the previous one finishes.
  5. You lose the intermediate results from mapping/alignment, which is the slow step. So if you want to re-call variants later with different settings, you won't be able to.
  6. If you actually wanted to call variants quickly in one line, you would get better results faster doing something like this:

    bbmap.sh in=preprocessed_reads.fq ref=ref.fa out=mapped.sam nodisk; callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf

Of course you can pipe those too if you want, but it just wastes memory without actually running substantially faster.


Thanks a lot for your feedback. The main idea is to see how far I can push shell pipes and to show some approaches to problems as they arise.

Forcing a 'pipeline' into literally a long pipe is very fragile; if anything goes wrong anywhere, decoding the error messages is a nightmare.

Yes, it can be fragile, and I spent more time debugging this than I like to admit. Here is one such case.

It uses vastly more memory than is actually necessary for the job.

I didn't compare the memory usage against running the commands separately, but I do track CPU and memory usage and create a plot (example here). It didn't jump out at me as using too much memory. Would it be using memory that isn't tracked as the processes' memory usage?

There are a lot of questionable stages that seem highly customized to your specific project. If they were universally applicable they would have been baked into bcftools as defaults that could be used in a single pass.

Yes, there are a lot of project-specific filters and commands, the setGT calls for instance. This is certainly not something one can copy, paste, and use as is.

It doesn't actually run significantly faster, since multithreaded stages already use all available resources, and some pipeline stages can't even start until the previous one finishes.

I only looked into running time when something was wrong and it took far longer than it was supposed to. The main concern was not making it slower than running the tools in separate steps. The multiple-stage issue was one of the first concerns and needs extra attention, which is discussed here.

You lose the intermediate results from mapping/alignment, which is the slow step. So if you want to re-call variants later with different settings, you won't be able to.

We can write out any intermediate results with tee commands, which is also useful for debugging. If there is any speed gain, I guess it would come here, since variant calling starts while the alignment file is still being written to disk.

If you actually wanted to call variants quickly in one line, you would get better results faster doing something like this:

bbmap.sh in=preprocessed_reads.fq ref=ref.fa out=mapped.sam nodisk; callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf

I have seen bbmap a lot, unfortunately haven't got the chance to try it out. I would like to use it in a future project.


Yes it can be fragile, and I spend way too much time debugging this more than I like to admit.

That's all you need. You're caught up in sunk costs. Abandon ship and learn workflow tools like Snakemake or Nextflow. This script is bad and it is wasting your time.

We can write out any intermediate results with tee commands.

As can we with custom file descriptors. That is not the point. Unix has a simple philosophy of each tool doing one thing really well, and your "pipe" ruins that philosophy.
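The custom-file-descriptor alternative mentioned here can be sketched like this (Linux /dev/fd support assumed; the file name is a placeholder):

```shell
# Save an intermediate copy through file descriptor 3 instead of tee-to-file.
tmp=$(mktemp)
exec 3>"$tmp"                  # open fd 3 for the intermediate copy
lines=$(printf 'a\nb\n' | tee /dev/fd/3 | wc -l)
exec 3>&-                      # close fd 3
echo "$lines"                  # lines that went downstream
wc -l < "$tmp"                 # same lines captured via fd 3
```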


I just had an idea that seemed interesting and saw it through to the end. I use and advocate for workflow managers in projects.


This is a tutorial about creating a pipeline for sequence analysis in a single line.

While I applaud your contribution, this line does not tell the user what kind of "sequence analysis" is being done. "NGS one-liner to call variants" may be a more apt title? Certainly a strong code golf candidate (after looking at the longer post linked above).

That observation about fastp is also strange. In what way are the three outputs different? Do you get different read numbers? Order of reads would probably be different if you use more than one thread.


Thanks a lot for your feedback.

The purpose of the exercise was to see how much I can build using just shell pipes. I did show this to someone who has some acquaintance with Linux and bioinformatics but no experience building pipelines. However, our study was not about calling variants, and we only use the alignment part, not the rest. So even though I don't suggest using this as a whole, parts of it can be of use.

While I applaud your contribution, this line does not tell the user what kind of "sequence analysis" you are referring to.

Absolutely, it should have been made clear; that was an oversight. It was made with capture/amplicon short-read sequencing of human DNA in mind and tested with the reference exome sequencing data described here.

NGS one liner to call variants may be a more apt title?

I went through several different titles and chose a catchy one, but being a bit more descriptive would be more helpful; I will change the title.

That observation about fastp is also strange. In what way are the three outputs different? Do you get different read numbers? Order of reads would probably be different if you use more than one thread.

Yes, threading was the first thing I checked, but I was running it single-threaded. fastp fixed earlier randomness problems with threading. I haven't investigated further since the scope of the post was getting the pipeline running.


I think people should start by chaining commands together in shell scripts, then progress to something like Snakemake. This approach is a giant step backwards. Brian has raised a number of valuable points; I'll give you my gut reaction:

No, please. Noooooooooooo! You're a respected user on the forum, please don't guide people down a bad path. Beginners will copy-paste your code with no idea of set -o pipefail and say "didn't work" when the command waits for completion on a long step. Things you can pipe are like samtools sort - | samtools view -C -T ... -o xyz.cram to go from unsorted BAM to CRAM. The pipe you've written is too intense to be a single command.
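The pipefail point can be demonstrated directly (bash assumed for the subshells):

```shell
# Without pipefail a pipeline's exit status is the LAST command's, so an
# upstream failure (a crashed aligner, say) is silently masked.
status_without=$(bash -c 'false | cat; echo $?')
echo "without pipefail: $status_without"   # 0 - failure of false is hidden

# With pipefail the status reflects the rightmost failing command.
status_with=$(bash -c 'set -o pipefail; false | cat; echo $?')
echo "with pipefail: $status_with"         # 1 - the failure surfaces
```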


It certainly had a surprising effect and started a discussion I learned from. You're correct about misguiding beginners; I have taken your advice, changed the post type to blog, and added a disclaimer to the top.


That's a nice disclaimer - this script is a nice plaything for people who know how to get stuff done, but it def is a quagmire for beginners.
