Tutorial: MutScan: Detect important mutations by scanning FastQ files directly
12 months ago by
Strange Tools: https://github.com/OpenGene
chen970 wrote:

MutScan (https://github.com/OpenGene/MutScan)

  • Ultra sensitive
  • 20X+ faster than a normal pipeline (e.g. BWA + Samtools + GATK/VarScan/Mutect)
  • Very easy to use. Nothing else is needed: no alignment, no reference assembly, no variant calling, no pileup...
  • Beautiful HTML report
  • Multi-threading support
  • Supports both single-end and paired-end data
  • For paired-end data, MutScan tries to merge each pair and performs quality adjustment and error correction
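
To give a rough idea of what merging a read pair with quality-based error correction can look like, here is a minimal Python sketch. It is an illustration only and not MutScan's actual implementation: the function names, the `min_overlap` and `max_mismatch` parameters, and the rule "keep the base with the higher quality" are all assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(seq1, qual1, seq2, qual2, min_overlap=10, max_mismatch=2):
    """Merge read1 with the reverse complement of read2 if they overlap.

    Returns (merged_seq, merged_qual), or None when no acceptable overlap
    is found. Quality strings are compared character by character, which
    orders them by Phred score as long as both use the same encoding.
    """
    rc2 = reverse_complement(seq2)
    rq2 = qual2[::-1]
    # Try the longest candidate overlap first: suffix of read1 vs prefix of rc(read2).
    for overlap in range(min(len(seq1), len(rc2)), min_overlap - 1, -1):
        start = len(seq1) - overlap
        mismatches = sum(a != b for a, b in zip(seq1[start:], rc2[:overlap]))
        if mismatches > max_mismatch:
            continue
        merged_seq = list(seq1[:start])
        merged_qual = list(qual1[:start])
        for i in range(overlap):
            # Error correction (assumed rule): keep the base with the higher quality.
            if rq2[i] > qual1[start + i]:
                merged_seq.append(rc2[i])
                merged_qual.append(rq2[i])
            else:
                merged_seq.append(seq1[start + i])
                merged_qual.append(qual1[start + i])
        merged_seq.extend(rc2[overlap:])
        merged_qual.extend(rq2[overlap:])
        return "".join(merged_seq), "".join(merged_qual)
    return None
```

With this rule, a low-quality mismatch inside the overlap is replaced by the high-quality base from the other read, so the merged read is both longer and cleaner than either input.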


# download via HTTP

# or clone with git
git clone https://github.com/OpenGene/MutScan.git


cd MutScan


usage: mutscan -1 <read1_file> -2 <read2_file> -m <mutation_file> -h <html_report_file> -t <thread>  
  -1, --read1       read1 file name (string)
  -2, --read2       read2 file name (string)
  -m, --mutation    optional, mutation file name (string)
  -h, --html        optional, filename of html report, no html report if not specified (string)
  -?, --help        print this message
  -t, --thread      thread number, default 4 (int)

The plain-text result, which contains the detected mutations and their supporting reads, is printed directly to standard output. You can use > to redirect the output to a file, like:

mutscan -1 <read1_file_name> -2 <read2_file_name> -m <mutation_file_name> > result.txt

You can also generate an HTML report with the -h argument, like:

mutscan -1 <read1_file_name> -2 <read2_file_name> -m <mutation_file_name> -h report.html

Single-end and paired-end

For single-end sequencing data, omit the -2 argument:

mutscan -1 <read1_file_name> -m <mutation_file_name>
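
If you call MutScan from a script, the command line can be assembled programmatically. The sketch below is a small Python wrapper that uses only the options listed in the usage text above (-1, -2, -m, -h, -t). The function names and the sample file names are hypothetical, and it assumes a `mutscan` executable is on your PATH.

```python
import shlex
import subprocess


def build_mutscan_command(read1, read2=None, mutation_file=None,
                          html_report=None, threads=None, executable="mutscan"):
    """Assemble the argument list from the options shown in the usage text."""
    cmd = [executable, "-1", read1]
    if read2:                     # paired-end data: add the second read file
        cmd += ["-2", read2]
    if mutation_file:             # optional mutation CSV
        cmd += ["-m", mutation_file]
    if html_report:               # optional HTML report
        cmd += ["-h", html_report]
    if threads is not None:       # optional thread count
        cmd += ["-t", str(threads)]
    return cmd


def run_mutscan(cmd, result_path):
    """Run the command and write its plain-text output to a file (like `> result.txt`)."""
    with open(result_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    single_end = build_mutscan_command("sample_R1.fq", html_report="report.html")
    print(shlex.join(single_end))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.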

Mutation file

A CSV file with the columns name, left_seq_of_mutation_point, mutation_seq and right_seq_of_mutation_point:

#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point
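
As a sketch of how such a file could be produced and read back, the Python below builds a row for a single-base substitution from a known sequence and writes it with the header shown above. The helper names, the `flank` length, and the example sequence are made up for illustration; whether MutScan expects a particular flank length is not stated here.

```python
import csv
import io

HEADER = "#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point"


def substitution_row(name, reference, pos, alt, flank=10):
    """Build one row for a substitution at 0-based position `pos` of `reference`."""
    left = reference[max(0, pos - flank):pos]
    right = reference[pos + 1:pos + 1 + flank]
    return [name, left, alt, right]


def write_mutation_csv(rows, handle):
    """Write the header comment line followed by one CSV line per mutation."""
    handle.write(HEADER + "\n")
    csv.writer(handle, lineterminator="\n").writerows(rows)


def read_mutation_csv(handle):
    """Yield (name, left, mutation_seq, right) tuples, skipping blank and # lines."""
    lines = (l for l in handle if l.strip() and not l.startswith("#"))
    for fields in csv.reader(lines):
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {fields}")
        yield tuple(f.strip() for f in fields)


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    buf = io.StringIO()
    write_mutation_csv(
        [substitution_row("TestMutation", "ACGTACGTACGTACGTACGTA", 10, "T", flank=5)], buf
    )
    print(buf.getvalue())
```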

A default CSV file containing important actionable cancer gene targets is already provided in mutation/cancer.csv. If you want to use this mutation file directly, the mutation_file_name argument can be omitted:

mutscan -1 <read1_file_name> -2 <read2_file_name>

HTML output

If the -h or --html argument is given, an HTML report will be generated and written to the given filename. A sample report is given here:

The color of each base indicates its quality, and the quality value is shown when you hover the mouse over the base.
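
For readers unfamiliar with how per-base quality is stored in FastQ files, here is a small Python sketch that decodes a quality string and renders a base with a hover tooltip. It assumes the common Phred+33 encoding; the thresholds, class names and HTML markup are hypothetical and are not taken from MutScan's report.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FastQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in quality_string]


def error_probability(score):
    """Convert a Phred score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


def quality_class(score):
    """Bucket a score for coloring. These thresholds are illustrative only."""
    if score >= 30:
        return "high"
    if score >= 20:
        return "medium"
    return "low"


def base_to_html(base, score):
    """Render one base as a span whose title attribute shows the quality on hover."""
    return f'<span class="q-{quality_class(score)}" title="Q{score}">{base}</span>'


if __name__ == "__main__":
    for base, score in zip("ACG", phred_scores("I#5")):
        print(base_to_html(base, score))
```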

modified 11 months ago • written 12 months ago by chen970

Cool. How does this work? Are you doing some kind of fuzzy k-mer alignment to target genes?

written 12 months ago by Damian Kao14k

Yes. Basically this is an implementation of a sequence string searching algorithm, but with support for error tolerance, quality handling and other sequence-related features.
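
To make the idea concrete, here is a minimal Python sketch of error-tolerant string searching: slide the pattern left + mutation + right along each read and accept windows within a small Hamming distance. This is only an illustration of the general technique, not MutScan's code. Requiring the mutation bases themselves to match exactly, and the `max_mismatch` default, are assumptions.

```python
def hamming_within(a, b, limit):
    """Return True if equal-length strings a and b differ in at most `limit` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True


def find_supporting_reads(reads, left, mutation_seq, right, max_mismatch=1):
    """Return (read, offset) for every read that contains the mutated pattern.

    Mismatches are tolerated in the flanks, but the mutation bases must
    match exactly (an assumption made for this sketch).
    """
    pattern = left + mutation_seq + right
    mut_start = len(left)
    mut_end = mut_start + len(mutation_seq)
    supporting = []
    for read in reads:
        for offset in range(len(read) - len(pattern) + 1):
            window = read[offset:offset + len(pattern)]
            if window[mut_start:mut_end] != mutation_seq:
                continue
            if hamming_within(window, pattern, max_mismatch):
                supporting.append((read, offset))
                break  # one hit per read is enough
    return supporting
```

A brute-force scan like this is quadratic in the worst case; a real tool would need indexing or hashing to be fast, and would also have to handle the reverse strand.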

written 12 months ago by chen970

Am I correct in assuming that this tool is designed for human samples only?

written 12 months ago by harold.smith.tarheel3.8k

No, you can specify any sequence in the mutation list CSV file.

written 12 months ago by chen970

Can we use this for RNAseq data?

written 12 months ago by Ron600

Sure, it is just sequence. But protein sequence is not supported yet.

written 11 months ago by chen970
12 months ago by
Dan Gaston6.8k wrote:

If you're looking for some feedback:

1) Output a VCF file. You can also output a CSV file, or have them as selectable options, but at the very least give the option of outputting a properly formatted VCF file. If you want to gain traction as a tool you need to conform to widely adopted standards and fit into people's workflows.

2) Your CSV file, jamming all of the extra annotation info into a name field can be useful as a shortcut, and I can see the appeal for it creating a unique ID, but it also makes it less useful in the end, particularly if I'm dealing with a CSV file with potentially a large number of variants in it. If I'm going to load that into an excel file I want the info in that name field you have to be separate columns. And while I can do splits on additional characters when I import the data that creates an extra unnecessary step and makes sharing the raw CSV file with less computationally savvy colleagues less appealing.

3) The HTML report looks nice but doesn't seem to highlight the "TestMutation" very obviously, at least when I glance at the report.

4) If the mutation file is optional and it doesn't use any sort of reference, is the extra info about the mutation that you have in the name field added? I'm assuming it is using this mutation file as the "reference" for annotation?

Otherwise this looks interesting and I have a bunch of data to test it on.

written 12 months ago by Dan Gaston6.8k

Thanks, good suggestions.

written 12 months ago by chen970
12 months ago by
John11k wrote:

I'm very skeptical that this will perform anywhere near as well as a 'normal' SNP calling pipeline. I don't doubt I'm probably missing something though. Maybe a brain. Can you confirm that:

  1. Adapters and poor-quality bases will be removed and not called as variants, even though the adapter sequence isn't known to your algorithm?
  2. The optical/pcr duplicates will be marked and discarded?
  3. All the usual post-mapping work that informs SNP calling (such as realignment around indels) is done, or is not an issue due to the way you make your graphs?
  4. You can detect frameshift mutations anywhere in the gene body without specifying exactly what that mutation should look like.
  5. How can you produce VCFs without mapping to the genome? I suppose you cannot. Which suggests this would be incompatible with all the other down-stream variant calling tools out there.

Personally I think the 20x faster claim is like comparing apples to oranges. My gut-instinct is to think there's no way this approach, with less information about the genome, can possibly detect variants as well as the 'normal' method. Worse, I suspect it will be used inappropriately by people looking to cut corners.

On the positive side, I suppose to even check 1 entry in the CSV file, you have to build the entire graph. If you could write out the graph after building it, I can see that a number of time-memory tradeoff techniques could be developed in the future. This would also be the go-to tool for looking at variants in organisms without sequenced/annotated genomes. Although it's debatable how your CSV file would look for an organism without a sequenced genome...

modified 11 months ago • written 12 months ago by John11k

Thanks for your comments.

Actually, this tool is used to detect low-frequency mutations in deep sequencing data. It is not a complete variant caller, or even a minimal one.

1. Adapters are not processed here, but this doesn't actually affect the result. Quality is handled: low-quality reads are filtered out.
2. No deduplication is implemented, but the resulting HTML report can show the duplicates.
3. Post-mapping work is not needed, because this tool doesn't use alignment information.
4. Frameshifts and small indels can be handled: just provide the sequence as it appears after the indel in the CSV.
5. No VCF generation.
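
For point 4, one way to derive "the sequence after the indel" is to apply the indel to the known sequence first and then cut the CSV columns out of the result. The Python sketch below does that. The function name, the `flank` length, and the choice to use the two bases spanning the junction as mutation_seq for a pure deletion are assumptions for illustration, not a documented MutScan convention.

```python
def indel_row(name, reference, pos, deleted=0, inserted="", flank=10):
    """Build [name, left, mutation_seq, right] for an indel.

    `pos` is the 0-based index of the first affected base in `reference`;
    `deleted` bases are removed there and `inserted` is put in their place.
    """
    if deleted == 0 and not inserted:
        raise ValueError("nothing to insert or delete")
    mutated = reference[:pos] + inserted + reference[pos + deleted:]
    if inserted:
        # Insertion (or delins): the inserted bases are the mutation sequence.
        mut_start, mut_end = pos, pos + len(inserted)
    else:
        # Pure deletion (assumed convention): take the two bases that become
        # adjacent across the junction as the mutation sequence.
        if pos < 1:
            raise ValueError("a deletion needs at least one base before it")
        mut_start, mut_end = pos - 1, pos + 1
    left = mutated[max(0, mut_start - flank):mut_start]
    mutation_seq = mutated[mut_start:mut_end]
    right = mutated[mut_end:mut_end + flank]
    return [name, left, mutation_seq, right]


if __name__ == "__main__":
    # Made-up sequence: delete the three G bases starting at index 6.
    print(",".join(indel_row("TestDeletion", "AAACCCGGGTTTACGTACGT", 6, deleted=3, flank=4)))
```

Because the row is cut from the already-mutated sequence, left + mutation_seq + right is exactly what a read carrying the indel would contain.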

modified 12 months ago • written 12 months ago by chen970
11 months ago by
United States
harold.smith.tarheel3.8k wrote:

Since I didn't get a response to my comment/questions, I'll repost as an answer:

1) I can envision how the tool might add annotations to variants that are listed in the optional mutation file. But how does it treat de novo mutations, or cases where the mutation file is not provided?

2) Absent a reference and mutation file, how do you call a homozygous variant?

3) You state that the tool is useful for ultra-low frequency mutations. How do you discriminate those from common sequencing errors?

written 11 months ago by harold.smith.tarheel3.8k

This tool is not a variant caller. It just helps eliminate false negatives in low-frequency mutation detection.
In some applications, like circulating tumor DNA sequencing, the mutated reads are usually very few and may not be detected by normal pipelines.
In this case, this tool can scan the important mutation loci (like EGFR L858R, which makes patients sensitive to EGFR TKI treatment) to check for false negatives.
If you have experience with ctDNA sequencing, you may understand why I developed this.

modified 11 months ago • written 11 months ago by chen970
12 months ago by
Asaf4.4k wrote:

I'm wondering if this approach can be adapted to detect closely related strains in metagenomics samples. If, for instance, you have a region of a relatively conserved gene, you will see some variation in it when you have close strains or species. Since mapping is usually not an option and assembly would probably miss these kinds of variation, such a tool could give very important input on how to assemble the sample (error tolerance etc.). Any thoughts?

written 12 months ago by Asaf4.4k
12 months ago by
Strange Tools: https://github.com/OpenGene
chen970 wrote:

This tool can be very useful for cancer somatic mutation detection, especially for detecting ultra-low-frequency mutations in deep sequencing data.

This tool can be used directly in liquid biopsy, like ctDNA sequencing.

written 12 months ago by chen970

Yes, in theory it works, but in practice you need to prove it works to high accuracy and works better than other alternatives.

written 11 months ago by lh330k


We're currently testing it with ~1000 cfDNA samples. I will post the results once it's done.

written 11 months ago by chen970

Hi Chen,

Were you able to test out 1000 cfDNA samples? I would be very interested in knowing the answer.


written 6 weeks ago by caspase8mach0