Tutorial: MutScan: Detect important mutations by scanning FastQ files directly
chen wrote, 4.1 years ago:

MutScan (https://github.com/OpenGene/MutScan)

  • Ultra-sensitive
  • More than 20X faster than a conventional pipeline (e.g. BWA + Samtools + GATK/VarScan/Mutect)
  • Very easy to use and needs nothing else: no alignment, no reference assembly, no variant calling, no pileup...
  • Beautiful HTML report
  • Multi-threading support
  • Supports both single-end and paired-end data
  • For paired-end data, MutScan tries to merge each pair, and performs quality adjustment and error correction
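
The post does not describe how MutScan merges pairs, so the following is only a minimal sketch of the general idea, under stated assumptions: read 2 is reverse-complemented, the longest acceptable overlap with the end of read 1 is searched for, and where the two reads disagree the base with the higher quality wins. The function names, `min_overlap` and `max_mismatches` are all hypothetical and are not MutScan parameters.

```python
# Illustrative overlap merge for a read pair; NOT MutScan's actual code.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(seq1, qual1, seq2, qual2, min_overlap=8, max_mismatches=1):
    """Merge read 1 with the reverse complement of read 2.

    Quality strings are assumed to be Phred+33 encoded, so comparing the
    characters directly compares the quality scores.
    Returns (merged_seq, merged_qual), or None if no overlap is accepted.
    """
    rc2 = reverse_complement(seq2)
    rq2 = qual2[::-1]
    # Try the longest possible overlap first, then shorter ones.
    for overlap in range(min(len(seq1), len(rc2)), min_overlap - 1, -1):
        tail, head = seq1[-overlap:], rc2[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches > max_mismatches:
            continue
        bases, quals = [], []
        for a, qa, b, qb in zip(tail, qual1[-overlap:], head, rq2[:overlap]):
            # Keep the base (and quality) from whichever read is more confident.
            if qa >= qb:
                bases.append(a)
                quals.append(qa)
            else:
                bases.append(b)
                quals.append(qb)
        merged_seq = seq1[:-overlap] + "".join(bases) + rc2[overlap:]
        merged_qual = qual1[:-overlap] + "".join(quals) + rq2[overlap:]
        return merged_seq, merged_qual
    return None


# Made-up 22 bp fragment; read 1 carries one low-quality error in the overlap.
fragment = "ACGTTGCAAGGCTTAACCGGTA"
read1 = "ACGTTGCAAGTCTTAA"
read2 = reverse_complement(fragment[6:])
print(merge_pair(read1, "I" * 10 + "#" + "I" * 5, read2, "I" * 16))
```

In this toy case the low-quality base in read 1 is overruled by the high-quality base from read 2, which is the kind of error correction the feature list alludes to.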


# download over HTTP

# or clone with git
git clone https://github.com/OpenGene/MutScan.git


cd MutScan


usage: mutscan -1 <read1_file> -2 <read2_file> -m <mutation_file> -h <html_report_file> -t <thread>  
  -1, --read1       read1 file name (string)
  -2, --read2       read2 file name (string)
  -m, --mutation    optional, mutation file name (string)
  -h, --html        optional, filename of html report, no html report if not specified (string)
  -?, --help        print this message
  -t, --thread      thread number, default 4 (int)

The plain-text result, which contains the detected mutations and their supporting reads, is printed directly to standard output. You can use > to redirect the output to a file, for example:

mutscan -1 <read1_file_name> -2 <read2_file_name> -m <mutation_file_name> > result.txt

You can also generate an HTML report with the -h argument, for example:

mutscan -1 <read1_file_name> -2 <read2_file_name> -m <mutation_file_name> -h report.html

Single-end and paired-end

For single-end sequencing data, the -2 argument is omitted:

mutscan -1 <read1_file_name> -m <mutation_file_name>
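
If you call MutScan from a script, the single-end and paired-end cases differ only in whether -2 is passed. Here is a small sketch of a wrapper that assembles the command from the options listed in the usage text above. The helper names are hypothetical, and it assumes a `mutscan` executable is on your PATH.

```python
import subprocess


def build_mutscan_command(read1, read2=None, mutation_file=None,
                          html_report=None, threads=None,
                          executable="mutscan"):
    """Assemble a mutscan command line using only the documented options."""
    cmd = [executable, "-1", read1]
    if read2:                    # omit -2 for single-end data
        cmd += ["-2", read2]
    if mutation_file:            # omit -m to use the built-in default list
        cmd += ["-m", mutation_file]
    if html_report:              # no HTML report unless -h is given
        cmd += ["-h", html_report]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


def run_mutscan(cmd, result_path):
    """Run the command and redirect the plain-text result to a file."""
    with open(result_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Single-end: only -1 (and optionally -m) is passed.
print(build_mutscan_command("reads.fq", mutation_file="mutations.csv"))
# Paired-end with an HTML report and 8 threads.
print(build_mutscan_command("R1.fq", "R2.fq", html_report="report.html", threads=8))
```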

Mutation file

The mutation file is a CSV file with four columns: name, left_seq_of_mutation_point, mutation_seq and right_seq_of_mutation_point:

#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point
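
As a rough illustration of this layout, here is a sketch that writes such a file from (name, left, mutation, right) tuples. The entry name and sequences are made up, the helper names are hypothetical, and the checks (no comma in the name, non-empty A/C/G/T sequences) are my own assumptions and not rules documented by MutScan.

```python
import io

# Header line copied from the column description above.
HEADER = "#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point"


def format_mutation(name, left, mutation, right):
    """Format one mutation as a CSV line in the four-column layout."""
    if "," in name:
        raise ValueError("name must not contain a comma")
    for label, seq in (("left", left), ("mutation", mutation), ("right", right)):
        # Assumption: every field is a non-empty DNA string.
        if not seq or set(seq.upper()) - set("ACGT"):
            raise ValueError(f"{label} sequence must be non-empty A/C/G/T")
    return ", ".join([name, left.upper(), mutation.upper(), right.upper()])


def write_mutation_file(handle, mutations):
    """Write the header followed by one line per mutation."""
    handle.write(HEADER + "\n")
    for name, left, mutation, right in mutations:
        handle.write(format_mutation(name, left, mutation, right) + "\n")


# Made-up example entry: a single-base change flanked by 10 bases each side.
buffer = io.StringIO()
write_mutation_file(buffer, [("ExampleMutation", "ACGTACGTAC", "T", "GGCATGCATG")])
print(buffer.getvalue())
```

To write a real file, pass an open file handle instead of the `io.StringIO` buffer.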

A default CSV file containing important actionable cancer gene targets is provided in mutation/cancer.csv. If you want to use this mutation file directly, the -m argument can be omitted:

mutscan -1 <read1_file_name> -2 <read2_file_name>

HTML output

If the -h or --html argument is given, an HTML report is generated and written to the given filename. A sample report is given here:

The color of each base indicates its quality, and the quality value is shown when you hover the mouse over the base.
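
For readers unfamiliar with FASTQ qualities, here is a short sketch of how a quality string maps to per-base scores. It assumes the common Phred+33 encoding. How MutScan maps scores to colors is not described in this post, so no color thresholds are shown.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# 'I' decodes to Q40, '5' to Q20 and '#' to Q2 under Phred+33.
for char, score in zip("I5#", phred_scores("I5#")):
    print(char, score, error_probability(score))
```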

modified 2.8 years ago by lebedana21 • written 4.1 years ago by chen

Cool. How does this work? Are you doing some kind of fuzzy k-mer alignment to target genes?

written 4.1 years ago by Damian Kao

Yes. Basically, this is an implementation of a sequence string-searching algorithm, but with support for error tolerance, quality handling and other sequence-related features.
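
To make the idea concrete, here is a minimal sketch of error-tolerant string searching. It is not MutScan's actual algorithm, just one simple way to do it: slide the pattern (left flank + mutation + right flank) along the read, require the mutated bases to match exactly, and allow a limited number of mismatches in the flanks. The function names and the `max_mismatches` parameter are hypothetical.

```python
def within_mismatches(a, b, max_mismatches):
    """Return True if equal-length strings differ in at most max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def find_mutation(read, left, mutation, right, max_mismatches=1):
    """Return the offset of the first tolerant match in the read, or -1."""
    pattern_length = len(left) + len(mutation) + len(right)
    mut_start, mut_end = len(left), len(left) + len(mutation)
    for start in range(len(read) - pattern_length + 1):
        window = read[start:start + pattern_length]
        # The mutated bases themselves must be present exactly.
        if window[mut_start:mut_end] != mutation:
            continue
        # Sequencing errors are tolerated only in the flanking sequence.
        flanks = window[:mut_start] + window[mut_end:]
        if within_mismatches(flanks, left + right, max_mismatches):
            return start
    return -1


# Made-up sequences: mutant read, mutant read with one flank error, wild-type read.
print(find_mutation("TTTACGTACTGGCATGAA", "ACGTAC", "T", "GGCATG"))
print(find_mutation("TTTACCTACTGGCATGAA", "ACGTAC", "T", "GGCATG"))
print(find_mutation("TTTACGTACCGGCATGAA", "ACGTAC", "T", "GGCATG"))
```

A real implementation would also weigh base qualities and would need something faster than this brute-force scan for large FastQ files.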

written 4.1 years ago by chen

Am I correct in assuming that this tool is designed for human samples only?

written 4.1 years ago by harold.smith.tarheel

No, you can specify any sequence in the mutation list CSV file.

written 4.1 years ago by chen

Can we use this for RNAseq data?

written 4.1 years ago by Ron

Sure, it is just sequence data. But protein sequences are not supported yet.

written 4.1 years ago by chen

How do I open this program? I don't understand. Please help me.

written 5 months ago by andrre15200

DG wrote, 4.1 years ago:

If you're looking for some feedback:

1) Output a VCF file. You can also output a CSV file, or have them as selectable options, but at the very least give the option of outputting a properly formatted VCF file. If you want to gain traction as a tool you need to conform to widely adopted standards and fit into people's workflows.

2) In your CSV file, jamming all of the extra annotation info into a name field can be useful as a shortcut, and I can see the appeal of it creating a unique ID, but it also makes the file less useful in the end, particularly if I'm dealing with a CSV file with a potentially large number of variants in it. If I'm going to load that into an Excel file, I want the info in that name field to be in separate columns. And while I can split on additional characters when I import the data, that creates an extra, unnecessary step and makes sharing the raw CSV file with less computationally savvy colleagues less appealing.

3) The HTML report looks nice but doesn't seem to highlight the "TestMutation" very obviously, at least when I glance at the report.

4) If the mutation file is optional and it doesn't use any sort of reference, is the extra info about the mutation that you have in the name field added? I'm assuming it is using this mutation file as the "reference" for annotation?

Otherwise this looks interesting and I have a bunch of data to test it on.

written 4.1 years ago by DG

Thanks, good suggestions.

written 4.1 years ago by chen

John wrote, 4.1 years ago:

I'm very skeptical that this will perform anywhere near as well as a 'normal' SNP calling pipeline. I don't doubt I'm probably missing something, though. Maybe a brain. Can you confirm that:

  1. Adapters and poor-quality bases will be removed and not called as variants, even though the adapter sequence isn't known to your algorithm?
  2. The optical/pcr duplicates will be marked and discarded?
  3. All the usual post-mapping work that informs SNP calling (such as realignment around indels) is done, or is not an issue due to the way you make your graphs?
  4. You can detect frameshift mutations anywhere in the gene body without specifying exactly what that mutation should look like.
  5. How can you produce VCFs without mapping to the genome? I suppose you cannot. Which suggests this would be incompatible with all the other down-stream variant calling tools out there.

Personally, I think the 20x faster claim is like comparing apples to oranges. My gut instinct is that there's no way this approach, with less information about the genome, can possibly detect variants as well as the 'normal' method. Worse, I suspect it will be used inappropriately by people looking to cut corners.

On the positive side, I suppose that to check even 1 entry in the CSV file, you have to build the entire graph. If you could write out the graph after building it, I can see that a number of time-memory tradeoff techniques could be developed in the future. This would also be the go-to tool for looking at variants in organisms without sequenced/annotated genomes, although it's debatable how your CSV file would look for an organism without a sequenced genome...

modified 4.1 years ago • written 4.1 years ago by John

Thanks for your comments.

Actually, this tool is used to detect low-frequency mutations in deep sequencing data; it is not a complete variant caller, or even a tiny one.

1. Adapters are not processed here, but this doesn't actually affect the result. Quality is handled: low-quality reads are filtered out.
2. No deduplication is implemented, but the resulting HTML report can show the duplicates.
3. Post-mapping work is not needed, because this tool doesn't use alignment information.
4. Frameshifts or small indels can be handled: just provide the sequence after the indel in the CSV.
5. No VCF generation.

modified 4.1 years ago • written 4.1 years ago by chen

harold.smith.tarheel (United States) wrote, 4.1 years ago:

Since I didn't get a response to my comment/questions, I'll repost as an answer:

1) I can envision how the tool might add annotations to variants that are listed in the optional mutation file. But how does it treat de novo mutations, or cases where the mutation file is not provided?

2) Absent a reference and mutation file, how do you call a homozygous variant?

3) You state that the tool is useful for ultra-low frequency mutations. How do you discriminate those from common sequencing errors?

written 4.1 years ago by harold.smith.tarheel

This tool is not a variant caller. It just helps eliminate false negatives in low-frequency mutation detection.
In some applications, like circulating tumor DNA sequencing, the mutated reads are usually very few and may not be detected by normal pipelines.
In this case, this tool can scan important mutation loci (like EGFR L858R, which makes patients sensitive to EGFR TKI treatment) to check for false negatives.
If you have experience with ctDNA sequencing, you may understand why I developed this.
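
To give a feel for how few mutated reads there can be, here is a back-of-the-envelope sketch that treats each read at a locus as an independent draw (a simple binomial model). The depths and allele fraction below are illustrative assumptions and not measurements from any dataset.

```python
from math import comb


def expected_mutant_reads(depth, allele_fraction):
    """Expected number of reads carrying the mutation at a locus."""
    return depth * allele_fraction


def prob_at_least(k, depth, allele_fraction):
    """P(X >= k) for X ~ Binomial(depth, allele_fraction).

    Intended for small k, since it sums the first k terms directly.
    """
    p_fewer = sum(
        comb(depth, i) * allele_fraction ** i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


# Hypothetical numbers: a 0.1% mutant allele fraction at two sequencing depths.
for depth in (1000, 10000):
    print(depth,
          expected_mutant_reads(depth, 0.001),
          prob_at_least(1, depth, 0.001),
          prob_at_least(5, depth, 0.001))
```

With these assumed numbers, only about one mutant read is expected at 1000x and about ten at 10000x, so a pipeline that requires several supporting reads can easily miss the mutation at the lower depth.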

modified 4.1 years ago • written 4.1 years ago by chen

Asaf wrote, 4.1 years ago:

I'm wondering if this approach can be adapted to detect closely related strains in metagenomics samples. If, for instance, you have a region of a relatively conserved gene, you will see some variation in it when you have close strains or species. Since mapping is usually not an option and assembly would probably miss these kinds of variation, using such a tool would give very important input for how to assemble the sample (error tolerance etc.). Any thoughts?

written 4.1 years ago by Asaf

chen wrote, 4.1 years ago:

This tool can be very useful for cancer somatic mutation detection, especially for detecting ultra-low-frequency mutations in deep sequencing data.

This tool can be used directly in liquid biopsy, like ctDNA sequencing.

written 4.1 years ago by chen

Yes, in theory it works, but in practice you need to prove it works to high accuracy and works better than other alternatives.

written 4.1 years ago by lh3


We're currently testing it with ~1000 cfDNA samples. I will post an update with the results once it's done.

written 4.1 years ago by chen

Hi Chen,

Were you able to test out 1000 cfDNA samples? I would be very interested in knowing the answer.


written 3.2 years ago by caspase8mach10

lebedana21 wrote, 2.8 years ago:

Hi Chen,

I was trying to run a stable release of MutScan on my data by executing the following command:

./mutscan -1 my_reads.fq -m my_mutations.csv -S 1

I was expecting to see output for variants for which at least a single read matches a mutation specified in my_mutations.csv. Instead, I obtained the following:

No mutation will be scanned
Scanning 0 mutations...
Loaded all of 1000 reads
MutScan didn't find any mutation.

However, I'm sure that there are multiple reads supporting the mutation.

Could you provide some toy data files so that I can run the program, or a command that can be run on the data already present in the repository's /testdata directory?

Thank you in advance

written 2.8 years ago by lebedana21