Question: Variant Analysis on MiSeq Data
gkuffel22 (United States) wrote, 4.2 years ago:

Hi everyone,

I am trying to figure out the most efficient method to perform variant analysis on a large dataset. I have 200 samples and forward and reverse reads for a total of 400 fastq files. I was able to load all of these files into Galaxy and create a workflow that looks like this:

FastQ Groomer → Trim → BWA-MEM → flagstat → Generate pileup → Filter pileup


I have now realized that there is no way to loop through or automate my workflow over my fastq files. Is there a better way to do this other than running the workflow 200 times manually? Can I create a script on the command line and use my fastq files as the input? If anyone has any suggestions or is aware of software to handle this type of job, I would really appreciate your help.




Tags: variant analysis, snp

Do you have to use Galaxy? If so, you might want to post on the Galaxy-specific version of this site. If not, you can certainly just write a script to do this (that's what most of us do).

— Devon Ryan, 4.2 years ago

I am completely open to writing a script and leaving Galaxy behind, I just don't know where to start. I have some programming experience (Java, Python); any suggestions would be helpful.

— gkuffel22, 4.2 years ago

Popular options would be shell scripts or a Makefile. You could also use Python, though I imagine that would prove a bit more work. There's also ngsxml, though I have to confess I'm not very familiar with it (the author, Pierre, is a regular here and writes great stuff, so I expect it's good).
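To make the shell-script suggestion concrete, here is a minimal sketch of looping the workflow over every sample. It assumes paired files named `<sample>_R1.fastq` / `<sample>_R2.fastq` in a `fastq/` directory, a reference at `ref/genome.fa`, and that `bwa` and `samtools` are on the PATH; all of those names are placeholders to adapt.

```shell
#!/usr/bin/env bash
# Sketch: run the mapping steps of the Galaxy workflow once per sample.
# File layout, reference path, and tool options are illustrative assumptions.
set -euo pipefail
shopt -s nullglob            # an unmatched glob expands to nothing, not itself

REF=ref/genome.fa            # hypothetical reference; index once with: bwa index "$REF"
mkdir -p bam

for r1 in fastq/*_R1.fastq; do
    sample=$(basename "$r1" _R1.fastq)    # strip the _R1.fastq suffix to get the sample name
    r2="fastq/${sample}_R2.fastq"
    echo "Processing $sample"
    bwa mem "$REF" "$r1" "$r2" | samtools sort -o "bam/${sample}.bam" -
    samtools flagstat "bam/${sample}.bam" > "bam/${sample}.flagstat.txt"
done
```

With 200 samples you could also feed the loop body to GNU parallel or submit one job per sample on a cluster; the filename-derivation trick (`basename "$r1" _R1.fastq`) stays the same either way.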

— Devon Ryan, 4.2 years ago

Do you have a Linux system? I am a bioinformatician and we have a MiSeq and a HiSeq. I have written a lot of shell scripts designed for Illumina reads: filtration, alignment, variant calling. Would you like me to share them?

— Paul, 4.2 years ago
Yahan wrote, 4.2 years ago:

Assuming that you are working in a grid environment, what we use is bpipe, an excellent tool for developing pipelines and workflows. You could use it to perform the different steps needed to arrive at your SNP calling. One of the advantages of bpipe is that it manages the parallelisation of the different steps for you. The documentation includes an example of a SNP-calling workflow.

It also depends on which SNP caller you want to use. If you use samtools or GATK on read mappings in BAM files, you will need a full pipeline that does read mapping, sorting, duplicate removal, realignment, indexing, etc.
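A per-sample version of that full pipeline might be sketched as the shell function below. The paths, read-group string, and choice of samtools/bcftools for the calling step are assumptions, not a prescribed setup; check the options against the versions you have installed.

```shell
#!/usr/bin/env bash
# Sketch: map, sort, remove duplicates, index, and call variants for one sample.
# Directory layout (fastq/, bam/, vcf/) and tool options are illustrative.
set -euo pipefail

call_sample() {
    local sample=$1 ref=$2
    # Map with a read group so the sample is identifiable downstream, then sort.
    bwa mem -R "@RG\tID:${sample}\tSM:${sample}" "$ref" \
        "fastq/${sample}_R1.fastq" "fastq/${sample}_R2.fastq" \
        | samtools sort -o "bam/${sample}.sorted.bam" -
    # Remove PCR duplicates and index the result.
    samtools rmdup "bam/${sample}.sorted.bam" "bam/${sample}.dedup.bam"
    samtools index "bam/${sample}.dedup.bam"
    # Call variants with samtools/bcftools (GATK would slot in here instead).
    samtools mpileup -uf "$ref" "bam/${sample}.dedup.bam" \
        | bcftools call -mv > "vcf/${sample}.vcf"
}

# Example invocation (uncomment once the paths exist):
# call_sample sample01 ref/genome.fa
```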

However, discoSnp is an interesting alternative that does SNP calling without a reference. This would limit your preprocessing to quality trimming, after which you can do the calling in a single command line that includes all your samples. I'm not sure how it performs on 200 samples, though. It does not support paired-end reads, so you would also have to merge your paired data into one fastq per sample, but maybe that's a trade-off you are willing to accept given how much it simplifies the task.
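The pair-merging step is just a concatenation, since (per the above) the paired-end information is not used. A sketch, again assuming `<sample>_R1.fastq` / `<sample>_R2.fastq` naming in a `fastq/` directory:

```shell
#!/usr/bin/env bash
# Sketch: collapse each R1/R2 pair into one per-sample fastq for discoSnp.
# The fastq/ layout and _R1/_R2 suffixes are assumed naming conventions.
set -euo pipefail
shopt -s nullglob
mkdir -p merged

for r1 in fastq/*_R1.fastq; do
    sample=$(basename "$r1" _R1.fastq)
    cat "$r1" "fastq/${sample}_R2.fastq" > "merged/${sample}.fastq"
done
```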


Zaag wrote, 4.2 years ago:

I only use Galaxy for small jobs, but under Workflow Control you can select an Input Dataset Collection to run a workflow over many inputs at once; maybe that helps.

Powered by Biostar version 2.3.0