Question: Complete Genomics data analysis, pipeline version and batch effect
5.3 years ago by
MAPK1.7k wrote:

I am working with Complete Genomics data from pipeline version 2.5. I need to add 1000 Genomes data to my samples and make a multi-genome VCF file. Since the 1000 Genomes Project data are from pipeline version 2.0.0, should I be concerned about this? If there is a batch effect, what would you normally expect to see in CG data from pipeline version 2.0.0 versus 2.5?
Additionally, I would like to know whether the mkvcf tool is the right tool for merging multi-genome data into a combined VCF. Is there a suitable tool for annotating that VCF?

modified 5.3 years ago by Dhana80 • written 5.3 years ago by MAPK1.7k
5.3 years ago by
Helsinki, Finland
Dhana80 wrote:

For the annotation part, you can use the cgatools join command. Since the data also come from Complete Genomics Inc., it will be easier to use cgatools for the most part.

You can use it as follows:

cgatools join --beta \
    --input <file1> <file2> \
    --match <specifications> \
    --overlap <specifications> \
    --select <output_fields_required> \
    --output-mode <arg>

These are the minimum options you have to provide to run the tool.

modified 15 months ago by Ram32k • written 5.3 years ago by Dhana80

Thanks, Dhana. However, I need to merge all the genomes, and it looks like the join tool only takes two files at a time.

written 5.3 years ago by MAPK1.7k

Yes, the join tool takes only two files as input. But that does not limit its use, since the tool can read input from stdin and write output to stdout. You can write a loop in bash or Python to merge all the files.
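
As a rough illustration of such a loop, here is a minimal Python sketch that only builds the chain of pairwise join commands, folding the files left to right through intermediate files; it does not run anything. The helper name build_join_chain, the intermediate file prefix, and the example file names and match/select values are all hypothetical. It also assumes that cgatools join accepts an --output option alongside the options shown above, so check cgatools join --help for your version. Column names may change after each join, so the same match specification may not be valid for every step.

```python
import shlex


def build_join_chain(files, match, select, work_prefix="joined_step"):
    """Build pairwise `cgatools join` commands that fold `files` left to right.

    Each step joins the running result with the next file and writes a new
    intermediate file. Only argument lists are returned; nothing is executed.
    NOTE: `--output` and the reuse of one match/select spec for every step
    are assumptions of this sketch, not confirmed behaviour.
    """
    if len(files) < 2:
        raise ValueError("need at least two files to join")

    commands = []
    current = files[0]
    for step, next_file in enumerate(files[1:], start=1):
        out = f"{work_prefix}_{step}.tsv"  # hypothetical intermediate name
        commands.append([
            "cgatools", "join", "--beta",
            "--input", current, next_file,
            "--match", match,
            "--select", select,
            "--output", out,
        ])
        current = out  # the result becomes the left-hand input of the next step
    return commands


if __name__ == "__main__":
    # Placeholder inputs; replace with real files and specifications.
    demo = build_join_chain(
        ["genomeA.tsv", "genomeB.tsv", "genomeC.tsv"],
        match="chromosome:chromosome",
        select="A.*,B.*",
    )
    for cmd in demo:
        print(shlex.join(cmd))
```

Each printed line can be inspected first and then executed, for example by passing the argument list to subprocess.run(cmd, check=True). Writing intermediate files instead of piping through stdin keeps every step easy to inspect, at the cost of extra disk space.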

written 5.3 years ago by Dhana80