Question: Getting data into Plink! and TreeMix
1
gravatar for hermathena
5.1 years ago by
hermathena40
United Kingdom
hermathena40 wrote:

Dear All,

I am trying to get non-model data into the software TreeMix (https://code.google.com/p/treemix/). I have data from multiple species, 1-10 individuals for each. I start with a VCF of biallelic SNPs from GATK and generate Plink input with vcftools:

vcftools --vcf exon.biallelicSNPs.vcf --plink --out treemix.plink --keep some_individuals.txt

I then run Plink, using a cluster file that assigns individuals to their species:

plink --file treemix.plink --freq --noweb --missing --within treemix.data.clust

Several warnings like this are produced - not sure what they really mean:

Duplicate marker name found: [ 1347331 ]

Anyway, there is a plink.frq.strat output file. I gzip that and use the auxiliary script (https://code.google.com/p/treemix/downloads/detail?name=plink2treemix.py) of TreeMix to make their input files. I get the error:

Traceback (most recent call last):

  File "plink2treemix.py", line 24, in <module>

    total = line[7]

IndexError: list index out of range

 

Hmmmm. Really not sure why! If anyone has experience of getting data into treeMix, especially for non-human taxa where the earlier steps may have been the problem, I would be very grateful!!!

Many thanks,

 

snp next-gen genome • 7.2k views
ADD COMMENTlink modified 13 months ago by adrian.casanova.chiclana0 • written 5.1 years ago by hermathena40
1

hi, have you solved this problem ? I have same problem. Thank you

ADD REPLYlink written 2.8 years ago by zhengchenfei60

I have managed to use TreeMix. The way I did it is that I had illumina reads from different isolates of the same species, so I mapped each of my library to the same reference individually and called SNPs using the PoPoolation toolkit. I then had quick script to change the PoPoolation output to the TreeMix standard.

ADD REPLYlink written 5.0 years ago by Adrian Pelin2.3k

Hi all,

I have been the same problem, though the error is:

Traceback (most recent call last):
  File "./plink2treemix.py", line 23, in <module>
    mc = line[6]

Did anybody solve it?

Felipe

ADD REPLYlink modified 9 months ago by RamRS24k • written 21 months ago by felipe_o_torquato0
1
gravatar for hermathena
2.7 years ago by
hermathena40
United Kingdom
hermathena40 wrote:

Hi, Plink v. 1.9 does a very good job. You can ask it to read in vcf and --make-bed:

/bin/plink2/plink --vcf input.vcf --make-bed --geno 0.25 --maf 0.01 --snps-only --out input.qual --allow-extra-chr --within my_clusters.clust

Unfortunately the resulting .bim file has periods instead of SNP names, which can be a problem for LD pruning etc. It can be easily fixed with awk:

awk '{$2 = $1":"$4; print}' input.bim > snp_names.bim

After that you can filter in Plink, and follow Joe Pickrell's instructions - make a frequencies file, re-format for TreeMix. Glitches like the one above can be fixed manually, occasionally there is a single incorrect line - presumably you'll have enough data that one removed SNP wouldn't matter (unless you are looking for associations, etc).

ADD COMMENTlink modified 9 months ago by RamRS24k • written 2.7 years ago by hermathena40

Hi kmkozak

Can you detail your answer here by providing all the command line you've been using? I'm currently struggling to convert my bed/map files into Treemix inputs using the dedicated 'plink2treemix.py' software provided onto the Treemix webpage.

ADD REPLYlink written 2.2 years ago by BioCh'ti0
0
gravatar for Gabriel R.
19 months ago by
Gabriel R.2.6k
Center for Geogenetik Københavns Universitet
Gabriel R.2.6k wrote:

I posted this to another question, it is one line in glactools:

glactools bplink2acf --fai reference.fai [PLINK prefix] |  glactools acf2treemix -  |gzip > treemix.gz
ADD COMMENTlink written 19 months ago by Gabriel R.2.6k
0
gravatar for José Cerca
17 months ago by
Oslo
José Cerca0 wrote:

Hi,

I just posted a similar reply in researchgate. I will copy it here. Unfortunately I could not use glactools from Gabriel because I have no reference.fai (my vcf was generated with STACKS denovo).

So, I started with a .vcf file and converted it to the custom plink format. Treemix requires a specifically formatted file. The website says that in order to create this file you need to run:

$ plink --bfile data --freq --missing --within data.clust 
$ gzip plink.frq
$ plink2treemix.py plink.frq.gz treemix.frq.gz

The first codeline ($1) requires a .clst file. .clst files are generated with --write-cluster --family and it consists in three columns including 1. Family ID; 2. ID; 3. cluster. I used several awk commands to format the information properly.

The first codeline ($1) produces a .frq file which according to plink2's webiste includes:

I. CHR [Chromosome code]
II. SNP [Variant identifier]
III. A1 [Allele 1 (usually minor)]
IV. A2 [Allele 2 (usually major)]
V. MAF [Allele 1 frequency]
VI. NCHROBS [Number of allele observations]

So, here start the problems. If you open the provided python file:

$ nano plink2treemix.py

You will notice that the script requires that the file has 8 columns (remember that python starts counting in 0. mc = line[6], total = line[7] are columns 7 and 8 respectively). I think this might be a problem with the plink versions. I was using plink2 (v1.9) and the program was written when plink1 (v1.07) was around. So if you hit with the following error:

ERROR: treemix mc = line[6] indexerror: list index out of range

I expect it to be a mismatch between plinks versions.

So then I switched to plink1 and it complained that the SNPs had similar names. Because my data was generated with STACKS denovo, the SNPs do not have a position in the genome and <mydata.bim> had . for SNP. I used awk to solve this:

$ awk '{$2 = $1"_"$4; print}' mydata.bim > mydata.SNPrenamed.bim ################ so, we change the second column ($2) to be a combination including column 1, an underscore and column 4 ( $1"_"$4 ).

Plink kept complaining so I decided to take a new approach.

I noticed that the POPULATIONS mode as part of STACKS 1 had a treemix flag.

I created a population map for stacks, which is a file composed of two columns, the first one with individuals and the second with populations:

IND_01 * TAB * POP_A
IND_02 * TAB * POP_A
IND_100 * TAB * POP_C

So I just went on STACKS v1.48 and did:

/path/to/stacks-1.48/bin/populations --in_vcf mydata.vcf --treemix -O ./ -M pop_map.tsv

--in_vcf : input vcf
--treemix : generate a treemix output
-O : output folder
-M : population map

And it worked!

ADD COMMENTlink modified 9 months ago by RamRS24k • written 17 months ago by José Cerca0

Which is the correct output you get?

ADD REPLYlink written 23 days ago by mgpastos0
0
gravatar for adrian.casanova.chiclana
13 months ago by

Hi,

I had the same problem. I began with my .vcf from STACKS 1.48, de novo pipeline.

I used vcftools to count alleles at each population (I have six), with this command:

vcftools --vcf batch_23001.vcf --counts --keep AG1 --out AG1 (x6 different populations)

AG1 after --keep is a popmap belonging to one population with this format:

Sample1
Sample2
...

Finally, with these vcftools outputs (.frq.count), I used LibreOffice Calc and gedit (text editor) to did the input for TREEMIX.

ADD COMMENTlink modified 9 months ago by RamRS24k • written 13 months ago by adrian.casanova.chiclana0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1582 users visited in the last hour