Question: 012 genotype matrix using vcf tools (converting rows to columns)
gravatar for Ana
2.3 years ago by
Ana170 wrote:

Hello everyone, I have created a 012 genotype matrix by using --012 command in vcf tools which give individuals in row and SNPs in columns. I want to run smartpca and and need to convert rows to columns. Means get SNPs in rows and individuals in columsn. I have tried some shel commands like this but seems it does not work.

awk ' { 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    } } NF>p { p = NF } END {    
    for(j=1; j<=p; j++) {
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        print str
    } }'

I wonder if anyone has any idea how to do that? Is there any way to do this manipulation in R? but R cannot read my output genotype matrix! thanks

genotype matrix vcf-tools • 1.8k views
ADD COMMENTlink modified 2.3 years ago by Kevin Blighe52k • written 2.3 years ago by Ana170

Please provide a few lines of your matrix (with headers). Try datamash (GNU tool available in most of the linux repos) to transpose data easy. input:

  $ cat test.txt 
    Sample  id-123  id-99   id-42   id-13
    Count   1002    990 2030    599


$ datamash transpose < test.txt  
Sample  Count
id-123  1002
id-99   990
id-42   2030
id-13   599
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011212k
gravatar for Kevin Blighe
2.3 years ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

You just literally want to transpose the matrix?

It works for me in R Programming Language, just use the t() function:

df <- t(read.table("out.012", header=FALSE))

This will give you the 012 matrix without rownames or colnames. The use of the [-1,] sub-setting just gets rid of the uninformative header that VCFtools outputs with -012

To add the colnames, just use:

colnames(df) <- read.table("out.012.indv")[,1]

I will let you do the rownames for yourself.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Kevin Blighe52k

Hi @Kevin Blighe, thank you for your response. Well are you sure that you could transpose 012 genotype matrix created by vcf-tools in R? This is my out pot from vcf file (012-out)

> 0 -1  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   -1  0   0   0   0   0   1   0   0   0   0   0   0   -1  0   -1  -1  -1  0   0   -1  -1  -1

and R cannot read this file.I use this just to see

> df <- read.table(file = "012-out.txt", header=FALSE)
> head(df)

and the output doesnot make sense

> V636126 V636127 V636128 V636129 V636130 V636131 V636132 V636133
> V636134 V636135 V636136 V636137 V636138 V636139 V636140 V636141
> V636142 V636143 V636144 V636145 V636146 V636147 V636148 V636149
> V636150 V636151 V636152 V636153 V636154 V636155 V636156 V636157
> V636158 V636159 V636160 V636161 V636162 V636163 V636164 V636165
> V636166 V636167 V636168 V636169

I do not know what is wrong here and how can I solve it! I would sincerely appreciate if you could share your experience with me to get this done.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Ana170

Yes, it genuinely works for me. I use VCFtools (0.1.15) and Microsoft R Open 3.2.5. You haven't done any conversions on the file after it is produced by VCFtools?

Your output above with the 'V' prefix is just the colnames that R automatically generates. I assume that there are 636,169 variants in your VCF?

Also, just to confirm, how many individuals are in your VCF? - 1 or more? The number of rows in the output file from VCFtools should match the number of samples in the VCF.

If you have >1 individuals, the line to read-in and transpose is:

df <- t(read.table("012-out.txt", header=FALSE))

If you just have 1 individual, then this works:

df <- t(read.table("012-out.txt", header=FALSE))
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Kevin Blighe52k

Hi @Kevin Blighe, Thanks. The the example above is just with subset of the data. yes! you are right, did you also get output with "V" prefix? By the way I found the solution, It works ... thanks so much

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Ana170

No, if I do the command exactly as I have it above (df <- t(read.table("out.012", header=FALSE))), and then run head(df), I get:

   [,1] [,2] [,3] [,4]
V1    0    1    2    3
V2    2    0    1    1
V3    2    2    2    2
V4   -1    2   -1   -1
V5    2   -1   -1   -1
V6   -1   -1   -1    2

This was a VCF file of 4 individuals and ~300,000 variants.

With a single individual, I get:

V1    0
V2    2
V3    2
V4   -1
V5    2
V6   -1

I also tried it on a VCF with >600,000 variants and it still functioned fine.

Which was the solution that worked?

ADD REPLYlink written 2.3 years ago by Kevin Blighe52k

So I created this little Rscript and got my desired output, thanks Kevin for sharing your experience with me:

>     #!/usr/bin/Rscript
>     snps<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1._replaced_9.geno.table.012',header=F,row.names=1)
>     pos<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.pos',header=F)
>     indv<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.indv',header=F)
>     colnames(snps)<-paste(pos[,1],pos[,2],sep=':')
>     rownames(snps)<-indv[,1]
>     snps<-as.matrix(snps)
>     snps.convert<-t(snps)
>     write.table(snps.convert, file= "snps.convert_all_pruned", col.names = FALSE, row.names = TRUE)
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Ana170

That's great! - good luck with the analysis

ADD REPLYlink written 2.2 years ago by Kevin Blighe52k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1206 users visited in the last hour