Question: 012 genotype matrix using vcf tools (converting rows to columns)
gravatar for Ana
9 months ago by
Ana100 wrote:

Hello everyone, I have created a 012 genotype matrix by using --012 command in vcf tools which give individuals in row and SNPs in columns. I want to run smartpca and and need to convert rows to columns. Means get SNPs in rows and individuals in columsn. I have tried some shel commands like this but seems it does not work.

awk ' { 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    } } NF>p { p = NF } END {    
    for(j=1; j<=p; j++) {
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        print str
    } }'

I wonder if anyone has any idea how to do that? Is there any way to do this manipulation in R? but R cannot read my output genotype matrix! thanks

genotype matrix vcf-tools • 762 views
ADD COMMENTlink modified 9 months ago by Kevin Blighe21k • written 9 months ago by Ana100

Please provide a few lines of your matrix (with headers). Try datamash (GNU tool available in most of the linux repos) to transpose data easy. input:

  $ cat test.txt 
    Sample  id-123  id-99   id-42   id-13
    Count   1002    990 2030    599


$ datamash transpose < test.txt  
Sample  Count
id-123  1002
id-99   990
id-42   2030
id-13   599
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad01126.3k
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

You just literally want to transpose the matrix?

It works for me in R Programming Language, just use the t() function:

df <- t(read.table("out.012", header=FALSE))

This will give you the 012 matrix without rownames or colnames. The use of the [-1,] sub-setting just gets rid of the uninformative header that VCFtools outputs with -012

To add the colnames, just use:

colnames(df) <- read.table("out.012.indv")[,1]

I will let you do the rownames for yourself.

ADD COMMENTlink modified 9 months ago • written 9 months ago by Kevin Blighe21k

Hi @Kevin Blighe, thank you for your response. Well are you sure that you could transpose 012 genotype matrix created by vcf-tools in R? This is my out pot from vcf file (012-out)

> 0 -1  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   -1  0   0   0   0   0   1   0   0   0   0   0   0   -1  0   -1  -1  -1  0   0   -1  -1  -1

and R cannot read this file.I use this just to see

> df <- read.table(file = "012-out.txt", header=FALSE)
> head(df)

and the output doesnot make sense

> V636126 V636127 V636128 V636129 V636130 V636131 V636132 V636133
> V636134 V636135 V636136 V636137 V636138 V636139 V636140 V636141
> V636142 V636143 V636144 V636145 V636146 V636147 V636148 V636149
> V636150 V636151 V636152 V636153 V636154 V636155 V636156 V636157
> V636158 V636159 V636160 V636161 V636162 V636163 V636164 V636165
> V636166 V636167 V636168 V636169

I do not know what is wrong here and how can I solve it! I would sincerely appreciate if you could share your experience with me to get this done.

ADD REPLYlink modified 9 months ago • written 9 months ago by Ana100

Yes, it genuinely works for me. I use VCFtools (0.1.15) and Microsoft R Open 3.2.5. You haven't done any conversions on the file after it is produced by VCFtools?

Your output above with the 'V' prefix is just the colnames that R automatically generates. I assume that there are 636,169 variants in your VCF?

Also, just to confirm, how many individuals are in your VCF? - 1 or more? The number of rows in the output file from VCFtools should match the number of samples in the VCF.

If you have >1 individuals, the line to read-in and transpose is:

df <- t(read.table("012-out.txt", header=FALSE))

If you just have 1 individual, then this works:

df <- t(read.table("012-out.txt", header=FALSE))
ADD REPLYlink modified 9 months ago • written 9 months ago by Kevin Blighe21k

Hi @Kevin Blighe, Thanks. The the example above is just with subset of the data. yes! you are right, did you also get output with "V" prefix? By the way I found the solution, It works ... thanks so much

ADD REPLYlink modified 9 months ago • written 9 months ago by Ana100

No, if I do the command exactly as I have it above (df <- t(read.table("out.012", header=FALSE))), and then run head(df), I get:

   [,1] [,2] [,3] [,4]
V1    0    1    2    3
V2    2    0    1    1
V3    2    2    2    2
V4   -1    2   -1   -1
V5    2   -1   -1   -1
V6   -1   -1   -1    2

This was a VCF file of 4 individuals and ~300,000 variants.

With a single individual, I get:

V1    0
V2    2
V3    2
V4   -1
V5    2
V6   -1

I also tried it on a VCF with >600,000 variants and it still functioned fine.

Which was the solution that worked?

ADD REPLYlink written 9 months ago by Kevin Blighe21k

So I created this little Rscript and got my desired output, thanks Kevin for sharing your experience with me:

>     #!/usr/bin/Rscript
>     snps<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1._replaced_9.geno.table.012',header=F,row.names=1)
>     pos<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.pos',header=F)
>     indv<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.indv',header=F)
>     colnames(snps)<-paste(pos[,1],pos[,2],sep=':')
>     rownames(snps)<-indv[,1]
>     snps<-as.matrix(snps)
>     snps.convert<-t(snps)
>     write.table(snps.convert, file= "snps.convert_all_pruned", col.names = FALSE, row.names = TRUE)
ADD REPLYlink modified 9 months ago • written 9 months ago by Ana100

That's great! - good luck with the analysis

ADD REPLYlink written 9 months ago by Kevin Blighe21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1636 users visited in the last hour