Question

Plugging output files from phyluce phasing workflow into Structure

0

Entering edit mode

8 months ago

gabe.somarriba • 0

Hello all,

I am currently attempting to revise a genus of fishes among which we believe there has been a great deal of introgression. From our tissue samples we have extracted and sequenced COI, but ended up with a very jumbled phylogeny. In an attempt to develop a better phylogeny, we have sequenced UCE genomic data for our samples. I ran this data through the phyluce pipeline and created a phylogeny that sorted out nearly all of our samples to discrete clades corresponding to morphologically-identified species, but a few specimens came out in odd places. My hypothesis is that these specimens represent members of a hybrid swarm, and I wanted to test it using halpotype data. I proceeded to run the sequences through the phyluce Phasing Workflow, intending to then create Structure plots from the phased data to visualize how much each lineage was contributing to the outlier individuals, but have never used structure or worked with phased data before so I'm a bit stuck.

I attempted to use this tutorial (), but realized that the phased data may need to be manipulated differently, particularly as the R code asks for different input files than what I got from phyluce (see R code for formatting data through Plink for input into Structure below). These are the files I have for each of my specimens after phasing in phyluce:

Specimen1.0.bam

Specimen1.0.bam.bai

Specimen1.1.bam

Specimen1.1.bam.bai

Specimen1.0.changes

Specimen1.0.fasta

Specimen1.0.vcf

This is the code provided by the tutorial I wanted to use, which uses .bim, .bed, and .fam files as inputs (https://datadryad.org/stash/dataset/doi:10.5061/dryad.v8g21pt) :

# Data check and preparation to examine crossbreds with the Structure software
# See video on the Genomics Boot Camp YouTube channel

# Prerequisites:
# 1) Download and install Structure https://web.stanford.edu/group/pritchardlab/structure.html
# 2) Get data https://datadryad.org/stash/dataset/doi:10.5061/dryad.v8g21pt


# Clear workspace and load packages
rm(list = ls())
library(tidyverse)

# Set the location of the working directory
setwd("XXXXXXXXXXXXXXXXXXXXXXXXXXXXX")


# perform Quality control
system(str_c("plink --bfile ADAPTmap_genotypeTOP_20160222_full --chr-set 29 --autosome ",
             "--keep animalsToKeep.txt --nonfounders ",
             "--geno 0.1 --mind 0.1 --maf 0.05 ",
             "--make-bed --out afterQC"))


#################
# Check PCA plot
#################

system("plink --bfile afterQC --chr-set 29 --pca --out plinkPCA")

###
# Visualize PCA results
###

# read in result files
eigenValues <- read_delim("plinkPCA.eigenval", delim = " ", col_names = F)
eigenVectors <- read_delim("plinkPCA.eigenvec", delim = " ", col_names = F)

## Proportion of variation captured by each vector
eigen_percent <- round((eigenValues / (sum(eigenValues))*100), 2)

# PCA plot
ggplot(data = eigenVectors) +
  geom_point(mapping = aes(x = X3, y = X4, color = X1, shape = X1), size = 3, show.legend = TRUE ) +
  geom_hline(yintercept = 0, linetype="dotted") +
  geom_vline(xintercept = 0, linetype="dotted") +
  labs(title = "PCA of selected goat populations",
       x = paste0("Principal component 1 (",eigen_percent[1,1]," %)"),
       y = paste0("Principal component 2 (",eigen_percent[2,1]," %)"),
       colour = "Goat breeds", shape = "Goat breeds") +
  theme_minimal()


# prepare file for the Structure software
system("plink --bfile afterQC --chr-set 29 --recode structure --out forStructure")

I'm thinking that I might need to treat each haplotype as a separate specimen in Structure, but I'm still not 100% clear on how the formatting of the phased data files plugs into the Structure software. I was thinking there might be functionality in Plink to directly use the .vcf files so I don't have to try converting the .vcf to the .bim, .bed, and .fam files required by the tutorial, but I'm thinking that someone with more knowledge of these file types and the software could provide insight or even a better method for visualizing the lineage haplotype data so that I can more precisely determine the ancestry of my outlier specimens.

I've been stuck on this for several months now, so any help or suggestions would be greatly appreciated.

Thank you, Gabe

phyluce uce bam phasing structure • 402 views

ADD COMMENT • link 8 months ago by gabe.somarriba • 0