Question

help with weird PCA? (vcfR)

5 months ago

MaeBH • 0

Hi everyone, very new to bioinformatics

I have a SNP dataset (.vcf) that I used to make a PCA plot in RStudio (vcfR package), and it gave me interesting clustering. I then filtered the dataset with vcftools in a Linux terminal (missing data: 0.80; MAF: 0.05), but the PCA still has weird straight arms and a low % of variance explained.

Just wondering a couple of things:

What might be causing this?
How can I get rid of it?
Would this affect downstream analysis?

Any help would be greatly appreciated

PCA from raw data

vcftools vcfR R • 883 views

Could you better describe your data? How many variants would you be left with if you removed all the sites with missing data?

Is this sequencing data? In that case you should also filter for read depths and genotype qualities.
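To make the depth and genotype-quality filtering concrete, here is a rough Python sketch of what a per-genotype filter does: calls below a depth or quality threshold are set to missing rather than the whole site being dropped. As far as I know this is also how the vcftools `--minDP` and `--minGQ` options behave. The function name, the thresholds, and the example line are made up for illustration, and treating an absent DP or GQ as a failure is an assumption.

```python
# Sketch only: mask genotype calls that fail per-sample depth/quality thresholds.
# Thresholds and the example line are illustrative, not taken from the thread.

def mask_low_quality(vcf_line, min_dp=10, min_gq=20):
    """Return a VCF data line with failing genotypes set to missing (./.)."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")              # FORMAT column, e.g. GT:DP:GQ
    masked = fields[:9]                      # fixed columns are kept unchanged
    for sample in fields[9:]:
        values = dict(zip(keys, sample.split(":")))
        dp, gq = values.get("DP", "."), values.get("GQ", ".")
        # Assumption: a call with no DP or GQ value is treated as failing.
        failed = "." in (dp, gq) or float(dp) < min_dp or float(gq) < min_gq
        if failed and "GT" in values:
            values["GT"] = "./."
        masked.append(":".join(values.get(k, ".") for k in keys))
    return "\t".join(masked)


line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:25:99\t1/1:4:12\t0/0:12:.\n"
print(mask_low_quality(line))
```

In this toy line the second sample fails on depth and the third has no GQ value, so both end up as `./.`. Because masked calls count as missing data, a strict missing-data filter applied afterwards can remove far more sites than it did before.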

5 months ago by barslmn ★ 2.1k

Hi thank you both for your responses :)

Yes, this is sequencing data (de novo ddRADseq). The data is from diploid plant material sampled across a landscape. Each population is a family composed of a maternal plant and her progeny.

Initial attributes: (shown in PCA above)

number of samples: 142
number of SNPs: 531,544

Have now filtered the data with vcftools using the following criteria (missing data: 0.80; MAF: 0.5; minGQ: 0.9; minDP: 10):

number of samples: 142
number of SNPs: 1509

[image]

5 months ago by MaeBH • 0

It seems like an improvement but variant count dropped drastically :? What is making it drop so much?

5 months ago by barslmn ★ 2.1k

I suspect it is because the proportion of missing data is pretty high in many of the samples. I haven't been able to make a PCA with the reduced number of samples (it now says the vector numbers are incorrect).
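One way to check the per-sample missingness hunch is to count missing genotypes for each sample directly from the VCF. The Python sketch below does that on a tiny made-up VCF; the sample names and the function name are hypothetical, and it assumes GT is the first FORMAT field. I believe the vcftools `--missing-indv` option gives a similar per-individual summary.

```python
import io

def sample_missingness(vcf_handle):
    """Return {sample: fraction of sites where the genotype is missing}."""
    samples, missing, n_sites = [], [], 0
    for line in vcf_handle:
        if not line.strip() or line.startswith("##"):
            continue                          # skip blank and meta lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]              # sample names follow FORMAT
            missing = [0] * len(samples)
            continue
        n_sites += 1
        for i, call in enumerate(fields[9:]):
            gt = call.split(":")[0]           # assumes GT is the first field
            if "." in gt:                     # ./. or partially missing call
                missing[i] += 1
    if n_sites == 0:
        return {}
    return {s: m / n_sites for s, m in zip(samples, missing)}


# Toy VCF with hypothetical sample names (one mother, two progeny).
toy_vcf = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmum1\tkid1\tkid2
chr1\t10\t.\tA\tG\t.\t.\t.\tGT\t0/1\t./.\t0/0
chr1\t20\t.\tC\tT\t.\t.\t.\tGT\t1/1\t./.\t./.
chr1\t30\t.\tG\tA\t.\t.\t.\tGT\t0/0\t0/1\t./.
chr1\t40\t.\tT\tC\t.\t.\t.\tGT\t0/1\t./.\t0/1
"""

for name, frac in sample_missingness(io.StringIO(toy_vcf)).items():
    print(f"{name}\t{frac:.2f}")
```

Samples with very high missingness can then be dropped before the site-level filter is applied, which may leave more SNPs than filtering sites alone.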

5 months ago by MaeBH • 0

I agree with the other commenter - we need more information, especially a rough idea of the number of variants before and after filtering. Also, what is the relationship between the populations? What is the depth of the sequencing?

Given that most SNP tools only emit information about variant sites, this could just indicate that there is some variation unique to each population. However, given the low PC inertia, I would guess that most variable sites are either quite variable both within and among populations, or that there are many variants unique to only a few individuals (and not in the same population, as that would increase PC inertia).
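A quick way to see whether many variants are restricted to a few individuals is to tally, for each site, how many samples carry a non-reference allele. The Python sketch below does this on made-up genotype strings; the function name and the data are illustrative only.

```python
from collections import Counter

def carrier_histogram(genotypes_per_site):
    """Map 'number of samples carrying a non-reference allele' to a site count."""
    hist = Counter()
    for calls in genotypes_per_site:
        carriers = 0
        for gt in calls:
            alleles = gt.replace("|", "/").split("/")
            # A carrier has at least one allele that is neither REF nor missing.
            if any(a not in ("0", ".") for a in alleles):
                carriers += 1
        hist[carriers] += 1
    return dict(hist)


# Toy data: four sites, four samples each.
sites = [
    ["0/1", "0/0", "0/0", "0/0"],   # ALT seen in one sample
    ["0/0", "1/1", "0/0", "./."],   # ALT seen in one sample
    ["0/1", "0/1", "1/1", "0/1"],   # ALT seen in all four samples
    ["0/0", "0/1", "0/1", "0/0"],   # ALT seen in two samples
]
print(carrier_histogram(sites))
```

A histogram dominated by sites with one or two carriers would fit the "variants unique to only a few individuals" explanation.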

5 months ago by dthorbur ★ 1.9k

What would be the best way to test how variable the sites are and/or the number of unique variants? Sorry if the questions are really basic, still trying to wrap my head around everything.

5 months ago by MaeBH • 0

I can't think of any tools off the top of my head, but I'm sure they exist. If you were to do it manually, you could parse the VCF's genotype field and annotate, for each variant, which populations carry it and in how many individuals. With a decent grasp of R and knowledge of what the GT field denotes, you shouldn't have a problem doing this. I haven't used ddRADseq, so I don't know whether it's appropriate, but a structure barplot would be another way of visualizing the distribution of SNPs. See this tutorial.
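The manual approach could look something like the sketch below. It is written in Python rather than R just to keep it self-contained, and the sample names, population labels, and function names are all hypothetical. For one site it counts the carriers of a non-reference allele in each population, then flags the variant as private if only one population carries it.

```python
from collections import defaultdict

def carriers_by_population(calls, sample_to_pop):
    """Count non-reference carriers per population at a single site.

    calls: {sample: GT string}; sample_to_pop: {sample: population label}.
    """
    counts = defaultdict(int)
    for sample, gt in calls.items():
        alleles = gt.replace("|", "/").split("/")
        if any(a not in ("0", ".") for a in alleles):
            counts[sample_to_pop[sample]] += 1
    return dict(counts)

def is_private(counts):
    """True when exactly one population carries the variant."""
    return len(counts) == 1


# Hypothetical families: each population is a mother plus progeny.
sample_to_pop = {
    "famA_mum": "A", "famA_kid1": "A",
    "famB_mum": "B", "famB_kid1": "B",
}
site1 = {"famA_mum": "0/1", "famA_kid1": "1/1", "famB_mum": "0/0", "famB_kid1": "./."}
site2 = {"famA_mum": "0/1", "famA_kid1": "0/0", "famB_mum": "0/1", "famB_kid1": "0/1"}

for name, site in (("site1", site1), ("site2", site2)):
    counts = carriers_by_population(site, sample_to_pop)
    print(name, counts, "private" if is_private(counts) else "shared")
```

Run over every site in the VCF, this gives the number of private versus shared variants per population.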

Overall, I think your data generally looks okay. Hard to tell with the colours, but your populations are generally clustering together.

5 months ago by dthorbur ★ 1.9k