Converting PLINK to EIGENSTRAT error using convertf (no valid snps)
6.1 years ago
beausoleilmo ▴ 510

I'm trying to convert a set of PLINK files to EIGENSTRAT, but I have this error:

$ convertf -p par.PED.EIGENSTRAT
parameter file: par.PED.EIGENSTRAT
outputformat: EIGENSTRAT
familynames: NO
warning (mapfile): bad chrom: 0 scaffold440:420 0   420
warning (mapfile): bad chrom: 0 scaffold440:451 0   451
warning (mapfile): bad chrom: 0 scaffold440:452 0   452
warning (mapfile): bad chrom: 0 scaffold440:460 0   460
warning (mapfile): bad chrom: 0 Scaffold1210:18 0   18
warning (mapfile): bad chrom: 0 Scaffold1210:23 0   23
warning (mapfile): bad chrom: 0 Scaffold1210:30 0   30
warning (mapfile): bad chrom: 0 scaffold420:41  0   41
warning (mapfile): bad chrom: 0 scaffold420:186 0   186
warning (mapfile): bad chrom: 0 scaffold420:264 0   264
genetic distance set from physical distance
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 120643 12064280
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 121325 12132527
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 122443 12244315
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 120497 12049703
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 117650 11765020
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 117683 11768331
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 118269 11826876
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 118381 11838067
snp order check fail (gdis order != physdis order): output_in_plink.map (processing continues)   99 118475 11847481
fatalx:
no valid snps
Aborted


Why is it saying that I have no valid SNPs?

I think it's because

1. I don't have the chromosome names and
2. I don't have the genetic distances (in centimorgans)

Is there a way that I can fix that?

Is there another way to compute a genetic PCA? Can I run a genetic PCA only on the 012 file from VCFtools?

EIGENSTRAT PLINK SNP PCA • 5.1k views
0

How do we identify these parameters?

0

Hello, I have the same problem as you, but I don't understand how you finally solved the format conversion problem. I want to run three groups of tests and I am stuck at this conversion step. Could you help me? I can't thank you enough!

0

Hi, I have the same problem as you. I'd like to know how you finally solved this problem. Could you help me or give me some suggestions about this? Thanks in advance.

1
6.1 years ago
Vincent Laufer ★ 2.6k

Based on examination of that output there are 2 problems.

1) You have unallowable chromosome designations. In the above output you have

warning (mapfile): bad chrom: 0 scaffold440:452 0   452
warning (mapfile): bad chrom: 0 scaffold440:460 0   460


which corresponds to chr 0 (there isn't any such chromosome; we start numbering at 1).

For that issue you have a couple options. You can either go back through the data files provided by the company from which you bought the genotyping or sequencing arrays, find the correct chromosome and position, and reassign them, or you can get rid of those entries.

I would do a quick search to see how many of those there are. If there are, for example, 100 of 1,000,000 SNPs labelled as 0, I might just remove those SNPs (using Plink) and rerun.
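
The quick search suggested above can be sketched in Python. This is a minimal sketch, assuming the `.map` file is whitespace-separated with the chromosome code in the first column and the SNP ID in the second, as in the warnings shown in the question. The function names `count_chromosomes` and `unplaced_snp_ids` are hypothetical helpers, not part of PLINK or EIGENSOFT.

```python
from collections import Counter


def count_chromosomes(map_path):
    """Count SNPs per chromosome code (first column) in a PLINK .map file."""
    counts = Counter()
    with open(map_path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            counts[fields[0]] += 1
    return counts


def unplaced_snp_ids(map_path, bad_codes=("0",)):
    """Return the IDs (second column) of SNPs whose chromosome code is in bad_codes."""
    ids = []
    with open(map_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2 and fields[0] in bad_codes:
                ids.append(fields[1])
    return ids
```

If only a small fraction of SNPs come back with code 0, the IDs could be written one per line to a text file and passed to PLINK's `--exclude` option. If every SNP comes back with code 0, removal is not an option and the chromosome column itself is the thing to address.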

2) The next problem is that the ordering of genetic distances in the map file does not correspond to the ordering of physical distances. In this case it looks like that is because, as you say, you don't have genetic distances assigned. It is actually tough to tell what is going on there based on the output provided. It looks like there are only 3 columns instead of the normal 4, so it is possible that the issue is with the input formatting rather than with the assignment of distances; I am not sure. Genetic distances can be obtained from a couple of sources, for instance HapMap, or you can find a different PCA program that will compute PCs without genetic distances (there are several).

I am not sure if you can run a PCA from VCFtools, but there are now several programs that will compute PCs for you directly on a VCF file. If you want more options please let me know.

0

I understand, but the genome of the organisms that I study is not assembled into chromosomes (we just don't have this information), so I can't get the "real" physical distance. That's why I have chromosome = 0 and distance = 0. But is there a way to compute a PCA without this information?

1

OK I see. I do not have much direct experience with such a scenario ... so I think in this case the best bet might be to read Price et al. 2006 on PCA for genetic data to get a theoretical basis. I am not certain to what degree you do or do not have a reference or assembly, for instance, so it is hard to advise you.

Theoretically you could compute PCs without that information, but there would be a lot of complications that could arise depending on the above issues. I'll try to write more tonight.
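
To illustrate why PCs can in principle be computed without chromosome or distance information, here is a minimal pure-Python sketch that works only on a genotype matrix. It assumes rows are individuals, columns are SNPs, genotypes are coded 0/1/2, and missing calls are coded -1 (the coding assumed here for a VCFtools 012 matrix). It mean-centres each SNP and finds the first PC by power iteration on the individual-by-individual matrix. It is not the EIGENSTRAT algorithm, and `center_genotypes` and `first_pc` are hypothetical names.

```python
import math
import random


def center_genotypes(genotypes, missing=-1):
    """Mean-centre each SNP column; missing calls are set to 0 after centring."""
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    centred = [[0.0] * n_snp for _ in range(n_ind)]
    for j in range(n_snp):
        observed = [row[j] for row in genotypes if row[j] != missing]
        mean = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_ind):
            if genotypes[i][j] != missing:
                centred[i][j] = genotypes[i][j] - mean
    return centred


def first_pc(genotypes, missing=-1, n_iter=200, seed=0):
    """Return PC1 coordinates of the individuals, up to scale and sign."""
    x = center_genotypes(genotypes, missing)
    n = len(x)
    # Individual-by-individual Gram matrix; SNP order and position play no role.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # no variation left to extract
        v = [c / norm for c in w]
    return v
```

Nothing in this sketch reads a chromosome name or a genetic distance, which is the point of the example. For real data, a dedicated package with proper normalisation and handling of missing data would be the better choice.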

0

This is the maximum information I have on my genome: http://gigadb.org/dataset/100040.

0

1) How many genomes do you have (how many organisms)? 2) If you have more than 1, are any of the finches related?

0

1) I have 1 reference genome and a total of 96 individuals, from 4 species in total. 2) They are totally related: they are introgressive species (a species complex), both A) genetically and B) morphologically. But there is enough distinction to make a PCA on their morphology.

1

Thank you very much for your reply. Alkes Price, the author of the 2006 paper on using PCA on Genome Wide chip data, wrote another paper in 2010. This paper argues the following key point:

"GWAS can be confounded by population stratification—systematic ancestry differences between cases and controls—which has previously been addressed by methods that infer genetic ancestry. Those methods perform well in data sets in which population structure is the only kind of structure present, but are inadequate in data sets that also contain family structure or cryptic relatedness. Here, we review recent progress on methods that correct for stratification while accounting for these additional complexities."

The manuscript is here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2975875/

After the publication of this review, several additional manuscripts have been published detailing the use of random effects models. Many of these papers can be found by searching for "papers that cite the above paper." It strikes me that, for your sample, these models might give you more meaningful results than a PCA would, depending on your questions of interest.

Regardless of your final approach, I would search around for extra genomes in those 4 species and potentially even other related species or subspecies that could be included in the analysis. Otherwise, the PCA or other method you ultimately decide on will only measure the variation in your dataset, without measuring a potentially larger body of variation and placing your samples within that expanded space.

Also, have you come up with a consensus list of only SNPs appearing in all 4 species? For instance a given SNP might be monomorphic in 2 species but polymorphic in 2 species. Presumably if all the data are gathered into one VCF file then it can still be run but a lot of times people extract only a fraction of all the variants and then run the PCA only on that subset. It strikes me that this might be slightly more difficult and more important in your study.
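
One reading of the consensus list described above is "keep only SNPs that are polymorphic within every species". A minimal sketch of that filter follows, assuming a 0/1/2 genotype matrix (rows are individuals, columns are SNPs, -1 for missing) and one species label per individual. The names `is_polymorphic` and `snps_polymorphic_in_every_group` are hypothetical, and other definitions of "consensus" (for example, SNPs merely called in every species) would need a different test.

```python
def is_polymorphic(calls, missing=-1):
    """True if both alleles are present among the observed 0/1/2 calls."""
    observed = [g for g in calls if g != missing]
    alt_alleles = sum(observed)
    return 0 < alt_alleles < 2 * len(observed)


def snps_polymorphic_in_every_group(genotypes, groups, missing=-1):
    """Return column indices of SNPs that are polymorphic within each group."""
    members = {}
    for idx, label in enumerate(groups):
        members.setdefault(label, []).append(idx)
    keep = []
    for j in range(len(genotypes[0])):
        if all(is_polymorphic([genotypes[i][j] for i in rows], missing)
               for rows in members.values()):
            keep.append(j)
    return keep
```

With very few individuals per species, this filter can discard most SNPs, so the retained count is worth checking before running the PCA on the subset.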

1

Finally it's working!

In the article you shared, it says:

"Principal Components Analysis (PCA) is a tool that has been used to infer population structure in genetic data for several decades, long before the GWAS era [17–20]. It should be noted that top PCs do not always reflect population structure: they may reflect family relatedness [19], long-range LD (for example, due to inversion polymorphisms [4]), or assay artifacts [10]; these effects can often be eliminated by removing related samples, regions of long-range LD, or low-quality data, respectively, from the data used to compute PCs. In addition, PCA can highlight effects of differential bias that require additional quality control [21]."

This makes sense because the filtering options you use on your VCF will actually change the "story" you want to tell. I don't know exactly what the best practice is in this case. But in my study, the species are related. The biggest problem with PCAs, I guess, would be the interpretation. Right? But it would generally give an idea of the relatedness of the individuals in the analysis.

I used LEA instead of EIGENSTRAT. Here are some explanations.

If the PCA is not working, it might be because there is a problem in the encoding of the .geno file.

For example, I had a critical problem concerning the last line in my .geno file.

You NEED to have an empty line at the end of the .geno file.

If RStudio just aborts, run the lines in plain R (the small GUI) instead. The error message there gives more details about the error.

If there isn't an empty line at the end of the file, it might create a "memory leak" which crashes RStudio.
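
The end-of-file check can be automated. This is a minimal sketch, assuming that "an empty line at the end" means the file must end with a newline character; `ensure_trailing_newline` is a hypothetical helper, not part of LEA.

```python
import os


def ensure_trailing_newline(path):
    """Append a newline if the file's last byte is not one.

    Returns True if the file was modified, False otherwise.
    """
    if os.path.getsize(path) == 0:
        return False  # nothing to fix in an empty file
    with open(path, "rb+") as handle:
        handle.seek(-1, os.SEEK_END)
        if handle.read(1) == b"\n":
            return False
        handle.seek(0, os.SEEK_END)
        handle.write(b"\n")
        return True
```

Running it on the `.geno` file before loading it in R is harmless when the newline is already there, since the file is then left untouched.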