Question: Format .ped and .map files for PLINK with bacteria
Seb_Lopez wrote (11 months ago):


I am fairly new to the field. I want to know two things:

Has anyone worked with PLINK for association studies in bacteria? The case is that I have a gene presence/absence table and want to assess whether one of those genes is significantly related to a particular phenotype. Is this possible with PLINK? I actually saw someone do it, and I would like to understand the rationale behind the formatting of the .ped and .map files as well as the analysis.

As far as I remember, the affected (case) and unaffected (control) groups are my bacterial phenotypes, but there's more than that. I think there are some columns to add to those files.

If someone has more experience, please let me know.

Not sure if this is the appropriate place to ask this. If not, my apologies.

Kevin Blighe (Guy's Hospital, London) wrote (11 months ago):

Bonjour / Bonsoir. Which format is your data currently in? While I have not heard of anyone using plink for bacteria, I do not doubt its utility in such a situation. Plink's basic association test is just a χ2 (chi-square) approximation that looks at allele tallies in cases and controls (or whatever phenotype(s) you're measuring). Other tests, like the family-based tests, would obviously not be suitable. If you look at my recent answer, you will see how you can easily conduct the test yourself: A: SNP dataset and Z Score
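To make the "conduct the test yourself" idea concrete, here is a minimal Python sketch of a Pearson chi-square test on a 2x2 table of gene presence/absence counts in cases and controls. It is not PLINK's own code; the function name `chi2_2x2` and the counts are made up for illustration, and no continuity correction is applied.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table

                 gene present   gene absent
        cases         a              b
        controls      c              d

    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: the gene is present in 18 of 20 cases and 5 of 20 controls.
stat, p_value = chi2_2x2(18, 2, 5, 15)
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

With small counts (expected cell counts below about 5), an exact test is usually preferred over this approximation.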

Otherwise, here is information on the formatting:

If you create mydata.ped and mydata.map, you can then load these into plink with:

plink --file mydata
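For a gene presence/absence table, one way to produce those two files is sketched below in Python. A .ped line starts with six columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype with 1 = unaffected and 2 = affected), followed by two alleles per marker; a .map line has chromosome, marker ID, genetic distance and base-pair position. Everything else here is an assumption for illustration: the function name `write_ped_map`, the dummy chromosome "1", the dummy positions, and the choice to write presence as "A A" and absence as "C C" so that a haploid call looks like a homozygous diploid genotype. Sex is written as 0 (unknown), so check the PLINK documentation for how such samples are treated in the analysis you run.

```python
import os
import tempfile

def write_ped_map(prefix, samples, genes, presence, phenotype):
    """Write prefix.ped and prefix.map from a presence/absence table.

    samples:   list of sample (isolate) IDs
    genes:     list of gene names, one pseudo-marker per gene
    presence:  presence[sample][gene] is 1 (present) or 0 (absent)
    phenotype: phenotype[sample] is 1 (control) or 2 (case)
    """
    with open(prefix + ".map", "w") as fh:
        for pos, gene in enumerate(genes, start=1):
            # chromosome, marker ID, genetic distance, base-pair position
            # (chromosome and position are placeholders, not real coordinates)
            fh.write(f"1\t{gene}\t0\t{pos}\n")
    with open(prefix + ".ped", "w") as fh:
        for s in samples:
            # family ID, individual ID, father, mother, sex (0 = unknown), phenotype
            fields = [s, s, "0", "0", "0", str(phenotype[s])]
            for gene in genes:
                allele = "A" if presence[s][gene] else "C"  # arbitrary allele codes
                fields += [allele, allele]  # haploid call written as homozygous
            fh.write("\t".join(fields) + "\n")

# Hypothetical toy data: three isolates, two genes.
samples = ["strain1", "strain2", "strain3"]
genes = ["geneX", "geneY"]
presence = {
    "strain1": {"geneX": 1, "geneY": 0},
    "strain2": {"geneX": 0, "geneY": 0},
    "strain3": {"geneX": 0, "geneY": 1},
}
phenotype = {"strain1": 2, "strain2": 1, "strain3": 1}

prefix = os.path.join(tempfile.mkdtemp(), "mydata")
write_ped_map(prefix, samples, genes, presence, phenotype)
```

The resulting prefix can then be passed to `plink --file`, as in the command above.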




Thanks for your reply, Kevin. My data is a table with groups of bacteria in the rows and genes or gene families in the columns. When I mentioned phenotypes in the original question, I actually meant "taxa". So my idea is that I can use PLINK to show that certain genes are uniquely present in certain closely related groups of bacteria (say, subspecies or strains), or that they are "associated" with a particular taxon. For example: species 1 has gene X, which is not present in species 2, 3, 4 and 5. I am guessing ploidy is a limitation that could be addressed by formatting the data table so that it resembles a diploid organism. Here is an example of the table I have.

Converting to binary is a must, as far as I remember. After that I'm quite lost.

Thanks for your help

written 11 months ago by Seb_Lopez

I see. I am beginning to think that you should do this entirely outside of Plink, using some of the tests that I mentioned in my other thread. With those, you can see whether a gene is more frequent in a particular bacterium or taxon. What do you think?
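As one possible way to do this outside of Plink, the sketch below runs a two-sided Fisher's exact test for a single gene in one taxon versus all the others. It is an illustration only, not necessarily one of the tests from the other thread; the function names, isolate IDs and counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def gene_vs_taxon(table, taxa, gene, taxon):
    """Count presence/absence of `gene` inside `taxon` versus all other taxa.

    table[sample][gene] is 0/1 and taxa[sample] is a taxon label.
    Returns ((a, b, c, d), p_value).
    """
    a = b = c = d = 0
    for sample, calls in table.items():
        present = bool(calls[gene])
        if taxa[sample] == taxon:
            a, b = a + present, b + (not present)
        else:
            c, d = c + present, d + (not present)
    return (a, b, c, d), fisher_exact_two_sided(a, b, c, d)

# Hypothetical data: geneX is present in all 4 isolates of species1
# and absent from the 6 isolates of the other species.
table = {f"iso{i}": {"geneX": 1 if i < 4 else 0} for i in range(10)}
taxa = {f"iso{i}": "species1" if i < 4 else "other" for i in range(10)}
counts, p_value = gene_vs_taxon(table, taxa, "geneX", "species1")
print(counts, p_value)  # (4, 0, 0, 6) and p = 1/210, about 0.0048
```

When many genes are tested this way, the p-values would need a multiple-testing correction.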

Another thing that you could do with your data is to define a gene signature that could be used as a sort of 'identifier' of the taxa that you are aiming to distinguish. For example, you could ultimately say that Gene1+Gene4+Gene7+Gene8 can statistically distinguish Taxa1 from Taxa2 (AUC, 0.95; cross-validated r^2, 0.6). If you want to learn more about that, you can take a look here: C: Resources for gene signature creation
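To give a feel for what such a signature could look like in practice, here is a small Python sketch that scores each isolate by how many signature genes it carries and computes the AUC for separating two taxa. The scoring rule (a simple count), the function names and the toy data are all assumptions for illustration; a real signature would be derived and cross-validated as described in the linked thread.

```python
def signature_score(calls, signature):
    """Number of signature genes present in one isolate (calls[gene] is 0/1)."""
    return sum(1 for gene in signature if calls.get(gene, 0))

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half (equivalent to the Mann-Whitney U statistic / n1*n2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

signature = ["Gene1", "Gene4", "Gene7", "Gene8"]

# Hypothetical presence/absence calls for three isolates per taxon.
taxa1 = [
    {"Gene1": 1, "Gene4": 1, "Gene7": 1, "Gene8": 1},
    {"Gene1": 1, "Gene4": 1, "Gene7": 0, "Gene8": 1},
    {"Gene1": 1, "Gene4": 1, "Gene7": 1, "Gene8": 1},
]
taxa2 = [
    {"Gene1": 0, "Gene4": 1, "Gene7": 0, "Gene8": 0},
    {"Gene1": 0, "Gene4": 0, "Gene7": 0, "Gene8": 0},
    {"Gene1": 1, "Gene4": 1, "Gene7": 1, "Gene8": 0},
]

scores1 = [signature_score(c, signature) for c in taxa1]
scores2 = [signature_score(c, signature) for c in taxa2]
signature_auc = auc(scores1, scores2)
print(scores1, scores2, round(signature_auc, 3))
```

An AUC computed on the same data used to pick the genes will be optimistic, which is why the cross-validated figure matters.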

Not sure if that helps.

written 11 months ago by Kevin Blighe

I will take a look at that. Maybe you are right and PLINK is not the most straightforward answer to this question. I'll post an update on my progress if necessary. Thanks again.

written 11 months ago by Seb_Lopez

Okay - please come back when you have updated information.

written 11 months ago by Kevin Blighe