Question

Looking for data


2.9 years ago

Mollie ▴ 10

Hi, I am wondering if someone can provide me with a data set to test across PLINK and GenEpi, and possibly some others. I need an x.gen file and an x.csv file. I'm comparing the methods of analysis and don't actually care about the data, so the smaller the better. Thanks

epistasis • 991 views

Updated 2.9 years ago by Kevin Blighe 87k • written 2.9 years ago by Mollie ▴ 10


Hello and good evening Mollie. Are you trying to follow a tutorial on the World Wide Web? There are test datasets here: https://www.cog-genomics.org/plink/1.9/resources

2.9 years ago by Kevin Blighe 87k


My problem with that data is that I also want to use it in SNP_TEST and GenEpi. SNP_TEST requires a .gen file and a .sample file of phenotypes, and GenEpi requires a .gen file and a .csv of phenotypes. I have figured out how to convert a .bed file to a .gen, but I am not sure about the phenotype part using those PLINK 1.9 provided sets. Thank you

2.9 years ago by Mollie ▴ 10


Can you please link us to the documentation page(s) for these programs where the phenotype file's format is described?

2.9 years ago by Kevin Blighe 87k


I have a hapmap1.gen file, and I think that the .bim version can be used as the phenotype because of the column arrangement? Here are the GenEpi docs: https://genepi.readthedocs.io/en/latest/format.html#input-phenotype-data

Trying to run the sample data for SNP_TEST gives me this error: !! Error: mismatch in column names or types between the sample files "./example/cohort1.sample", "./example/cohort2.sample". That doesn't make sense, since they should provide this data ready to go?

And SNP_TEST: https://www.well.ox.ac.uk/~gav/snptest/#input_file_formats

2.9 years ago by Mollie ▴ 10


Hey again, thanks! GenEpi states:

GenEpi takes the common .CSV file without header line as the input format for phenotype and environmental factor data. The last column of the file will be considered as the phenotype data (e.g. 1 or 0, which indicate case/control, respectively) and the other columns will be considered as the environmental factor data.

So, I guess that this means a file like this:

1,2,3,4,0
2,3,1,1,1
1,1,2,2,1
...
3,2,4,4,0

Here, the last column is the outcome, encoded 0 (control) and 1 (case), while the other columns are the environmental factors (or any other covariates that you may have). I guess that each row, then, is a sample, and that the row order should correspond to the sample order in your GEN file.
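If you are starting from one of the PLINK test datasets, note that it is the .fam file (not the .bim, which describes variants) that carries the per-sample phenotype in its sixth column, coded 1 for control and 2 for case. Below is a minimal Python sketch of how that could be turned into the header-less CSV described above. The function names are hypothetical, using sex as the single environmental-factor column is purely an illustration, and I have not confirmed how GenEpi treats such a file, so treat it as a starting point only.

```python
import csv

# PLINK .fam columns: FID, IID, father ID, mother ID, sex, phenotype.
# PLINK codes control/case as 1/2; 0 or -9 means missing.
PLINK_TO_BINARY = {"1": 0, "2": 1}


def fam_to_phenotype_rows(fam_lines):
    """Return one [sex, phenotype] row per sample, in the original order.

    Assumption: GenEpi wants 0/1 in the last column. Samples are never
    dropped, because that would break the row-to-sample correspondence
    with the GEN file; a missing phenotype raises an error instead.
    """
    rows = []
    for line_number, line in enumerate(fam_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 6:
            raise ValueError(
                f"line {line_number}: expected 6 columns, got {len(fields)}"
            )
        sex, phenotype = fields[4], fields[5]
        if phenotype not in PLINK_TO_BINARY:
            raise ValueError(
                f"line {line_number}: missing or non-binary phenotype {phenotype!r}"
            )
        rows.append([sex, PLINK_TO_BINARY[phenotype]])
    return rows


def write_headerless_csv(rows, csv_path):
    """Write the rows as a comma-separated file without a header line."""
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

For example, `write_headerless_csv(fam_to_phenotype_rows(open("hapmap1.fam")), "hapmap1.csv")`, assuming you have already made a hapmap1.fam with PLINK. It is still worth checking that the sample order in the .fam matches the order in the GEN file you converted.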

I am not sure what is happening with the other program, SNP_TEST...
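One way to narrow it down: a .sample file starts with a line of column names followed by a line of column types, and the error message says that one of these differs between the two cohorts. The following is a minimal Python sketch (hypothetical helper names, not part of SNP_TEST) that lists the columns where two sample files disagree. It will not tell you why the bundled example fails, only where the headers differ.

```python
from itertools import zip_longest


def parse_sample_header(lines):
    """Pair each column name (line 1) with its type code (line 2)."""
    if len(lines) < 2:
        raise ValueError("a .sample file needs a names line and a types line")
    names = lines[0].split()
    types = lines[1].split()
    if len(names) != len(types):
        raise ValueError(
            f"{len(names)} column names but {len(types)} column types"
        )
    return list(zip(names, types))


def header_mismatches(lines_a, lines_b):
    """Return (position, column_in_a, column_in_b) for each differing column.

    Positions are 1-based. A column present in only one file is reported
    with None on the other side.
    """
    columns_a = parse_sample_header(lines_a)
    columns_b = parse_sample_header(lines_b)
    problems = []
    for position, (a, b) in enumerate(zip_longest(columns_a, columns_b), start=1):
        if a != b:
            problems.append((position, a, b))
    return problems
```

To use it on the files from the error message, read each one with something like `open("./example/cohort1.sample").read().splitlines()` and pass the two lists to `header_mismatches`. An empty result would mean the first two lines agree and the problem lies elsewhere.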

I am acutely aware that using these programs is very frustrating. Apart from the fact that they all want different input formats, I have also come across situations where the documentation is incorrect and test data does not load as advertised.

2.9 years ago by Kevin Blighe 87k