How to get REF and ALT alleles from genotype data?
15 months ago

Dear experts, I have genotype data that I want to use for GWAS. It contains every column except the allele columns, i.e. the Ref and Alt alleles. All the other information is there: chromosome number, chromosome position, the genotypes in my samples, and so on. The data has already been aligned to the reference genome, but I don't know which allele is Ref and which is Alt. Is there any way to recover them? Is there any software that can extract the reference and alternative alleles? The data is not in any standard format; it's just a text file. I need the alleles for the association analysis.

snp genome R sequencing • 896 views

Can you please post a small sample of what this data looks like?


Sorry, my genotype data looks like this:

Chrom   position    sample1 sample2 sample3
1   1234        AA  TT  TT
1   56545       GG  AG  TT


There is no allele column. I filled in NAs to make it HapMap format and then converted it to VCF using Tassel. I filled every information column with NA, including the allele column, because some packages do not need the allele column for association, but it is needed to annotate the GWAS results, so I need to find the alleles before doing the GWAS. Below is the HapMap file I made by filling in NAs. I also converted it to VCF.

rs#    alleles    chrom    pos    strand    assembly#    center    protLSID    assayLSID    panelLSID    QCcode   sample 1
44509    NA    02     5565755     +    NA    NA    NA    NA    NA    NA    AG    AA    AA    AA
38019    NA    02     43878360     +    NA    NA    NA    NA    NA    NA    GG    GA    GG    GG
89440   NA    04     25220824     +    NA    NA    NA    NA    NA    NA    TC    TT    NN    TT


Sorry, I can't post a picture of the complete file; there is no option to upload images.


Do you mean that in the top sample above you want to know, for example, whether T or A is the REF allele (with the other being the ALT)?


Yes, that's what I am trying to find. I tried a method in Tassel to find the Ref and Alt alleles, but I am not sure whether it is right. I converted my text file to HapMap format so that Tassel could read it, filling the allele column with NA, and then converted it to VCF. The resulting VCF does contain Ref and Alt alleles, but Tassel assigns them on the basis of allele frequency, i.e. it takes the major allele as the REF allele. Is there another method that finds the alleles accurately?


I can think of a couple of ways. Is this human data? If so, you can probably use the rsIDs to look up the ref and alt alleles in dbSNP using (I would guess) BioMart. Or, if it's not human but you have a VCF of the known SNP locations in the genome, you can go through and match them up.

Finally, in the absence of all that, I'd guess you could write a script that uses the chromosome and position to look up the reference genome sequence at that location, then marks that base as the REF allele and the other as ALT.
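A minimal sketch of that last idea in Python, to show the logic rather than a finished tool. The names `infer_ref_alt` and `annotate` are hypothetical, and the reference genome is stood in for by a small in-memory dictionary; for real data you would read the bases from the reference FASTA instead. It assumes positions are 1-based and that the genotypes are reported on the same strand as the reference. If the reference base is not among the observed alleles, every observed allele ends up as an ALT.

```python
from collections import Counter


def infer_ref_alt(ref_base, genotypes):
    """Return (REF, [ALT, ...]) for one site.

    ref_base  -- base found in the reference genome at this position
    genotypes -- genotype strings such as "AA", "AG", "NN"
    """
    ref_base = ref_base.upper()
    observed = Counter()
    for gt in genotypes:
        for allele in gt.upper():
            if allele in "ACGT":  # skip missing calls such as N
                observed[allele] += 1
    # Every observed allele that is not the reference base is an ALT,
    # listed from most to least frequent (ties broken alphabetically).
    alts = sorted((a for a in observed if a != ref_base),
                  key=lambda a: (-observed[a], a))
    return ref_base, alts


def annotate(rows, reference):
    """rows: iterable of (chrom, pos, [genotypes]) with 1-based pos.
    reference: dict mapping chromosome name -> sequence string
    (a toy stand-in for a FASTA lookup)."""
    out = []
    for chrom, pos, gts in rows:
        ref = reference[chrom][pos - 1]  # convert to 0-based index
        out.append((chrom, pos) + infer_ref_alt(ref, gts))
    return out


# Toy data, invented for illustration only.
toy_reference = {"1": "ACGTACGTAC"}
toy_rows = [("1", 1, ["AA", "TT", "TT"]),
            ("1", 3, ["GG", "AG", "TT"])]
print(annotate(toy_rows, toy_reference))
```

Unlike the major-allele approach, this does not depend on allele frequencies in the sample, so a site where most samples carry the non-reference allele is still labelled correctly.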


How many lines do you have of this?


Thanks, I have 300K SNPs in my dataset. I cannot search by rsID because this is plant data, and I don't think plant SNPs have rsIDs the way human SNPs do. Therefore, I need to look for other ways to find the alleles.


Can you tell us what AA in sample1 and TT in sample2 represent?


AA and TT are the SNP genotypes in my samples.

14 months ago
Emily 23k

One option would be to use the Ensembl REST API with the overlap/region endpoint to fetch the alleles. Here's an example with wheat: http://rest.ensembl.org/overlap/region/triticum_aestivum/4A:714193714-714193714?content-type=application/json;feature=variation

You would need to use your favourite programming language to run through your list, query each position and fill in the gaps. I suspect this would take a long time to run.

An alternative with the Ensembl REST API would be to use the sequence/region endpoint to get the reference allele at this locus, and, as @i.sudbury suggests, use this to infer the alt. The benefit of this over the overlap endpoint is that it's also available as a POST endpoint allowing you to query in batches of 50, which would be quicker.
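A rough sketch of the batched sequence/region approach in Python, using only the standard library. The function names are hypothetical. The request body (`{"regions": [...]}` with regions written as `chrom:start..end:strand`) and the `query` and `seq` fields in the response are my understanding of the endpoint's documented format, so please check them against the current Ensembl REST documentation before relying on this.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"
BATCH_SIZE = 50  # batch size mentioned above


def make_batches(sites, size=BATCH_SIZE):
    """Turn (chrom, pos) pairs into single-base region strings on the
    forward strand and split them into lists of at most `size`."""
    regions = [f"{chrom}:{pos}..{pos}:1" for chrom, pos in sites]
    return [regions[i:i + size] for i in range(0, len(regions), size)]


def fetch_ref_bases(species, regions):
    """POST one batch of regions; return {region string: reference base}.
    Assumes each result carries 'query' and 'seq' fields."""
    request = urllib.request.Request(
        f"{SERVER}/sequence/region/{species}",
        data=json.dumps({"regions": regions}).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return {r["query"]: r["seq"] for r in results}


def fetch_all(species, sites):
    """Look up every site, one POST request per batch."""
    ref_bases = {}
    for batch in make_batches(sites):
        ref_bases.update(fetch_ref_bases(species, batch))
    return ref_bases
```

For example, `fetch_all("triticum_aestivum", [("4A", 714193714)])` would send a single request. With 300K SNPs this is about 6,000 requests, so it is worth pausing between batches and handling failed requests, neither of which this sketch does.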

Another way to go about it might be to find the reference VCF for your species and try to use something like VCFtools to intersect your data with the reference.
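If a reference VCF is available, the matching can also be done directly in a short script. This is only a sketch with hypothetical function names: it reads the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) into a dictionary and looks up each of your sites by chromosome and position. It assumes the chromosome names in your file match those in the VCF (for example `1` versus `chr1`). The VCF lines below are invented for illustration.

```python
def load_vcf_alleles(lines):
    """Map (chrom, pos) -> (REF, [ALT, ...]) from VCF text lines."""
    alleles = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        alleles[(chrom, int(pos))] = (ref, alt.split(","))
    return alleles


def match_sites(sites, alleles):
    """Return {site: (REF, ALTs)}; sites absent from the VCF map to None."""
    return {site: alleles.get(site) for site in sites}


# Invented example lines; a real file would be read with open(...).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1234\t.\tA\tT\t.\t.\t.",
    "1\t56545\t.\tG\tA,T\t.\t.\t.",
]
known = load_vcf_alleles(toy_vcf)
print(match_sites([("1", 1234), ("1", 56545), ("1", 99)], known))
```

Sites that come back as `None` are not in the reference VCF and would need one of the other approaches.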


Thanks, I understand.