Question: How to get REF and ALT alleles from genotype data?
10 days ago, biotechnology415 (0) wrote:

Dear experts, I have genotype data that I want to use for GWAS. It contains every column except the allele column, i.e. the Ref and Alt alleles. All the other information is there, such as the chromosome number, the position, and the alleles in my samples. The data has already been aligned to the reference genome, but I am confused about the Ref and Alt alleles. Is there any way to get them? Is there any software that can extract the reference and alternative alleles? The data is not in any standard format; it's just a text file. I need to find the alleles for the association analysis.

sequencing snp R genome • 122 views
modified 6 days ago by Emily_Ensembl (21k) • written 10 days ago by biotechnology415 (0)

Can you please post a small sample of what this data looks like?

written 10 days ago by i.sudbery (11k)

Sorry, my genotype data looks like this:

Chrom   position   sample1   sample2   sample3
1       1234       AA        TT        TT
1       56545      GG        AG        TT

There is no allele column. I tried filling in NAs to make it HapMap format and then converted it to VCF using Tassel. I filled every information column with NAs, including the allele column, because some packages do not need the allele column for association, but it is needed for annotating the GWAS results. I need to find the alleles before doing the GWAS. This is the HapMap-format file I made by filling in NAs (I also converted it to VCF):

rs#     alleles   chrom   pos        strand   assembly#   center   protLSID   assayLSID   panelLSID   QCcode   sample 1
44509   NA        02      5565755    +        NA          NA       NA         NA          NA          NA       AG   AA   AA   AA
38019   NA        02      43878360   +        NA          NA       NA         NA          NA          NA       GG   GA   GG   GG
89440   NA        04      25220824   +        NA          NA       NA         NA          NA          NA       TC   TT   NN   TT

Sorry, I am unable to upload a complete picture; there is no option to upload images.

modified 7 days ago by i.sudbery (11k) • written 8 days ago by biotechnology415 (0)

Do you mean that in the top sample above you want to know (for example) whether T or A is the REF allele (with the other being the ALT)?

written 7 days ago by i.sudbery (11k)

Yes, that's what I am trying to find. I tried a method in Tassel to find the Ref and Alt alleles, but I am not sure whether it is right. I converted my text file to HapMap format so that Tassel could read it, filling the allele column with NAs, and then converted it to VCF. Done this way, the VCF contains Ref and Alt alleles. Tassel assigns alleles on the basis of allele frequency, i.e. the major allele becomes the REF allele. Is there any other method that can find the alleles accurately?
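To make the frequency-based assignment concrete, here is a minimal Python sketch of the idea. It is not Tassel's actual code, and the function name `rank_alleles_by_frequency` is made up for illustration. It counts alleles across the samples at one site and ranks them, which shows that the major allele is a property of the sample set and need not be the base found in the reference genome.

```python
from collections import Counter


def rank_alleles_by_frequency(genotypes):
    """Rank the alleles seen at one site from most to least frequent.

    genotypes: diploid calls such as ["AG", "AA", "NN"]; anything that is
    not A, C, G or T (for example N) is ignored.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(a for g in genotypes for a in g.upper() if a in "ACGT")
    return sorted(counts, key=lambda allele: (-counts[allele], allele))


# First row of the sample data in this thread: T is the major allele here,
# whatever base the reference genome carries at that position.
print(rank_alleles_by_frequency(["AA", "TT", "TT"]))  # ['T', 'A']
```

If the reference genome happens to carry the minor allele at a site, a frequency-based REF will disagree with a reference-based one, so it is worth knowing which convention your downstream annotation tool expects.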

modified 7 days ago • written 7 days ago by biotechnology415 (0)

I can think of a couple of ways. Is this human data? If so, you can probably use the rsIDs to look up the ref and alt alleles in dbSNP using (I would guess) BioMart. Or, if it's not human but you have a VCF of the known SNP locations in the genome, you can go through and match them up.

Finally, in the absence of all that, I'd guess you could write a script that uses the chromosome and position to look up the reference genome sequence at that position, then marks that base as the REF allele and the other as ALT.
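A rough Python sketch of that last script might look like the following. The function names are invented for illustration, and it assumes the reference FASTA is small enough to hold in memory, that sequence names in the FASTA match the chromosome names in the genotype file, that positions are 1-based, and that the genotypes are reported on the forward strand. If a site is on the reverse strand, neither observed allele may match the reference base, so such rows would need checking.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence_name: sequence} (all held in memory)."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def infer_ref_alt(ref_base, genotypes):
    """Return (REF, ALT): REF is the reference base, ALT the other observed
    alleles joined by commas, or "." if no non-reference allele was seen."""
    ref = ref_base.upper()
    observed = {a for g in genotypes for a in g.upper() if a in "ACGT"}
    alts = sorted(observed - {ref})
    return ref, (",".join(alts) if alts else ".")


def annotate_site(genome, chrom, pos, genotypes):
    """Look up the reference base at a 1-based position and infer REF/ALT."""
    return infer_ref_alt(genome[chrom][pos - 1], genotypes)
```

For a real plant genome you would probably want an indexed FASTA reader rather than loading everything into a dictionary, but the REF/ALT logic stays the same.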

written 7 days ago by i.sudbery (11k)

How many lines do you have of this?

written 7 days ago by Emily_Ensembl (21k)

Thanks. I have 300K SNPs in my dataset. I cannot search by rsID because this is plant data, and I think plant SNPs don't have rsIDs the way human SNPs do. Therefore, I need to look for other ways to find the alleles.

written 6 days ago by biotechnology415 (0)

Can you tell us what AA in sample1 or TT in sample2 is?

written 6 days ago by dare_devil (1.4k)

AA and TT are SNPs in my sample.

written 3 days ago by biotechnology415 (0)
6 days ago, Emily_Ensembl (21k), EMBL-EBI, wrote:

One option would be to use the Ensembl REST API's overlap/region endpoint to fetch the alleles. Here's an example with wheat: http://rest.ensembl.org/overlap/region/triticum_aestivum/4A:714193714-714193714?content-type=application/json;feature=variation

You would need to use your favourite programming language to loop through your list, query the endpoint for each position and fill in the gaps. I suspect this would take a long time to run.
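As a rough illustration of that loop in Python, the sketch below builds the same kind of URL as the wheat example and pulls the alleles out of the JSON reply. The function names are hypothetical, and the assumption that each returned record carries `id` and `alleles` fields should be checked against a real response before relying on it.

```python
import json
import urllib.request

SERVER = "http://rest.ensembl.org"  # host taken from the example URL above


def overlap_url(species, chrom, pos):
    """Build an overlap/region URL for a single position (variation features)."""
    return (f"{SERVER}/overlap/region/{species}/{chrom}:{pos}-{pos}"
            "?content-type=application/json;feature=variation")


def extract_alleles(records):
    """Pull (id, alleles) out of the decoded JSON list.

    Assumption: variation records expose "id" and "alleles" keys; records
    without an "alleles" key are skipped.
    """
    return [(rec.get("id"), rec.get("alleles")) for rec in records if "alleles" in rec]


def fetch_alleles(species, chrom, pos):
    """Query one site over the network; one HTTP request per SNP."""
    with urllib.request.urlopen(overlap_url(species, chrom, pos)) as resp:
        return extract_alleles(json.load(resp))
```

With 300K SNPs this means 300K separate requests, which is why it is likely to be slow, and it only returns something for positions where Ensembl already has a known variant.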

An alternative with the Ensembl REST API would be to use the sequence/region endpoint to get the reference allele at each locus and, as @i.sudbery suggests, use this to infer the alt. The benefit of this over the overlap endpoint is that it is also available as a POST endpoint, which lets you query in batches of 50 and would be quicker.
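A sketch of the batched version in Python could look like this. The helper names are made up, and two details are assumptions to verify against the Ensembl REST documentation: that the POST body is a JSON object with a `regions` list of `chrom:start..end` strings, and that each returned record has `query` and `seq` fields.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"
BATCH_SIZE = 50  # batch limit quoted in the answer above


def build_payloads(sites, batch_size=BATCH_SIZE):
    """Turn (chrom, pos) pairs into POST bodies of at most batch_size regions."""
    regions = [f"{chrom}:{pos}..{pos}" for chrom, pos in sites]
    return [{"regions": regions[i:i + batch_size]}
            for i in range(0, len(regions), batch_size)]


def fetch_reference_bases(species, sites):
    """POST each batch to sequence/region and return {region_string: base}.

    Makes network calls; the response field names are assumptions.
    """
    url = f"{SERVER}/sequence/region/{species}"
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    bases = {}
    for payload in build_payloads(sites):
        request = urllib.request.Request(
            url, data=json.dumps(payload).encode("utf-8"), headers=headers)
        with urllib.request.urlopen(request) as resp:
            for record in json.load(resp):
                bases[record["query"]] = record["seq"].upper()
    return bases
```

For 300K SNPs this comes to 6,000 requests instead of 300,000, and each returned base can then be treated as REF, with the other observed allele as ALT.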

Another approach might be to find the reference VCF for your species and use something like VCFtools to intersect your data with it.
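If you would rather do the matching in a script than with VCFtools, the same idea can be sketched in plain Python. The function names are hypothetical. It relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT) and assumes the chromosome names in the VCF match those in your genotype file (for example "2" versus "02" would need harmonising first).

```python
def load_vcf_alleles(lines):
    """Index VCF records as {(chrom, pos): (REF, ALT)}, skipping header lines."""
    alleles = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        alleles[(chrom, int(pos))] = (ref, alt)
    return alleles


def annotate_sites(sites, vcf_alleles):
    """Attach (REF, ALT) to each (chrom, pos); sites not in the VCF get NA."""
    return [(chrom, pos) + vcf_alleles.get((chrom, pos), ("NA", "NA"))
            for chrom, pos in sites]
```

Sites that come back as NA are positions the reference VCF does not know about, so they would still need one of the reference-sequence lookups described above.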

written 6 days ago by Emily_Ensembl (21k)

Thanks, I understood

written 3 days ago by biotechnology415 (0)