Question: Working with SNP data from NGS & exon sequencing
romsen wrote (7.1 years ago):


I have access to SNP and genotype information from NGS data, specifically Exon-Seq reads of 56 samples. For every sample a variant file exists in two formats: a .gff3 file and a kind of tab-delimited file. The data include, e.g., rsID, pos, CHR, refAllele, quality score, ...

This means 56 files, each with more than 400,000 SNPs. I know several tools for SNP data processing (PLINK, imputation tools) but have no idea how to use them with this kind of data. Perhaps you can help me and suggest some tools to create, e.g., ped/map files, or generally one genotype file covering all 56 samples for selected (or all) SNPs.

Are there standardized tools at all? Or does one have to use R, Unix, and Perl commands to cut, combine, and work with such data?


Tags: ngs, tools, snp, sequencing

Before getting into file conversion etc., what do you want to do with these data? Are these diseased individuals? Are you just trying to learn about NGS variant analysis? With a few more details you will get the answer you are looking for.

Reply by Zev.Kronenberg

Precisely. First comes the "what?", and then the "how?", not the other way round.

Reply by Jorge Amigo

Can you post a snippet of the tab-delimited file so we can see how it's structured? It could be VCF (though you'd likely have noticed the header). BTW, you can convert GFF3 to VCF (see this thread: Converting a SNP GFF3 file to VCF format) and then convert that into .ped and .map files with vcftools, if nothing else.
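A minimal sketch of that conversion path, assuming each per-sample GFF3 has already been converted to VCF via the thread linked above (all file names here are hypothetical):

```shell
# Compress and index each per-sample VCF so they can be merged
bgzip sample1.vcf && tabix -p vcf sample1.vcf.gz
bgzip sample2.vcf && tabix -p vcf sample2.vcf.gz

# Combine the per-sample VCFs into one multi-sample VCF
# (vcf-merge ships with vcftools' Perl utilities)
vcf-merge sample1.vcf.gz sample2.vcf.gz > merged.vcf

# Export PLINK-compatible .ped / .map files
vcftools --vcf merged.vcf --plink --out merged
```

The last step writes `merged.ped` and `merged.map`, which PLINK can read directly; repeat the compress/index step for all 56 samples before merging.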

Reply by Devon Ryan
Jorge Amigo (Santiago de Compostela, Spain) wrote (7.1 years ago):

I find Zev's comment very important, as it's not that rare to find people coming to you with NGS data, eyes wide open, even sweating from the challenge they're facing, and asking: "now what?". The high-throughput genotyping field has already defined some very interesting approaches for extracting association and linkage knowledge from this amount of data, but one of the most interesting strengths of NGS may be its variant discovery capability, which allows us to work with really rare variants in very large numbers. First you'll have to think about what question you want to ask of your data, and then you'll have to find out how to pose that question on a computer. In fact, the question should have been defined before deciding to go into NGS, but that's another story.

If you are asking how to deal with .gff3 variants (are we talking about SOLiD LifeScope's?), the best suggestion I can think of is to annotate them (with ANNOVAR, for instance), which will let you work with them later as tabulated files with enriched information. And if you want to extract knowledge from all those tabulated files (and the newly generated ones) at once, instead of going sample by sample, then yes, you will definitely need to create a tool to process them. If it's just for simple operations like combining, merging, and overlapping, scripting would do. If you want to go beyond that and draw conclusions from statistical inference, then you'll certainly have to think about dirtying your hands with R.
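For the simple combining/merging case, a short script really is enough. A minimal sketch in Python, assuming each per-sample file is tab-delimited with header columns named `rsID` and `genotype` (both names are assumptions; adjust to whatever your files actually use):

```python
import csv
import os

def merge_genotypes(paths, out_path):
    """Merge per-sample tab-delimited variant files into one
    rsID x sample genotype matrix (column names are assumptions)."""
    genotypes = {}  # rsID -> {sample name: genotype}
    samples = []
    for path in paths:
        # Use the file name (minus extension) as the sample name
        sample = os.path.splitext(os.path.basename(path))[0]
        samples.append(sample)
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                genotypes.setdefault(row["rsID"], {})[sample] = row["genotype"]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["rsID"] + samples)
        for rsid in sorted(genotypes):
            # "NA" marks SNPs not called in a given sample
            writer.writerow([rsid] + [genotypes[rsid].get(s, "NA") for s in samples])
```

Called with the 56 per-sample file paths, this produces one genotype table (rows = SNPs, columns = samples), which is the "one genotype file for 56 samples" the question asks for; filtering to selected SNPs is one extra condition in the loop.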


