Question

SNP Detection and Genotyping in NGS (Next Generation Sequencing)

4

14.3 years ago

Eric Normandeau 11k

Hi,

I am working on a few projects right now using 454 cDNA data. For these projects, we prepared the sequencing libraries ourselves so that each sequence (in practice, about 97% of them) possesses a tag identifying the individual it comes from. We can have, say, 20 individuals, each tagged with a 10-nucleotide code placed at the beginning of the sequence as part of the primers.

I then scan through the .fasta file with a Python program I wrote, which renames each sequence after the tag found in it (e.g. Tag01) and removes any primers present. With these sequences, we do a de novo assembly, and then we re-align the sequences to the consensus contigs created in the de novo step. We then export the alignments in ACE format, which I convert to FASTA format using Biopython. Then, the fun begins :)
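For readers following along, the tag-renaming and trimming step described above could be sketched roughly as follows. This is not the poster's actual program: the tag sequences, the primer, the `Tag01_readname` naming scheme and the function names are all hypothetical, and the sketch assumes the 10-nt tag is the very first thing in the read and must match exactly.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tag_and_trim(records, tags, primer="", tag_len=10):
    """Rename each read after its tag, then strip the tag and the primer.

    records: iterable of (name, sequence) pairs
    tags:    dict mapping tag sequence -> label (hypothetical values)
    primer:  primer sequence expected right after the tag (assumption)
    Reads whose first tag_len bases match no known tag are dropped.
    """
    for name, seq in records:
        label = tags.get(seq[:tag_len].upper())
        if label is None:
            continue  # untagged or mis-read tag
        trimmed = seq[tag_len:]
        if primer and trimmed.upper().startswith(primer.upper()):
            trimmed = trimmed[len(primer):]
        yield f"{label}_{name}", trimmed
```

With exact matching, a single sequencing error in the tag discards the read; a more tolerant version would allow one mismatch when the tags are far enough apart.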

Using a list of SNP positions and contig numbers, I extract the genotypes from each sequence with another Python program I made. I'll then have to parse my result to do the rest of the job (stats and further analysis). I also do sequence counts for each of the tags in order to do gene expression (remember this is from cDNA) with these data.
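A minimal sketch of this genotype extraction and read counting, under several assumptions that are not from the post: read names begin with the tag label followed by an underscore, each aligned read is padded with `-` so that its columns line up with the contig consensus, SNP positions are zero-based columns, and the `min_reads` and `min_fraction` thresholds are arbitrary illustrative values.

```python
from collections import Counter, defaultdict


def call_genotypes(aligned_reads, snps, min_reads=2, min_fraction=0.2):
    """Return {(contig, pos): {tag: genotype}} for the listed SNPs.

    aligned_reads: iterable of (read_name, contig, padded_sequence)
    snps:          iterable of (contig, zero-based column)
    """
    snps_by_contig = defaultdict(list)
    for contig, pos in snps:
        snps_by_contig[contig].append(pos)

    # (contig, pos, tag) -> Counter of observed bases
    counts = defaultdict(Counter)
    for name, contig, seq in aligned_reads:
        tag = name.split("_", 1)[0]
        for pos in snps_by_contig.get(contig, ()):
            if pos < len(seq) and seq[pos].upper() in "ACGT":
                counts[(contig, pos, tag)][seq[pos].upper()] += 1

    genotypes = defaultdict(dict)
    for (contig, pos, tag), alleles in counts.items():
        depth = sum(alleles.values())
        # Keep at most the two best-supported alleles above the threshold.
        kept = [b for b, n in alleles.most_common(2) if n / depth >= min_fraction]
        if depth < min_reads or not kept:
            genotypes[(contig, pos)][tag] = "NA"
            continue
        if len(kept) == 1:
            kept = kept * 2  # homozygous call
        genotypes[(contig, pos)][tag] = "/".join(sorted(kept))
    return dict(genotypes)


def count_reads_per_tag(aligned_reads):
    """Count reads per (contig, tag), a crude proxy for expression."""
    return Counter((contig, name.split("_", 1)[0])
                   for name, contig, _ in aligned_reads)
```

Because the data are cDNA, low coverage or unequal expression of the two alleles can make a heterozygote look homozygous, so calls from only a few reads deserve caution.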

My question is: How would you do the SNP individual genotyping? I really like the Python coding part, but I wonder if there is not an already made solution for just that.

Thanks!

genotyping next-gen-sequencing python snp • 8.3k views

updated 6 months ago by Ram 44k • written 14.3 years ago by Eric Normandeau 11k

0

With these sequences, we do a de novo alignment

Are you sure this is correct? I think you mean assembly?

updated 6 months ago by Ram 44k • written 14.3 years ago by Michael 54k

0

Indeed. Corrected!

14.3 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2010-04-10

2

14.3 years ago

Mikael Huss 4.8k

This is perhaps a bit convoluted, but if you can get your assembly into SAM format, running SAMtools with the pileup subcommand and then VarScan on the pileup output will help you with SNP detection and stats.
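To make the pileup step less opaque, here is a rough Python sketch of how allele counts can be read off one pileup line. It assumes the plain six-column layout (sequence name, position, reference base, depth, read bases, qualities); consensus-calling modes add extra columns, so check your own output first. The function names are made up for illustration and this is not what VarScan does internally.

```python
import re
from collections import Counter

_INDEL = re.compile(r"[+-](\d+)")


def count_pileup_bases(ref, bases):
    """Count alleles in the read-bases column of one pileup line."""
    ref = ref.upper()
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # read-start marker plus its mapping-quality character
        elif c in "+-":
            m = _INDEL.match(bases, i)
            # Skip the length digits and the inserted/deleted bases.
            i = m.end() + int(m.group(1)) if m else i + 1
        else:
            if c in ".,":
                counts[ref] += 1  # match on forward/reverse strand
            elif c == "*":
                counts["*"] += 1  # base deleted in this read
            elif c.upper() in "ACGTN":
                counts[c.upper()] += 1  # mismatch
            i += 1  # also skips '$' (read end) and other symbols
    return counts


def parse_pileup_line(line):
    """Split a six-column pileup line and count its alleles."""
    name, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return name, int(pos), ref, int(depth), count_pileup_bases(ref, bases)
```

Running this on a separate pileup per individual gives per-individual allele counts at each position, which can then be turned into a genotype table.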

I am not sure how to get from ACE to SAM, but this SEQanswers thread seems to contain a solution for that.

updated 8 months ago by Ram 44k • written 14.3 years ago by Mikael Huss 4.8k

0

But it seems that while VarScan can give you differences between two pools of individuals, it is unable to output a table with each individual's genotype at each SNP.

14.3 years ago by Yannick Wurm ★ 2.5k

Ram · Answer 2 · 2010-04-23

2

14.3 years ago

Yannick Wurm ★ 2.5k

The following article and accompanying code (get the latest version from her website) aim to do what you want: the output should be a table of individuals and genotypes. You first need to trim the tags and separately map each individual's reads to the "reference".

http://genome.cshlp.org/content/20/4/537.abstract

However, I haven't gotten it to work yet, perhaps because my dataset is large. I'll keep you posted on how things work out.

updated 8 months ago by Ram 44k • written 14.3 years ago by Yannick Wurm ★ 2.5k

0

Thanks. I downloaded the code and will look into it.

14.3 years ago by Eric Normandeau 11k

score 2 · Answer 3 · 2010-04-26

If the tag is at the beginning of the read, you can use sfffile with an MID file to split the reads. Then map each individual using runMapping and it will give you the SNP calls. More about this in the 454 software manual.

If you go another route (e.g. samtools), you will find your data is infested with false indels due to the 454 platform's error profile.
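Most 454 indel errors are miscalled homopolymer lengths, so one crude safeguard is to flag indels that just lengthen or shorten a run of identical bases. The sketch below assumes you already have the reference (or contig consensus) sequence and, for each indel, the zero-based index of the reference base immediately before it. The function names and the `min_run` threshold are illustrative only, not taken from any existing tool.

```python
def run_length_at(seq, pos):
    """Length of the run of identical bases that contains seq[pos]."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    return end - start + 1


def is_homopolymer_indel(ref, pos, indel, min_run=3):
    """True if the indel only changes the length of a homopolymer run.

    ref:   reference or consensus sequence
    pos:   zero-based index of the reference base just before the indel
    indel: inserted or deleted bases, e.g. "T" or "AA"
    """
    ref, indel = ref.upper(), indel.upper()
    if len(set(indel)) != 1:
        return False  # empty or mixed-base indel: not a simple run change
    base = indel[0]
    # Check the run ending at pos and the run starting right after it.
    for p in (pos, pos + 1):
        if 0 <= p < len(ref) and ref[p] == base and run_length_at(ref, p) >= min_run:
            return True
    return False
```

Flagged indels are not necessarily false, but they are the ones most worth checking by eye or requiring more read support for.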

score 1 · Answer 4 · 2010-07-05

I have Python software to convert Newbler's output to SAM/BAM, decode MIDs robustly, mark duplicate reads in flow-space, call individual variants, output genotypes for all individuals in a set, and annotate the variants with simple functional information. The code has yet to be formally released, but please feel free to contact me at jacobs@bioinformed.com if you'd like to give it a try. The majority of the code is Python, though with liberal use of Cython (www.cython.org) to improve performance.