Question: SNP Detection and Genotyping in NGS (Next Generation Sequencing)
4
8.6 years ago by
Eric Normandeau10.0k
Quebec, Canada
Eric Normandeau10.0k wrote:

Hi,

I am working on a few projects right now using 454 cDNA data. For these projects, we prepared the sequencing libraries ourselves so that each sequence (in practice, about 97% of them) possesses a tag identifying the individual it came from. We can have, let's say, 20 individuals, each tagged with a 10-nucleotide code placed at the beginning of the sequence as part of the primers.

I then scan through the .fasta file with a Python program I wrote that renames each sequence after the tag it carries (e.g. Tag01) and removes any primers present. With these sequences, we do a de novo assembly, then re-align the sequences to the consensus contigs it produced. We then export the alignments in ACE format, which I convert to FASTA format using Biopython. Then, the fun begins :)
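For readers who want something concrete, here is a minimal sketch of the tag-renaming step described above. It is not the poster's actual program: the tag sequences, the primer, the read names, and the assumption that the tag comes first and is immediately followed by the primer are all hypothetical, and it only accepts exact tag matches.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tag_read(name, seq, tags, primer=""):
    """Rename a read after its leading tag and strip the tag and primer.

    Returns (new_name, trimmed_sequence), or None when no known tag
    starts the read. Only exact matches are recognised.
    """
    for tag_name, tag_seq in tags.items():
        if seq.startswith(tag_seq):
            trimmed = seq[len(tag_seq):]
            if primer and trimmed.startswith(primer):
                trimmed = trimmed[len(primer):]
            return f"{tag_name}_{name}", trimmed
    return None


# Hypothetical 10-nt tags and primer, for illustration only.
TAGS = {"Tag01": "ACGAGTGCGT", "Tag02": "ACGCTCGACA"}
PRIMER = "AAGCAGTGGT"

fasta = """>read1
ACGAGTGCGTAAGCAGTGGTTTGACCA
>read2
ACGCTCGACATTGACCA
>read3
GGGGGGGGGGTTGACCA
""".splitlines()

for name, seq in read_fasta(fasta):
    tagged = tag_read(name, seq, TAGS, PRIMER)
    if tagged is None:
        continue  # untagged read: skipped in this sketch
    print(">%s\n%s" % tagged)
```

With real 454 data, exact matching will probably miss tags that contain sequencing errors, so some tolerance for mismatches may be worth adding.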

Using a list of SNP positions and contig numbers, I extract the genotypes from each sequence with another Python program I wrote. I then have to parse the results for the rest of the job (stats and further analysis). I also count sequences per tag in order to estimate gene expression (remember this is cDNA) from these data.
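The genotype extraction and per-tag counting described above could be sketched roughly as follows. Everything here is an assumption made for illustration: reads are given as (name, contig, start, aligned sequence) tuples with 0-based consensus coordinates and "-" for gaps, read names start with the tag, and the genotype rule (minimum depth of 2, alleles kept at 20% frequency or more) is a naive placeholder rather than a validated caller.

```python
from collections import Counter, defaultdict


def base_counts(reads, snps):
    """Count bases at each SNP, separately for each individual (tag).

    reads: iterable of (name, contig, start, aligned_seq); name begins
           with the tag (e.g. "Tag01_read7"), start is the 0-based
           consensus coordinate of the first aligned column.
    snps:  iterable of (contig, position) pairs, 0-based.
    """
    wanted = defaultdict(list)
    for contig, pos in snps:
        wanted[contig].append(pos)
    counts = defaultdict(lambda: defaultdict(Counter))
    for name, contig, start, seq in reads:
        tag = name.split("_", 1)[0]
        for pos in wanted.get(contig, ()):
            offset = pos - start
            # Ignore reads that do not cover the site, gaps, and Ns.
            if 0 <= offset < len(seq) and seq[offset] in "ACGT":
                counts[(contig, pos)][tag][seq[offset]] += 1
    return counts


def call_genotype(counter, min_depth=2, min_fraction=0.2):
    """Naive diploid call from one individual's base counts (assumed rule)."""
    depth = sum(counter.values())
    if depth < min_depth:
        return "NA"
    alleles = sorted(b for b, n in counter.items() if n / depth >= min_fraction)
    if len(alleles) == 1:
        return alleles[0] * 2
    if len(alleles) == 2:
        return "".join(alleles)
    return "NA"


# Toy alignment: four reads from two individuals on one contig.
reads = [
    ("Tag01_r1", "contig1", 0, "ACGTAC"),
    ("Tag01_r2", "contig1", 2, "GTAC"),
    ("Tag02_r3", "contig1", 0, "ACGGAC"),
    ("Tag02_r4", "contig1", 1, "CGTAC"),
]
snps = [("contig1", 3)]

counts = base_counts(reads, snps)
for site in snps:
    for tag in sorted(counts[site]):
        print(site[0], site[1], tag, call_genotype(counts[site][tag]))

# Reads per (tag, contig), a crude stand-in for the expression counts.
reads_per_tag = Counter((name.split("_", 1)[0], contig) for name, contig, _, _ in reads)
```

Because the data are cDNA, allele counts reflect expression as well as genotype, so fixed thresholds like these should be treated with caution.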

My question is: how would you do the individual SNP genotyping? I really like the Python coding part, but I wonder whether a ready-made solution already exists for just that.

Thanks!

ADD COMMENTlink modified 8.6 years ago by User 927610 • written 8.6 years ago by Eric Normandeau10.0k

Are you sure this is correct? "With these sequences, we do a de novo alignment" I think you mean assembly?

ADD REPLYlink written 8.6 years ago by Michael Dondrup45k

Indeed. Corrected!

ADD REPLYlink written 8.6 years ago by Eric Normandeau10.0k
2
8.6 years ago by
Mikael Huss4.6k
Stockholm
Mikael Huss4.6k wrote:

This is perhaps a bit convoluted, but if you could get your assembly into SAM format, running SAMtools with the pileup subcommand and then VarScan on the pileup output would help you deal with SNP detection and stats.
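To give an idea of what the pileup output contains, here is a small sketch that tallies the bases in the read-bases column of one pileup line. It assumes the plain six-column layout (sequence name, position, reference base, depth, read bases, qualities); the example line is made up, and this is not how VarScan itself processes the file.

```python
import re
from collections import Counter

_INDEL_LENGTH = re.compile(r"[+-](\d+)")


def pileup_base_counts(ref_base, bases):
    """Count the bases observed at one site from a pileup read-bases string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # read start marker plus its mapping-quality character
        elif c in "+-":
            m = _INDEL_LENGTH.match(bases, i)
            if m is None:
                raise ValueError("malformed indel in pileup string")
            i = m.end() + int(m.group(1))  # skip the inserted/deleted bases
        elif c in ".,":
            counts[ref_base.upper()] += 1  # match on forward/reverse strand
            i += 1
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1  # mismatch; case encodes the strand
            i += 1
        else:
            i += 1  # "$" (read end), "*" (deleted base) and similar markers
    return counts


# Made-up example line in the assumed six-column layout.
line = "contig1\t4\tT\t6\t..,^IG$g+2ACa\tIIIIII"
fields = line.split("\t")
site_counts = pileup_base_counts(fields[2], fields[4])
print(fields[0], fields[1], dict(site_counts))
```

Note that this counts all reads together; to get one genotype per individual, the reads would have to be piled up separately for each tag.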

I am not sure how to get from ACE to SAM, but this SEQanswers thread seems to contain a solution for that.

ADD COMMENTlink written 8.6 years ago by Mikael Huss4.6k

But it seems that while VarScan can give you differences between two pools of individuals, it is unable to output a table with each individual's genotype at each SNP.

ADD REPLYlink written 8.6 years ago by Yannick Wurm2.3k
2
8.6 years ago by
Yannick Wurm2.3k
Queen Mary University London
Yannick Wurm2.3k wrote:

The following article and accompanying code (get the latest version from her website) aim to do what you want: the output should be a table of individuals/genotypes. You first need to trim the tags and separately map each individual's reads to the "reference".

http://genome.cshlp.org/content/20/4/537.abstract

However, I haven't gotten it to work yet, perhaps because my dataset is large. I'll keep you posted on how things work out.

ADD COMMENTlink written 8.6 years ago by Yannick Wurm2.3k

Thanks. I downloaded the code and will look into it.

ADD REPLYlink written 8.6 years ago by Eric Normandeau10.0k
2
8.6 years ago by
Casbon3.2k
Casbon3.2k wrote:

If the tag is at the beginning of the read, you can use sfffile with an MID file to split the reads. Then map each individual using runMapping and it will give you the SNP calls. More about this in the 454 software manual.

If you go another route (e.g. SAMtools), you will find your data is infested with false indels due to the 454 platform's error modality.

ADD COMMENTlink written 8.6 years ago by Casbon3.2k
1
8.4 years ago by
User 927610
User 927610 wrote:

I have Python software to convert Newbler's output to SAM/BAM, decode MIDs robustly, mark duplicate reads in flow-space, call individual variants, output genotypes for all individuals in a set, and annotate the variants with simple functional information. The code has yet to be formally released, but please feel free to contact me at jacobs@bioinformed.com if you'd like to give it a try. The majority of the code is Python, though with liberal use of Cython (www.cython.org) to improve performance.

ADD COMMENTlink written 8.4 years ago by User 927610