Question: Creating a "reference sequence" using velvet?
0
gravatar for jt.bricker
4.8 years ago by
jt.bricker0
United States
jt.bricker0 wrote:

I'm pretty novice when it comes to bioinformatics. I am currently working on a project where I am trying to call SNPs from frog species which have not been completely sequenced. I have RNA from 10 different tissues among 5 closely related frog species.

I've tried aligning these sequences to a reference genome of a more distantly related frog that has been completely sequenced but the results are very poor (2-3% of the reads are mapped to the reference genome).

I thought perhaps I may be able to do a de novo assembly of the RNA sequences of each tissue and use the resulting contigs as a "reference genome" for the alignment.

Does anyone have any experience with this? Is this a reasonable way to call SNPs from data without a reference genome?

ADD COMMENTlink modified 4.8 years ago by pierre.peterlongo840 • written 4.8 years ago by jt.bricker0
1
gravatar for pierre.peterlongo
4.8 years ago by
France
pierre.peterlongo840 wrote:

Hi there,

If you're curious and/or if you cannot achieve a confident enough assembly, I may suggest to try using discoSnp which detects SNPs from raw read sets without a reference genome.

pros: needs no memory - fast - easy to use - 1 to n datasets - similar results than assembly+mapping approaches

cons: snps are output with no genome location (as no reference is provided).

It has not been tested on RNA seq data yet. The ranking score, based on allele coverages will certainly not be efficient. However, we (authors of this work) would be happy to guide you using the tool and analyzing the results.

http://colibread.inria.fr/software/discosnp/

 

 

ADD COMMENTlink written 4.8 years ago by pierre.peterlongo840
1

Thank you for the fantastic tool !

I am working on it currently and have a small doubt with the genotypes output file.

Please see it below:

GENOTYPES_SNP_38543_THRESHOLD_10 0 0 TAAAATATGAGAAATAGTTGTGTTGATATAA/GGGTAAAAGTGTAAGGTTTTTTGAAAATTTA
GENOTYPES_SNP_36100_THRESHOLD_10 -1 0 ACGTCCAAGATCAAGTCGTCACCAGGGATGT/GTGTTCACTGTTGAAGGCTTTGACTTCAGTG
GENOTYPES_SNP_27996_THRESHOLD_10 -1 1 ATCGAACAATTGAAACTGATGTTTCAAGTTC/TAGGCTAAGCAAGAGCGTTTTGCAACCATGA
GENOTYPES_SNP_22993_THRESHOLD_10 0 0 CTCCTTAGATGAAATCTCCTTAAATATACCC/GGGGCTTCCAGCTATTAATATGGAATACCAG
GENOTYPES_SNP_10511_THRESHOLD_10 2 0 AGAGGAGTTTACGGCCCAAGCATATTCTCGA/GGAAATAAACTCATGTTTACTCTCGAAAATG
GENOTYPES_SNP_42798_THRESHOLD_10 2 2 CCCGGATGCAAATCCGATAAGGGCCAGCACC/TGGTGCGTTAGACCTTGTTGATATAGTGAGG
GENOTYPES_SNP_39034_THRESHOLD_10 1 -1 CCATTTTAAGGCCATTTGAGCCAAATCCCCC/TATGAAAGGAAGTAAAGTTCGTAGCTTTACG
GENOTYPES_SNP_67935_THRESHOLD_10 -1 1 ACGGTCCTTATTAGCTGTAAGTGACTAATTA/CCCAATGCGAATCTCGGTAACTCATGAATTA
GENOTYPES_SNP_60582_THRESHOLD_10 -1 1 CAATCATCAAGTGACTCAGACTCCGAGTACT/GTTCCTACTGGTCATAAACAACCCATTCCTA

I am trying to compare 2 samples and ended up with a lot of combinations of genotype coverage values i.e., (2 2) / (2 1) / (2 0)/ (2 -1) / (1 -1) etc.. I understand that I can choose (1 -1) combination. Please let me know what other combinations can I choose?

Many Thanks,

Siva

ADD REPLYlink written 4.7 years ago by sivawarwick10

Hi there,

Thanks for this question Siva.

A value is associated to each SNP and to each sample (explaining why you have 2 values per SNP here while usng two samples).

You've indicated a coverage threshold T to the "genotyper".

Given one sample, for each allele of the SNP you have a coverage. Thus you end with two coverage values per SNP.

 Code 0 indicates that both coverages are < T.
 Code 2 indicates that both coverages are >= T
 Code 1 indicates that only one of the two paths is >=T. Code -1 is symmetrical, depending on which of the two allele is >=T. 

Choosing (1 -1) or (-1 1) is a way to conserve SNPs that are homozygous and distinct in the two samples.
Choosing (2 2) is a way to conserve SNPs that are heterozygous in both samples,

...

 

Best, Pierre
 

 

ADD REPLYlink written 4.7 years ago by pierre.peterlongo840
0
gravatar for RamRS
4.8 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

Hello there,

I followed quite a similar approach for my project. This is a cool approach, but you need to be really confident of your assembly. And there's no objective way because N50 is not a great measure for transcriptome assembly.

Use Velvet/Oases or Trinity, then map back and check coverage. Also ensure you have a reasonable N50. Use GATK's pipeline and BaseRecalibrate/VariantRecalibrate until you see a AnalyzeCovariate run with a good score match to the empirical quality values. Even so, you cannot be a 100% sure.

Lemme know if you have any specific questions. Cheers!

ADD COMMENTlink written 4.8 years ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1094 users visited in the last hour