Question: Trinity Assembly Filtering And Paralogs In Hybrid Capture Experiments
Asked 5.9 years ago by compatcg:

(cross-posted on SEQanswers: http://seqanswers.com/forums/showthread.php?t=32017)

Greetings Everyone,

I am working with a group that does population genetics on non-model species (the closest reference genome is usually at least 5%-10% divergent). We are just starting to move into NGS with the following general approach:

  1. Gather transcriptomic data through RNA-seq or EST databases
  2. Using transcriptome data, design hybrid capture bait sets (e.g. MYcroarray MYbaits) for several thousand transcripts
  3. Enrich exons and flanking intronic regions using the above bait set for hundreds of individuals and sequence (HiSeq and/or MiSeq)
  4. SNP calling
  5. Pop gen analyses

For a particular experiment, I have RNA-seq data from three individuals from which I want to design hybrid capture baits. I've de novo assembled the transcriptomes of these individuals, and I'm now picking transcripts to use for the baits.

My question for everyone: what can I do at this stage to reduce or eliminate the enrichment of paralogous genes? Does anyone have a filtering strategy for this stage that he or she could share (perhaps based on BLAST E-values or sequence similarity)?
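To make the similarity-based idea concrete, here is a rough sketch of what I have in mind. It assumes an all-vs-all blastn of the candidate transcripts against themselves with tabular output (-outfmt 6, default columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore). The function name, the thresholds (E-value, percent identity, alignment length), and the toy hit lines are all made up for illustration and would need tuning on real data.

```python
def component_of(seq_id):
    """Return the Trinity component prefix, e.g. 'comp6_c0_seq2' -> 'comp6'."""
    return seq_id.split("_", 1)[0]


def flag_possible_paralogs(blast_lines, max_evalue=1e-10,
                           min_identity=80.0, min_length=100):
    """Return the set of components with a strong hit to a DIFFERENT component.

    Thresholds are hypothetical defaults, not recommended values.
    """
    flagged = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])   # percent identity
        length = int(fields[3])     # alignment length
        evalue = float(fields[10])  # E-value
        # Hits within one component (self hits, isoforms, alleles) are ignored.
        if component_of(qseqid) == component_of(sseqid):
            continue
        if (evalue <= max_evalue and pident >= min_identity
                and length >= min_length):
            flagged.add(component_of(qseqid))
            flagged.add(component_of(sseqid))
    return flagged


# Toy tabular hits (invented for illustration only).
toy_hits = [
    "comp5_c0_seq1\tcomp5_c0_seq1\t100.00\t800\t0\t0\t1\t800\t1\t800\t0.0\t1478",
    "comp5_c0_seq1\tcomp9_c0_seq1\t88.50\t400\t46\t0\t1\t400\t1\t400\t1e-120\t450",
    "comp7_c0_seq1\tcomp8_c0_seq1\t75.00\t60\t15\t0\t1\t60\t1\t60\t0.002\t40",
]
flagged = flag_possible_paralogs(toy_hits)
print(sorted(flagged))
```

Components in the flagged set would be dropped before bait design. Whether within-component hits should also count against a transcript is a judgment call; this sketch ignores them.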

One thing I have considered is taking the final Trinity.fasta files and simply removing all components that have more than one contig/sequence. So for the hypothetical dataset below I would keep component 5 and throw out components 2 and 6.

>comp2_c0_seq1 len=3 path=[354:0-2]
CAT
>comp2_c1_seq1 len=6 path=[972:0-5]
ATTCAC
>comp5_c0_seq1 len=8 path=[629:0-7]
GGGCTTGA
>comp6_c0_seq1 len=5 path=[449:0-4]
CCAAC
>comp6_c0_seq2 len=8 path=[225:0-7]
GATACGGG

Is this a potentially valid approach? My concern is that, unless I'm mistaken, multiple sequences within a single component may represent allelic variation as well as paralogs and isoforms, so this filter might also discard variable loci and reduce the number of SNPs recovered after the pulldown experiment.
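For reference, a minimal sketch of the filter described above, assuming headers follow the compN_cM_seqK pattern shown in the example and that everything sharing the same compN prefix counts as one component. The function names are my own, and the sequences are the hypothetical ones from this post.

```python
import re
from collections import defaultdict

# Matches Trinity-style IDs such as "comp6_c0_seq2" and captures "comp6".
COMPONENT_RE = re.compile(r"^(comp\d+)_c\d+_seq\d+")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def single_contig_components(records):
    """Keep only records whose component contains exactly one sequence."""
    by_component = defaultdict(list)
    for header, seq in records:
        match = COMPONENT_RE.match(header)
        if match is None:
            continue  # skip headers that do not follow the assumed pattern
        by_component[match.group(1)].append((header, seq))
    return [recs[0] for recs in by_component.values() if len(recs) == 1]


example = """\
>comp2_c0_seq1 len=3 path=[354:0-2]
CAT
>comp2_c1_seq1 len=6 path=[972:0-5]
ATTCAC
>comp5_c0_seq1 len=8 path=[629:0-7]
GGGCTTGA
>comp6_c0_seq1 len=5 path=[449:0-4]
CCAAC
>comp6_c0_seq2 len=8 path=[225:0-7]
GATACGGG
"""

kept = single_contig_components(read_fasta(example.splitlines()))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

On the example this keeps only comp5, matching the outcome described above. Note that it cannot tell alleles from paralogs or isoforms; it simply drops every component with more than one sequence.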

If the goal is simply to find a bunch of markers to get a bunch of SNPs for population genetic analyses (while reducing/eliminating paralogs), what would you do?

Thanks so much for your help in advance!

Tags: trinity, rna-seq, assembly