Question: BBmerge and gaining insert size / SOAP question
Biogeek340 wrote, 2.7 years ago:

Hi guys,

Just looking for a bit of advice.

I have concatenated forward-read and concatenated reverse-read paired-end files as input for SOAPtrans de novo. I have been using Trinity to date and was unaware of the insert-size setting that comes with the SOAP config file.

I have a few questions:

Using the concatenated read files (previously cleaned with Trimmomatic), I obtained an average insert size of 124 bp (123.56 bp) with the BBTools script BBMerge (as outlined at http://seqanswers.com/forums/showthread.php?t=43906 ) for my Illumina HiSeq 2500 data. Is this a normal insert size? I believe the TruSeq chemistry kit was used.
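
The estimate above comes from BBMerge's insert histogram. A minimal sketch of that workflow follows; the `ihist=` flag is BBMerge's real interface, but the file names and the tiny histogram below are made up for illustration:

```shell
# Real BBTools invocation (shown commented; substitute your concatenated files):
# bbmerge.sh in1=R1.fq.gz in2=R2.fq.gz ihist=ihist.txt

# The histogram lists insert size and read-pair count per line.
# A mock histogram stands in for real bbmerge.sh output here:
cat > ihist.txt <<'EOF'
#InsertSize	Count
120	10
124	20
128	10
EOF

# Weighted mean insert size from the histogram (skipping header lines):
awk -F'\t' '!/^#/ {sum += $1 * $2; n += $2} END {printf "%.2f\n", sum / n}' ihist.txt
```

BBMerge also prints mean/median/mode itself; the awk step just shows where a figure like 123.56 bp comes from.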

Secondly, just to check with people who have more experience than me: I am assuming it doesn't matter that all my forward reads are joined together when calculating the insert size? I read about overlapping of reads. Having concatenated all my Fwd/Rev reads, surely this won't have any effect or cause an erroneous insert-size estimate?

Thirdly, for SOAPtrans, map_len (map length) defaults to 32. While I have changed the other settings to suit my insert size (124 bp) and maximum read length of 88 (trimmed down from 100 bp in Trimmomatic), should I also be changing map length?
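
For context, these settings all live in the SOAP config file. A minimal sketch with the values discussed above (file names are hypothetical; check the SOAPdenovo-Trans manual for your version before relying on it):

```
max_rd_len=88

[LIB]
avg_ins=124
reverse_seq=0
asm_flags=3
map_len=32
q1=reads_R1.fq
q2=reads_R2.fq
```

map_len is the minimum alignment length required when reads are mapped back to contigs in the scaffolding step; with 88 bp reads the default of 32 is commonly left unchanged.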

Thank you.

Tags: de novo, bbtools, soaptrans, bbmerge

For a WGS dataset, an average insert size of 124 bp is small. In any case it is a characteristic of your library, and if real, nothing short of making new libraries is going to fix it.

It does mean, though, that your R1/R2 reads probably overlap for a large fraction of pairs (how long are these reads, 100 bp?). As long as you concatenated the R1 and R2 files in exactly the same order (and the reads within each file were in matching order), the concatenation should be fine. You can always use repair.sh from BBMap to confirm that the reads in the final files are properly paired.
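
The repair.sh check can be sketched as below. The commented line is the real BBTools tool; the mock FASTQ files and the awk ID comparison are a hypothetical standalone illustration of what "in proper order" means (R1 and R2 headers must name the same read at the same position):

```shell
# Real BBTools invocation (shown commented; re-pairs reads and sets aside singletons):
# repair.sh in1=R1.fq in2=R2.fq out1=fixed_R1.fq out2=fixed_R2.fq outs=singletons.fq

# Two tiny mock FASTQ files whose pairs are in matching order:
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n' > R1.fq
printf '@read1/2\nTGCA\n+\nIIII\n@read2/2\nAATT\n+\nIIII\n' > R2.fq

# Extract header lines, strip the /1 and /2 suffixes, and diff;
# no diff output means the files are in sync.
awk 'NR % 4 == 1 {sub(/\/[12]$/, ""); print}' R1.fq > ids1.txt
awk 'NR % 4 == 1 {sub(/\/[12]$/, ""); print}' R2.fq > ids2.txt
diff ids1.txt ids2.txt && echo "pairs in sync"
```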

written 2.7 years ago by genomax65k

Hi,

Yes, I concatenated the forward files in exactly the same order as the reverse files, so they should all be in sync. It was RNA sequencing.

Reads were 101 bp from an Illumina HiSeq 2500 run. After Trimmomatic the reads are now 25-88 bp in size, having removed the hexamer-bias noise at the start of the reads and the adapters.

I have used Trinity to assemble (k-mer 25), and I am planning to run SOAPtrans with k-mer = 39, 59 and 79. Once complete, I wish to apply the tr2aacds (EvidentialGene) pipeline to get the best evidence-supported transcriptome assembly. Would you recommend anything to look out for, or not to use? You say that the overlap may be a concern. Would this mean sticking to a small k-mer size?

Thanks.

written 2.7 years ago by Biogeek340

"after getting rid of the hexamer bias noise at the start of the reads"

You probably threw away some good data. I would suggest going back to the raw data and removing just the adapters.

Even for RNA-seq that insert size is small. How many contigs did you get with Trinity (thousands)? Did you do any checking with BLAST etc. to see how good they were?

written 2.7 years ago by genomax65k

Approximately 1 million contigs; I had 12 samples, each about 25 Mbp of assembly. I usually do redundancy removal afterwards with CD-HIT-EST etc., and ended up with about 60,000 unigenes. A high percentage (80%) had matches. The consensus on trimming seems to have completely reversed: when I was new to all of this, stringent trimming seemed to be the norm, but now little to no trimming seems to be best. Do you think those extra 13 bp per read will be valuable?

written 2.7 years ago by Biogeek340

If you are happy with the results from Trinity, then there is no need to do anything. Is there a reason to try SOAPtrans if the Trinity results were acceptable? Is this a novel genome or something known?

BTW: 13 bp × millions of reads could add a significant chunk of bases to the total data, if you ever end up redoing the analysis.
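
To put a rough number on that, here is the back-of-envelope arithmetic, assuming (hypothetically) 100 million read pairs; the real figure depends on your actual read count:

```shell
# Extra bases recovered by keeping 13 bp on each read of a pair,
# for an assumed 100 million read pairs:
awk 'BEGIN { printf "%.1f Gbp\n", 13 * 2 * 100e6 / 1e9 }'
```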

written 2.7 years ago by genomax65k

genomax, there is a draft genome, but it's highly fragmented. I wanted to get the best out of my data, so I have used Trinity de novo; it works great as always. I also carried out a genome-guided, reference-based assembly with Trinity and plan to merge the two.

Perhaps there is no point in also running other assemblers and then merging, as this will create more variability and noise in the final transcriptome?

written 2.7 years ago by Biogeek340

Tough question to answer. If the assembly in hand is good enough for what you intend to use it for, then perhaps it is best to stop here.

written 2.7 years ago by genomax65k

Hi Genomax.

I've now gone back, despite the pain it's going to cause, to capture more of the raw reads. I read the papers by Hansen (2010) and MacManes (2014) and had a look around to see what everyone else is doing. I think it will suffice to work with the reference-guided and de novo Trinity assemblies merged, as I'm seeing the responses I want in the data. Using multiple assemblers is maybe a bit advanced for this work. I appreciate all your advice; it's shown me a few pitfalls.

written 2.7 years ago by Biogeek340

Merging and concatenating are not the same thing. How does BBMerge figure into this question? You have it in the title of the post, but nothing in the body references it.

written 2.7 years ago by genomax65k

Sorry, edited. Hopefully it's a bit clearer now.

written 2.7 years ago by Biogeek340
Powered by Biostar version 2.3.0