Target fragment size versus final insert size
1
0
Entering edit mode
5.6 years ago

I'm getting a bit confused after having run an aligner program on my RNA-seq data (BWA for Burrows-Wheeler Aligner) in order to get an estimate of the mate inner distance for use in downstream analysis. The total RNA libraries were prepared with universal Illumina adapters, it was a 150 bp PE sequencing. BWA gave me an average insert size of ~250 +/- 60 bp, and the sequencing company gave me a target fragment size value of 394. Adapters are 34 bp long.

Here is how I understand things so far, please correct me if I'm wrong.

1) Does the fragment size target value of 394 includes the 3' and 5' adapters? I would think so.

2) Do the read length include adpaters? I would say no, because quality checks on the clean data (ie adapters trimmed) give a read length of exactly 150 bp (shouldn't it be 150 - 2x34 = 92 bp after the adapters are removed if they are 34 bp long and included in the read length?)

3) Given the two previous points, that would give an insert size of 394 - 2x34 = 326. Which is different from the BWA estimate. Is it usual to have a large difference between a target fragment size and the actual size of the fragments that are sequenced? And if I have an insert size of ~ 250 bp and PE reads of 150 bp, that means that the left and right reads overlap in the middle?

RNA-Seq alignment fragment size • 4.9k views
0
Entering edit mode

1) yes, what they gave you is probably the result of a Bioanalyzer gel electrophoresis, which measures the full length of the DNA fragment that is pipetted onto the flowcell: That is, the actual genomic fragment + the Illumina adapters + P5/P7 sequences that are required for flow cell binding (total adapter content should therefore be somewhat 120bp). Check this picture for an idea of how this looks.

2) read length is simple the number of base calls that the sequencer performs, so it sequences 150bp into the fragment. in case your fragment (the actual genomic fragment) is shorter than that, also parts of the adapter will be sequenced, requiring adapter trimming. You can check that with e.g. FastQC, followed by trimming with e.g. bbdup, skewer, cutadapt.

3) the insert size that the aligners report are solely based on the distance between the two mate pair alignments towards the reference genomes. adapter sizes and read lengths play no role here (given that you properly trimmed all adapter content).

0
Entering edit mode
5.6 years ago
mastal511 ★ 2.1k

1) You should ask the sequencing company to specify whether the fragment length they quoted includes adapters or not.

3) Yes, that would mean that some of your reads should overlap in the middle of the fragment.

1
Entering edit mode

I generally concur with mastal, but I guess I'm a little stricter. I would completely ignore what the company told you and trust the results of alignment. Adapters are NOT part of insert size. But, yes, it is very typical to have lab-estimated insert sizes and accurately-measured insert sizes by alignment differ. When they differ, the lab-estimated insert size is wrong.

2
Entering edit mode

Incidentally, using the BBMap package, you can quantify your insert-size distribution in an alignment-free way with BBMerge:

bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt


This is only useful for overlapping reads, which in your case, appear to be the majority. If you have enough RAM, you can do this:

bbmerge-auto.sh in1=r1.fq in2=r2.fq ihist=ihist.txt prefilter=2 rem extend2=100 k=62


...which will also quantify the insert-size distribution of non-overlapping reads, if they have sufficient coverage. If you want to do it via mapping rather than overlap, you could do this:

bbmap.sh ref=ref.fasta in1=r1.fq in2=r2.fq ihist=ihist.txt


-Brian

0
Entering edit mode

Hi Brian,

Thanks, I haven't explore other functions yet but indeed BBmap is doing a good job to give insert size, fast and easy. I ran a alignment-free version to compare it with the genome-guided one. The former gave me an insert size of 219 +/- 40 while the genome-guided one gives a value of 269 +/- 555. When using a reference genome, the SD is really large because of large inserts (8000 to 30 000+ bp) that don't show up when using the alignment-free method (max insert size is much smaller than 30 000, ie 291). FYI, BWA (with reference genome) gave me an insert size of 249 +/- 62.

After running trials with TopHat, the --mate-inner-distance parameter doesn't seem to change the output a lot (even using drastically different values, ie positive and negative ones with either large or small SDs), so I'm not too worried. But I'd like to know what value you'd trust, and why they are so different. Thoughts? Maybe that's relevant regarding the reference quality: I'm working on the Argentine ant, whose draft genome was published in 2011 (with no update since then).

Thanks, Ant1

0
Entering edit mode

This is a tricky question, because you are using RNA-seq, which has introns. BBMerge has a maximum of ~291 bp insert size for 2x150bp libraries because it will only calculate the value from reads that overlap; however, the number you get is not affected by introns. BBMap is affected by introns but does not require the reads to overlap. Depending on the genome and coverage, you would probably get the most accurate results with the other command I mentioned:

bbmerge-auto.sh in1=r1.fq in2=r2.fq ihist=ihist.txt prefilter=2 rem extend2=100 k=62


This is slower and uses more memory than normal BBMerge, but it does not require the reads to overlap. Also, I've noticed that even though the average is affected by whether or not reads overlap, the mode generally is not (it probably won't be, in this case).

But for a succinct answer... I would trust BBMap over BBMerge for the average, but to reduce the impact of introns, I suggest you rerun it like this:

bbmap.sh ref=ref.fasta in1=r1.fq in2=r2.fq ihist=ihist.txt pairlen=1000


That will eliminate the really long inserts due to introns that are bloating the average and SD. On the other hand, you could also simply use BBMap instead of TopHat in the first place, since it is also a splice-aware aligner, and is generally faster and more accurate than TopHat :)

1
Entering edit mode

Hi Brian,

Just letting you know that I ended up having to use another aligner than BBmap, as the BAM output doesn't have the right CIGAR alignment formatting to be taken as input by Trinity (rna-seq). Have a look at this thread for more details: https://groups.google.com/forum/#!topic/trinityrnaseq-users/P54ubKYb7MQ

Cheers, Antoine

0
Entering edit mode

Ah, thanks for letting me know; I was unaware that Trinity does not support the current SAM specification. You can make BBMap print Trinity-compliant cigar strings with the flag "sam=1.3" (which reverts it back to the older Sam 1.3 specification). Alternately, if you already have bam files, you can convert them like this:

reformat.sh in=mapped.bam out=converted.bam sam=1.3

-Brian

0
Entering edit mode

I ran BBMap (with BBwrap) to align my PE reads to a reference genome, which gave me a single sorted bam file as an output. Then I'm using reformat.sh as above to make Trinity work with it. However, a warning message got my attention: "Input is being processed as unpaired" while I originally had paired end reads. Will this affect how reformat.sh works with the data?

0
Entering edit mode

Oh, don't worry about that... reformat doesn't care in this case whether the reads are paired or not. They will be processed independently - so if you discard one read, its mate will not be discarded, but you're not discarding anything so it doesn't matter.

0
Entering edit mode

The BBmerge-auto command with your parameters gave an average insert size of 245 +/- 62. The BBmap with pairlen=1000 gave a value of 248 +/- 72. Similar values, and close to the BWA estimate. Sounds good to me! Thanks Brian, I will look into the rest of the package ;-)