Question: detailed explanation of Insert Size
2.0 years ago, SMILE wrote:

Hi all,

I have read through many posts about insert size here, and have seen a very good answer about it.

It is still not clear to me what insert size means. I hope some experts can make it clearer.

As illustrated in a good blog post and a good answer, the "insert size" is the sequence between the adapters (it encompasses R1 and R2 as well as the unknown gap between them), and the ninth column of the SAM file (TLEN) represents the insert size.

However, here are some things I still don't understand.

First, in RNA-seq data, if the alignments are spliced and TLEN reports the distance from the 5'-most to the 3'-most mapped position (if my understanding is right), then TLEN will include any introns, which means it would usually be longer than the "actual insert size"?

Second, if we are mapping DNA sequences, are the fragment length and the "insert size"/"template length" the same?

Third, how does Picard's CollectInsertSizeMetrics actually calculate the insert size distribution of a paired-end library? Does it only use TLEN, or does it exclude possible introns?

Any answer that helps me better understand this concept will be greatly appreciated.

Tags: sequencing, rna-seq, alignment

This is the best illustration for this: A: What is the different between Read and Fragment in RNA-seq?

written 2.0 years ago by genomax

Yes, this is included in the background of my question...

My question is:

First, in RNA-seq data, if the alignments are spliced and TLEN reports the distance from the 5'-most to the 3'-most mapped position (if my understanding is right), then TLEN will include any introns, which means it would usually be longer than the "actual insert size"?

Second, if we are mapping DNA sequences, are the fragment length and the "insert size"/"template length" the same?

Third, how does Picard's CollectInsertSizeMetrics actually calculate the insert size distribution of a paired-end library? Does it only use TLEN, or does it exclude possible introns?

written 2.0 years ago by SMILE

Second, if we are mapping DNA sequences, then the fragment length and "insert size"/"template length" are the same?

Technically, the fragment length will never equal the insert size (if you only consider size in bp), since the fragment includes the insert plus the Illumina adapters. If the DNA fragment does not contain a breakpoint/translocation, it represents a contiguous stretch of DNA in the genome.
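To make the arithmetic concrete, here is a minimal sketch using the definitions from this thread (insert = R1 + gap + R2; library fragment = insert + adapters). The function names and all numbers, including the adapter lengths, are made up for illustration and are not taken from any particular kit or tool.

```python
def insert_size(r1_start, r2_end):
    """Outer distance covered by a read pair on the reference.

    Coordinates are 1-based and inclusive: r1_start is the leftmost
    base of the pair, r2_end is the rightmost base of the pair.
    """
    return r2_end - r1_start + 1


def inner_gap(insert, r1_len, r2_len):
    """Unsequenced bases between the mates (negative if the mates overlap)."""
    return insert - r1_len - r2_len


def library_fragment_length(insert, adapter_5p_len, adapter_3p_len):
    """Insert plus both adapters, i.e. the molecule that goes on the flow cell."""
    return insert + adapter_5p_len + adapter_3p_len


# Hypothetical pair: R1 covers 1001-1100, R2 covers 1201-1300, reads are 2x100 bp.
ins = insert_size(r1_start=1001, r2_end=1300)
gap = inner_gap(ins, r1_len=100, r2_len=100)
# Adapter lengths below are placeholders, not real Illumina values.
frag = library_fragment_length(ins, adapter_5p_len=60, adapter_3p_len=60)
print(ins, gap, frag)
```

For unspliced DNA alignments without structural rearrangements, the `ins` value here is what you would expect to see (in absolute value) in the TLEN column.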

I will let someone else tackle #1 and 3.

written 2.0 years ago by genomax

If you are interested in insert size calculation then use these directions (for BBMap tools).

written 2.0 years ago by genomax

Thank you for your advice, I will give it a try.

written 2.0 years ago by SMILE

2.0 years ago, Devon Ryan (Freiburg, Germany) wrote:

So according to my understanding, the TLEN number will include the possible introns, which means the TLEN would usually be longer than the "actual insert size"?

Yes, the TLEN field won't always be terribly useful in RNAseq. When trying to compute the original fragment sizes it's best to not have spliced fragments. Back when we used tophat2 in our production pipeline, our "insert size estimation" step aligned to the transcriptome to avoid this problem.
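As a rough illustration of why spliced pairs inflate TLEN, the sketch below computes the genomic span of a pair from the mates' positions and CIGAR strings, then subtracts the skipped (N) bases found inside the reads. The helper names are hypothetical, and the correction rests on two loud assumptions: the mates do not both cross the same intron (otherwise it would be subtracted twice), and no intron falls in the unsequenced gap between the mates (which cannot be seen from the CIGARs at all). That second limitation is one reason aligning to the transcriptome is the cleaner approach.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Reference bases covered by an alignment, including skipped (N) regions."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


def skipped_bases(cigar):
    """Total length of N operations (introns, for RNA-seq) in a CIGAR."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op == "N")


def span_and_corrected(pos1, cigar1, pos2, cigar2):
    """Return (genomic span, span minus introns seen inside the reads).

    pos1/pos2 are 1-based leftmost mapped positions, as in the SAM POS column.
    Assumes both mates map to the same reference and do not share an intron.
    """
    start = min(pos1, pos2)
    end = max(pos1 + reference_span(cigar1), pos2 + reference_span(cigar2))
    span = end - start
    corrected = span - skipped_bases(cigar1) - skipped_bases(cigar2)
    return span, corrected


# Hypothetical pair: R1 is spliced over a 2000 bp intron, R2 is unspliced.
span, corrected = span_and_corrected(1000, "50M2000N50M", 3200, "100M")
print(span, corrected)
```

In this toy case the genomic span is 2300 bp while the intron-corrected estimate is 300 bp, which is closer to what the original cDNA fragment would have been.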

Second, if we are mapping DNA sequences, then the fragment length and "insert size"/"template length" are the same?

This will depend a bit on which fragment you're talking about. See the comment from genomax.

How does Picard's CollectInsertSizeMetrics actually calculate the insert size distribution of a paired-end library? Does it only use TLEN, or does it exclude possible introns?

I've never run that tool on RNAseq data, so I'm not sure how useful it would be. I would expect that it's just summarizing the TLEN field, so I'd expect some absurdly high mean values.
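To show what "just summarizing the TLEN field" would look like, here is a toy sketch. It is not Picard's implementation, only an illustration of the expectation above: it takes column 9 from properly paired records, counts each pair once via the mate with positive TLEN, and compares mean and median. The `sam_record` helper and all records are invented for the example.

```python
import statistics


def positive_tlens(sam_lines):
    """Collect TLEN (column 9) once per properly paired template."""
    tlens = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        # 0x2 = properly paired; the leftmost mate carries the positive TLEN.
        if flag & 0x2 and tlen > 0:
            tlens.append(tlen)
    return tlens


def sam_record(name, flag, pos, cigar, pnext, tlen):
    """Build a minimal tab-separated SAM line (toy data, SEQ/QUAL omitted)."""
    cols = [name, flag, "chr1", pos, 60, cigar, "=", pnext, tlen, "*", "*"]
    return "\t".join(str(c) for c in cols)


toy_sam = [
    "@HD\tVN:1.6",
    sam_record("a", 99, 100, "100M", 300, 300),
    sam_record("a", 147, 300, "100M", 100, -300),
    sam_record("b", 99, 500, "100M", 720, 320),
    sam_record("b", 147, 720, "100M", 500, -320),
    # Pair "c" has a mate spliced over a 2000 bp intron, so TLEN is inflated.
    sam_record("c", 99, 1000, "50M2000N50M", 3200, 2300),
    sam_record("c", 147, 3200, "100M", 1000, -2300),
]

tlens = positive_tlens(toy_sam)
mean_tlen = statistics.mean(tlens)
median_tlen = statistics.median(tlens)
print(tlens, mean_tlen, median_tlen)
```

A single spliced pair is enough to pull the mean far above the two unspliced pairs, while the median stays near the unspliced values, so with RNA-seq data a median-based summary is likely the more informative number to look at.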

written 2.0 years ago by Devon Ryan

Thank you for your clear explanation!

written 2.0 years ago by SMILE