Determining Paired-end or Mate-pair insert length in De Novo Sequencing
7.7 years ago

Hello,

I have a few questions regarding mate-pair and paired end sequences:

  1. Does one have to know the exact insert length for paired-end or mate-pair reads to be useful? (I'm not sure "insert length" is the proper term for mate pairs, but unless I'm mistaken it is the same concept as insert length in paired ends.)
  2. From what I have read, people usually obtain the insert length by aligning the paired-end reads to a reference genome. Doesn't that defeat the purpose of paired-end sequencing, since to align to a reference genome you generally have to treat each pair as two single reads? Or do you also have some knowledge of the *approximate* insert length (in which case I can see the usefulness)?
  3. How is this done in de novo sequencing, when you don't even have a reference sequence?

Thanks a bunch!

De novo paired end mate pair • 2.7k views
7.7 years ago
Biogeek ▴ 470

Hey,

You can use a nice tool by someone who is on here, Brian Bushnell (if I remember correctly). It's called BBMerge, and if you google it you can find the syntax for calculating the paired-end insert size of your reads. As far as I know it doesn't need a reference genome. I had to do this a few weeks ago when I was trying out SOAPtrans, which asked for an insert size.
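To give an idea of how an insert size can be estimated without a reference, here is a minimal Python sketch of the general overlap idea: if the two reads of a pair overlap, the fragment length is the sum of the read lengths minus the overlap. This is not BBMerge's actual algorithm. The function names, the `min_overlap` and `max_mismatch_frac` parameters, and the assumption that read 2 is given as sequenced (reverse-complement orientation) are all made up for illustration, and the sketch ignores inserts shorter than a read.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def insert_size_from_overlap(
    read1: str,
    read2: str,
    min_overlap: int = 10,            # hypothetical threshold
    max_mismatch_frac: float = 0.05,  # hypothetical tolerance
) -> Optional[int]:
    """Estimate the fragment length from the overlap of a read pair.

    Returns None when no acceptable overlap is found, which is what
    happens when the insert is longer than both reads combined.
    """
    mate = reverse_complement(read2)  # bring read 2 onto read 1's strand
    longest = min(len(read1), len(mate))
    # Try the longest overlap first: suffix of read 1 vs. prefix of the mate.
    for overlap in range(longest, min_overlap - 1, -1):
        suffix = read1[-overlap:]
        prefix = mate[:overlap]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_frac * overlap:
            return len(read1) + len(mate) - overlap
    return None
```

Running this over many pairs and taking a histogram of the returned values would give the insert size distribution. It only works for libraries where most pairs overlap, so it would not be expected to help with long-insert mate-pair libraries.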

Hope that helps you.


Hi,

Thanks! I'll make sure to check it out.

Regards.

7.7 years ago
Asaf 10k
  1. There is no exact size; it's a distribution. Knowing this distribution is useful for deciding how many N's to insert when scaffolding.
  2. The de novo contig assembly is usually done without using the pairing information: each read of a pair is treated as a single-end read to generate contigs. You should get contigs long enough that both ends of a fragment map to the same contig, which lets you estimate the insert size (even with mate-pair).
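The estimate in point 2 can be sketched roughly as follows in Python, assuming you already have alignment coordinates for each pair from some mapper. The `MappedPair` record and `estimate_insert_distribution` function are hypothetical names for illustration, and using the median with a MAD-based spread is just one robust choice so that a few chimeric or misaligned pairs do not distort the result.

```python
import statistics
from typing import Iterable, NamedTuple, Tuple


class MappedPair(NamedTuple):
    """Hypothetical record: where each read of a pair aligned."""
    contig1: str
    start1: int  # 0-based leftmost aligned position of read 1
    end1: int    # exclusive end of read 1's alignment
    contig2: str
    start2: int
    end2: int


def estimate_insert_distribution(pairs: Iterable[MappedPair]) -> Tuple[float, float]:
    """Return (median, robust standard deviation) of the insert size.

    Only pairs with both reads on the same contig are informative;
    pairs spanning two contigs are skipped here.
    """
    sizes = [
        max(p.end1, p.end2) - min(p.start1, p.start2)
        for p in pairs
        if p.contig1 == p.contig2
    ]
    if len(sizes) < 2:
        raise ValueError("need at least two pairs mapped within one contig")
    median = statistics.median(sizes)
    # Median absolute deviation, scaled to match the standard deviation
    # of a normal distribution.
    mad = statistics.median([abs(s - median) for s in sizes])
    return median, 1.4826 * mad
```

Note that contigs shorter than the insert size cannot contain both ends of a fragment, so the estimate is biased towards short inserts unless the contigs are comfortably longer than the library's expected size.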

Hi. Thanks a bunch. In response to (1), I wonder how paired ends can help when aligning to repetitive regions of the reference genome, for example when one end of the pair falls in a non-repetitive region and the other in a repetitive one. On one hand, precision in aligning within repetitive regions might not seem that important, but if a SNP recurs at a specific position within a repetitive region, then precise alignment would be needed to determine exactly where the mutation is. Thanks once again!

7.7 years ago
Charles Plessy ★ 2.9k

1) Does one have to know the exact length of the insert for the paired end or mate-pair sequences to be useful?

It is important to have an estimate, so that the aligner can distinguish between _"proper"_ pairs that are likely to truly represent the molecule they originate from, and artefacts where one mate is misaligned, usually very far from the other mate. How far counts as "too far" depends on the method. For instance, in transcriptome sequencing, some proper pairs are expected to align hundreds of kilobases apart on the genome (the mates can fall in exons separated by long introns), and short read aligners such as BWA need to know that.
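As a rough illustration of that distinction, here is a toy Python check. It is not how BWA or any other aligner actually decides; the function name, the inputs and the `max_deviations` cutoff are assumptions made up for this sketch.

```python
def is_proper_pair(
    same_reference: bool,
    inward_facing: bool,
    observed_size: int,
    mean_insert: float,
    sd_insert: float,
    max_deviations: float = 4.0,  # hypothetical cutoff
) -> bool:
    """Toy rule: a pair is 'proper' if both mates hit the same reference
    sequence, face each other, and span a distance close to the expected
    insert size."""
    if not (same_reference and inward_facing):
        return False
    return abs(observed_size - mean_insert) <= max_deviations * sd_insert
```

For transcriptome data aligned to a genome, a fixed window like this would reject many genuine pairs, so the allowed distance would have to be much larger or handled by a splice-aware rule.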

2) ... do you also have some knowledge of the approximate length of the insert?

First, as explained above, the distribution of observed lengths after alignment will differ according to the kind of sequencing method (transcriptome, genome, ...). In addition, for genome sequencing, the sequencing templates can be prepared in such a way (for example by size-selecting the fragments) that the distance after alignment should fall within a given range.

3) How is this done in de novo sequencing...

De novo assembly typically takes advantage of the prior information on what the distance between the mates should be, in order to sort the contigs, predict gap size, etc.
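The gap size prediction can be sketched like this in Python. It is a simplified illustration, not the method of any particular scaffolder: it assumes the two contigs are already ordered and oriented (left contig, then right contig, same strand), that read 1 maps forward on the left contig and read 2 maps reverse on the right contig, and that the mean insert size is known. The names `estimate_gap` and `gap_as_ns` are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def estimate_gap(
    links: Sequence[Tuple[int, int]],
    left_contig_length: int,
    mean_insert: float,
) -> float:
    """Estimate the gap between two neighbouring contigs.

    Each link is (start_on_left, end_on_right): the 0-based start of
    read 1 on the left contig and the exclusive end of read 2 on the
    right contig. For one pair:
        insert = (left_contig_length - start_on_left) + gap + end_on_right
    so the gap is what remains after subtracting both covered parts.
    """
    if not links:
        raise ValueError("need at least one pair linking the two contigs")
    estimates = [
        mean_insert - (left_contig_length - start) - end
        for start, end in links
    ]
    return statistics.median(estimates)


def gap_as_ns(gap: float, min_ns: int = 1) -> str:
    """Render the gap as a run of N's, with at least min_ns of them."""
    return "N" * max(min_ns, round(gap))
```

A negative estimate suggests that the two contigs actually overlap, and the more linking pairs there are, the less the spread of the insert size distribution affects the result.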
