Question

Get Average Insert Size Of Fastq?

1

11.7 years ago

dan79 ▴ 90

Is there a way to get the average insert size from a FASTQ file? Sorry for the uninformative question. I downloaded an SRA file from NCBI and used the included sratoolkit to split it into two FASTQ files. I am trying to do a de novo assembly with these paired-end, strand-specific reads, but the average insert size is a required parameter. Does anyone know how to obtain it from an SRA or FASTQ file?

fastq • 13k views

ADD COMMENT • link updated 11.7 years ago by matted 7.8k • written 11.7 years ago by dan79 ▴ 90

0

Please describe your question so people can help you. I think I understand what you're asking, but without more information it is difficult to answer.

ADD REPLY • link 11.7 years ago by Zev.Kronenberg 12k

0

Edited, thanks.

ADD REPLY • link 11.7 years ago by dan79 ▴ 90

0

You will need to align the reads (both mates of each pair). Then you can find the insert lengths by parsing the SAM/BAM file.
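A minimal sketch of that parsing step, assuming an uncompressed, tab-delimited SAM file in which the aligner has filled in the TLEN column (column 9, the observed template length). The function names and the `max_size` cutoff are illustrative choices, not part of any tool, and aligners differ slightly in how they define TLEN, so treat the result as an estimate.

```python
import statistics

def insert_sizes_from_sam(lines, max_size=10_000):
    """Collect template lengths (SAM column 9, TLEN) from properly paired reads.

    `max_size` is an assumed cutoff to drop implausibly long templates.
    """
    sizes = []
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM records have at least 11 mandatory fields
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        # Keep only properly paired reads (0x2) and skip
        # secondary (0x100) and supplementary (0x800) alignments.
        if not flag & 0x2 or flag & 0x900:
            continue
        # TLEN is positive on the leftmost mate and negative on the other,
        # so keeping positive values counts each pair once.
        if 0 < tlen <= max_size:
            sizes.append(tlen)
    return sizes

def summarize(sizes):
    """Return the count, mean, and median of the collected template lengths."""
    return {
        "n": len(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
    }
```

Usage would look like `with open("aligned.sam") as fh: print(summarize(insert_sizes_from_sam(fh)))`, where `aligned.sam` is a placeholder name. The median is usually the safer number to report, since a few mis-mapped pairs can pull the mean up considerably.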

ADD REPLY • link 11.7 years ago by Zev.Kronenberg 12k

0

Align the reads to a reference genome? This seems counterintuitive, considering the whole point of a de novo assembly is to not need a reference.

ADD REPLY • link 11.7 years ago by dan79 ▴ 90

0

Good point. Sorry. I need to read more carefully. I don't know the answer. I look forward to seeing the best solution.

ADD REPLY • link 11.7 years ago by Zev.Kronenberg 12k

score 4 · Answer 1 · 2012-08-08

Guessing an insert size, assembling, mapping the reads back to the assembly, and then iterating with the improved insert size estimate (from the mappings) is a reasonable choice, and probably about the best you can do. Hopefully you have some rough idea from the library preparation method (the size selection criteria, or whether it's a jumping library).

In fact, Velvet does this automatically (from the 1.1 manual): "If the insert length of a library is unspecified, Velvet will attempt to measure it for you, based on the read-pairs which happen to map onto a common node." As they point out, it's critical to check the reported estimate to make sure it's sane.
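For assemblers that don't estimate this themselves, the guess, assemble, map, re-estimate loop described above can be sketched as follows. This is only an outline under stated assumptions: `assemble` and `measure` are hypothetical callables you would write yourself (wrappers around your assembler and around a read mapper plus an insert-size calculation), and the tolerance and round limit are arbitrary defaults.

```python
def refine_insert_size(initial_guess, assemble, measure, tol=5.0, max_rounds=5):
    """Iteratively refine an insert size estimate.

    assemble(estimate) -> assembly   (hypothetical wrapper around your assembler)
    measure(assembly)  -> float      (hypothetical: map reads back to the assembly
                                      and return the observed insert size)
    Returns (estimate, rounds_used).
    """
    estimate = float(initial_guess)
    for round_no in range(1, max_rounds + 1):
        assembly = assemble(estimate)
        new_estimate = float(measure(assembly))
        # Stop once the estimate changes by no more than `tol` bases.
        if abs(new_estimate - estimate) <= tol:
            return new_estimate, round_no
        estimate = new_estimate
    # Did not converge within max_rounds; return the latest estimate.
    return estimate, max_rounds
```

Whatever the loop settles on, it is still worth comparing the final number against what the library preparation would lead you to expect.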

score 2 · Answer 2 · 2012-08-08

2

11.7 years ago

dfornika ★ 1.1k

I'm going to suggest a lazy, imperfect solution. If this is Illumina (Genome Analyzer, HiSeq, etc.), then the insert size is normally about 300 bp. If your assembler isn't too sensitive to that parameter, try 300 bp as a reasonable guess.

ADD COMMENT • link 11.7 years ago by dfornika ★ 1.1k

0

Haha, well, it's better than nothing. I read that somewhere too, and yes, the sequencer was an Illumina. I already started the job with an insert size of 300. +1

ADD REPLY • link 11.7 years ago by dan79 ▴ 90