Estimaing insert size from a pair-end library using bbmap
0
2
Entering edit mode
2.0 years ago

Hi biostars.

I'm trying to estimate the insert size of a pair-end illumina library. I have limited information about library preparation and sequencing protocol (2x125). To estimate the insert size I used mapped the reads to a genome using bbmap. The reads were processed using trimmomatic (to remove adaptors and only keep reads with size = 125).

java -jar trimmomatic-0.39.jar PE -threads 10 -phred33 R1_001.fastq.gz R2_001.fastq.gz R1.adaptor.cut.fastq.gz R1.unpaired.fastq.gzip R2.adaptor.cut.fastq.gz R2.unpaired.fastq.gzip NexteraPE-PE.fa:2:30:10 MINLEN:125

bbmap.sh ref=cro_genome.fasta in1=R1.pair.fastq.gz in2=R2.pair.fastq.gz -ihist=ihist.txt reads=5000000


From the result file:

Mean 427.278

Median 295

Mode 234

STDev 536.271

The mean value seems to correspond to expected values. However, I find the std deviation value a little bit too high. It this the typical standard deviation in a pair-end library?

RNA-Seq bbmap • 1.1k views
1
Entering edit mode

I assume you left out the reference option (ref=) in your bbmap.sh command? I suggest that you do the insert size estimation using original data. Use one of the three methods @Brian mentions here.

0
Entering edit mode

Yes I forgot to paste the ref= option. Why use original data instead of data without adaptors? (just for reference, only a small amount of reads pairs are discarded after trimmomatic filters by size.

I seems however that the pairlen=2000 option is important. Here is the reestimation:

Mean 392.938

Median295

Mode 232

STDev 317.279

1
Entering edit mode

Is the reference genome a high-quality, chromosome-level genome? Or is it a draft genome, with hundreds or thousands of contigs? Is the reference genome from the same species as the reads, or a close, but different species? What are the mapping stats?

0
Entering edit mode

Its a draft genome still, with 2090 contigs, from the same species. BBMAP is able to map around 95% of the reads. I have obtained the same values with illumina

0
Entering edit mode

Using the merge option gives a good estimate of insert size (in case of long reads) so having the adapters allows identification of fragments which may have small inserts. pairlen= is the max distance a read pair would allowed to be mapped apart to be considered properly paired. See if tightening that number further improves the stats.

0
Entering edit mode

okay I tested two settups:

Original untrimmed data with merge:

bbmerge.sh in1=R1.untrimmed.fastq.gz in2=R2.untrimmed.fastq.gz reads=2m ihist=ihist.txt


Pairs: 2000000

Joined: 674961 33.748%

Ambiguous: 1323904 66.195%

No Solution: 1135 0.057%

Too Short: 0 0.000%

Avg Insert: 170.3

Standard Deviation: 51.3

Mode: 232

Despite the good metrics only 33% of the reads overlapped. I tried to decrease the pairlen option to 400 with the bbmap above:

Mean 298.182

Median 277

Mode 229

STDev 118.369

PercentOfPairs 98.642

1
Entering edit mode

I would not worry about the STDev since you don't have a good reference with a small number of chromosomes. Your insert sizes otherwise should be good with the mapping protocol you did above. I suggest that you go with that value i.e. mean of around 400 bp.

0
Entering edit mode

Hi again. It seems I forgot one important detail in the question. The data comes from RNA-seq data. I believe since there could be introns, the min length should be decreased. Or maybe should I used an entirely different tool. Sorry about that it has been a long day

0
Entering edit mode

Min length for what? You are doing fine. No need to be sorry.

0
Entering edit mode

I meant the pairlen= option.

I've checked this link here http://seqanswers.com/forums/showpost.php?p=158099&postcount=2 and I ran the the bbmap command as follows:

bbmap.sh ref=cro_genome.fasta in1=R1.pair.fastq.gz in2=R2.pair.fastq.gz -ihist=ihist.txt reads=1M maxindel=200000 pairlen=500


I'm not sure if running with both those options makes sense in this case.

I also aligned the same sample to my genome using hisat2 and used samtools stats:

hisat2 -q --dta -p -x cro_index -1 R1_cut_paired.fastq.gz -2 R2_cut_paired.fastq.gz -S idio_1.sam
samtools view -Su idio_1.sam | samtools sort -o idio_1_sorted.bam
samtools stats


In both cases, the estimated average insert was 315 and the std was 145. I've decided to accept these values for the following analysis for now