weird insert size post trimming
4
3
Entering edit mode
5.5 years ago

I am analyzing rna seq data from illumina stranded protocol 126bp -PE , sequences have nextera adapters using cutadapt I trimmed all the sequences but now I have reads with different lengths, also my insert size form picard has weird peaks, find below has any one experienced this before?

EDIT

Apologies I did little more digging around the pipeline and found out that flexbar was used to remove adapters

and the command used was

log file leeHom

Total reads :231670526  100%
Merged (trimming) 22238730      9.59929%
Merged (overlap) 0      0%
Kept PE/SR 93263449     40.2569%
Trimmed SR 0    0%
Failed Key 0    0%

RNA-Seq • 4.8k views
0
Entering edit mode

Please give the full command for cutadapt. I saw this kind of odd shape before, but it turned out that adapters were improperly trimmed. What are the Nextera sequences that you trimmed for?

0
Entering edit mode

Agreed, this graph indicates 2 things:

1) incomplete adapter trimming 2) incorrect insert size calculation

The gap is present because reads with only a few (say, under 10 or so) adapter bases are not getting trimmed. Then when they map and appear to overlap the insert size is not calculated correctly. I suggest using a different method for both trimming and insert-size calculation.

0
Entering edit mode

updated post, with more details

0
Entering edit mode

Also be sure that the adapter file is correct. CTGTCTCTTATACACATCTCCGAGCCCACGAGAC is what you have to trim for in case that the Nextera adapter was used for library prep (on both strands, same adapter sequence).

0
Entering edit mode

I have edited my post, looks like i have two diffrent adapters and not the same one

0
Entering edit mode

They have been trimmed using these adapters and flux bar command

0
Entering edit mode

Any chance you might post part of the data somewhere on an ftp or dropbox?

0
Entering edit mode

Sorry I wish I could..

0
Entering edit mode

could you try to run leeHom on it and post the plot for size of the insert?

0
Entering edit mode

I am not sure what leeHOM means? is that a tool?

0
Entering edit mode

leeHom is a read merging tool. Insert sizes can be calculated by mapping or merging.

0
Entering edit mode

it trims and merges using a Bayesian algorithm. Could you try it and plot the results?

0
Entering edit mode

Well installing it on our cluster would be pain... any other way out?

0
Entering edit mode

running it on your local account?

0
Entering edit mode

looks like its not an easy tool to install, not able to install it on cluster or on my local account make: *** [leeHom.o] Error 1

1
Entering edit mode

I just ran:

git clone --recursive https://github.com/grenaud/leeHom.git
cd leeHom/
cd bamtools/
mkdir build/ && cd build/ && cmake .. && make -j 5
cd ../..
make
src/leeHomMulti


the last command ran fine.

1
Entering edit mode

wow...I have no idea why when I pulled git repository this did not work! now it works...

0
Entering edit mode

could you give it a try and redo the plot you showed?

0
Entering edit mode

sure i will , would like to double check what I used , command please let me know if i need to add something else src/leeHomMulti -t 10 -f AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -s AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -fq1sample_R1.fq -fq2 sample_R2.fq --log sample_log.txt

1
Entering edit mode

use -fqo [put output file prefix],

I suggest you run once with --ancientdna and once without, just for fun.

0
Entering edit mode

0
Entering edit mode

but what about short ones? did you truncate the figure? like 100bp.

0
Entering edit mode

No i did not truncate any reads, just used the command i posted > new fq reads> hisat> picard insert size metrics

0
Entering edit mode

did you map the single end and paired end? there should be a .fq.gz file. https://github.com/grenaud/leeHom#paired-end-mode

0
Entering edit mode

I just mapped paired end.._r1.fq.gz and _r2.fq.gz

0
Entering edit mode

could you map the .fq.gz as single end and merge the BAM files?

1
Entering edit mode

0
Entering edit mode

because the reads with less than 126 bp are found there.

0
Entering edit mode

I don't think picard can handle single end reads for insert size for plots

0
Entering edit mode

samtools view mapped.bam |awk 'function abs(x){return ((x < 0.0) ? -x : x)} {if(and($2,5)==0){print length($10);}else{ if(and($2,66)==66){ print abs($9); }} }'

0
Entering edit mode

awk: line 2: function and never defined awk: line 2: function and never defined

0
Entering edit mode

get pysam and run: python insertSize.py in.bam

#!/usr/bin/python

import pysam
import sys;

bamInputFile  = pysam.Samfile(sys.argv[1], "rb");

continue;
else:

bamInputFile.close();

0
Entering edit mode

would you like me to email it, its a pretty huge file.

0
Entering edit mode

sure my gmail: gabriel [dot here] reno

2
Entering edit mode
5.5 years ago
Gabriel R. ★ 2.9k

If you try leeHom (https://grenaud.github.io/leeHom/), it can trim adapters and merge overlapping using a Bayesian algorithm so no cutoffs are required. It will produce 3 types of files to map:

1. [PREFIX].fq.gz Sequences that were trimmed and merged confidently by leeHom
2. [PREFIX]_r1.fq.gz Forward reads that were neither trimmed nor merged confidently by leeHom
3. [PREFIX]_r2.fq.gz Reverse reads that were neither trimmed nor merged confidently by leeHo

Map the .fq.gz as single reads and r1/r2 as pairs. To plot the insert size, get pysam and run the following script as such: python insertSize.py in.bam

#!/usr/bin/python

import pysam
import sys;

bamInputFile  = pysam.Samfile(sys.argv[1], "rb");

continue;
else:

bamInputFile.close();

1
Entering edit mode
5.5 years ago
Len Trigg ★ 1.6k

My guess is that you have read-through occurring, so that while the adaptors are being trimmed off one read, the other read in the pair is reading through into the adaptor. When only a small amount of read-through has occurred (i.e fragment length just under 129bp), this is not being identified by the trimming tool, and so during alignment, the aligner will put some mismatches at the end of the alignments. When enough read through has occurred, this is either being trimmed, or the aligner is soft-clipping the mismatching end of the read. Try using a trimming tool that identifies read-through.

0
Entering edit mode

any tools that you would reckon for this?

0
Entering edit mode

0
Entering edit mode
5.5 years ago
chen ★ 2.4k

This looks like a bug.

Since your data is PE, you can use AfterQC to trim your adapters, which is based on a different algorithm and doesn't require you to input adapter sequences.

AferQC is available at: https://github.com/OpenGene/AfterQC

0
Entering edit mode
5.5 years ago
ATpoint 68k

Ok, it seems that you have the TruSeq Ribo adapter. You can doublecheck with this file from Illumina, containing the common adapters. In any case, it is not Nextera. Retrim with the proper adapter sequence and the issue should be solved.