Question: weird insert size post trimming
badribio wrote:

I am analyzing RNA-seq data from an Illumina stranded protocol, 126 bp paired-end. The sequences have Nextera adapters, which I trimmed with cutadapt, so now I have reads of different lengths, and the insert-size distribution from Picard also has weird peaks (see below). Has anyone experienced this before?

[Picard insert-size histogram]

EDIT

Apologies, I did a little more digging around the pipeline and found out that flexbar was used to remove the adapters.

The command used was:

flexbar --adapters adapters.file --adapter-trim-end RIGHT --length-dist --threads 12 --adapter-min-overlap 7 --max-uncalled 250 --min-read-length 25

Adapters used:

Read_1_Sequencing_Primer_3_to_5  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Read_2_Sequencing_Primer_3_to_5  AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Read 1: [plot]

Read 2: [plot]

Using the leeHom tool:

[insert-size plot after leeHom]

leeHom log file:

Total reads               231670526   100%
Merged (trimming)          22238730   9.59929%
Merged (overlap)                  0   0%
Kept PE/SR                 93263449   40.2569%
Trimmed SR                        0   0%
Adapter dimers/chimeras      533995   0.230498%
Failed Key                        0   0%
Tags: rna-seq

Please give the full command for cutadapt. I have seen this kind of odd shape before, but it turned out that the adapters were improperly trimmed. What are the Nextera sequences that you trimmed for?

written by ATpoint

Agreed, this graph indicates two things:

1) incomplete adapter trimming
2) incorrect insert-size calculation

The gap is present because reads with only a few (say, under 10 or so) adapter bases are not getting trimmed. When those reads map and appear to overlap, the insert size is not calculated correctly. I suggest using a different method for both trimming and insert-size calculation.

written by Brian Bushnell
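
No specific tool is named above; as one hedged example, a re-trim with BBDuk (from BBTools) could look like the sketch below, with placeholder file names and the adapters.fa reference shipped with BBTools. The tbo/tpe options additionally trim by pair overlap, which catches the short adapter stubs described above:

# minimal BBDuk sketch (file names and thread count are placeholders)
# ktrim=r removes 3' adapters by k-mer matching; tbo/tpe also trim by pair overlap
bbduk.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz \
  out1=trimmed_R1.fq.gz out2=trimmed_R2.fq.gz \
  ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=25 t=12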

Updated the post with more details.

written by badribio

Also make sure that the adapter file is correct. CTGTCTCTTATACACATCTCCGAGCCCACGAGAC is what you have to trim for if the Nextera adapter was used for library prep (same adapter sequence on both strands).

written by ATpoint
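
As an editorial illustration of that suggestion, a minimal paired-end cutadapt sketch with the Nextera sequence named above would look like this (file names are placeholders; later in the thread the adapters turn out to be TruSeq rather than Nextera, so this only shows the syntax):

# hypothetical paired-end cutadapt call; -a/-A give the 3' adapters for R1/R2
cutadapt -a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -A CTGTCTCTTATACACATCTCCGAGCCCACGAGAC \
  --minimum-length 25 -o trimmed_R1.fq.gz -p trimmed_R2.fq.gz sample_R1.fq.gz sample_R2.fq.gz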

I have edited my post; it looks like I have two different adapters, not the same one.

written by badribio

The reads have been trimmed using these adapters and the flexbar command above.

written by badribio

Any chance you might post part of the data somewhere, e.g. on an FTP server or Dropbox?

written by Gabriel R.

Sorry I wish I could..

written by badribio

Could you try running leeHom on it and post the insert-size plot?

written by Gabriel R.

I am not sure what leeHom is. Is that a tool?

written by badribio

leeHom is a read merging tool. Insert sizes can be calculated by mapping or merging.

written by Brian Bushnell
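
As a sketch of the merging route (file names assumed, not taken from the thread), BBMerge can write an overlap-based insert-size histogram directly from the FASTQ files, with no mapping involved:

# hypothetical BBMerge call; ihist= writes an insert-size histogram estimated from read-pair overlap
bbmerge.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz ihist=insert_size_hist.txt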

It trims and merges using a Bayesian algorithm. Could you try it and plot the results?

written by Gabriel R.

Well, installing it on our cluster would be a pain... any other way out?

written by badribio

How about running it on your local account?

written by Gabriel R.

It looks like it's not an easy tool to install; I was not able to install it on the cluster or on my local account:

make: *** [leeHom.o] Error 1

written by badribio

I just ran:

git clone --recursive https://github.com/grenaud/leeHom.git
cd leeHom/
cd bamtools/
mkdir build/ && cd build/ && cmake .. && make -j 5
cd ../..
make
src/leeHomMulti

The last command ran fine.

written by Gabriel R.

Wow... I have no idea why this did not work when I pulled the git repository before! Now it works.

written by badribio

Could you give it a try and redo the plot you showed?

written by Gabriel R.

Sure, I will. I would like to double-check the command I used; please let me know if I need to add something else:

src/leeHomMulti -t 10 -f AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -s AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -fq1 sample_R1.fq -fq2 sample_R2.fq --log sample_log.txt

written by badribio

Use -fqo [output file prefix].

I suggest you run once with --ancientdna and once without, just for fun.

written by Gabriel R.
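
Putting the flags from this thread together, a full invocation could look roughly like this (the output prefix sample_trimmed is a placeholder):

# leeHom run with the adapters from the post; writes sample_trimmed.fq.gz,
# sample_trimmed_r1.fq.gz and sample_trimmed_r2.fq.gz; add --ancientdna for the second run suggested above
src/leeHomMulti -t 10 \
  -f AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  -s AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -fq1 sample_R1.fq -fq2 sample_R2.fq \
  -fqo sample_trimmed --log sample_log.txt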

I have uploaded the figure. What is leeHom capturing that other tools are not? Could you please help me understand?

written by badribio

But what about the short ones? Did you truncate the figure, e.g. at 100 bp?

written by Gabriel R.

No, I did not truncate any reads. I just used the command I posted > new fastq reads > HISAT > Picard insert-size metrics.

written by badribio

Did you map both the single-end and the paired-end output? There should be a .fq.gz file: https://github.com/grenaud/leeHom#paired-end-mode

written by Gabriel R.

I just mapped the paired-end output: _r1.fq.gz and _r2.fq.gz.

written by badribio

Could you map the .fq.gz as single-end and merge the BAM files?

written by Gabriel R.

Yes, I will upload it in some time.

written by badribio

Because the reads shorter than 126 bp are found there.

written by Gabriel R.

I don't think Picard can handle single-end reads for insert-size plots.

written by badribio

How about:

samtools view mapped.bam | awk 'function abs(x){return ((x < 0.0) ? -x : x)} {if(and($2,5)==0){print length($10);}else{ if(and($2,66)==66){ print abs($9); }} }'

written by Gabriel R.

awk: line 2: function and never defined awk: line 2: function and never defined

written by badribio
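
The and() function is a GNU awk extension, so this error usually means the awk in use (often mawk) lacks the bitwise built-ins. One workaround, assuming gawk is installed, is to call it explicitly:

# same one-liner, invoking gawk explicitly so the bitwise and() builtin is available
samtools view mapped.bam | gawk 'function abs(x){return ((x < 0.0) ? -x : x)} {if(and($2,5)==0){print length($10);}else{ if(and($2,66)==66){ print abs($9); }} }'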

Get pysam and run: python3 insertSize.py in.bam

#!/usr/bin/env python3

import sys
import pysam

# open the BAM for reading
bamInputFile = pysam.AlignmentFile(sys.argv[1], "rb")

for read in bamInputFile:
    # skip unmapped reads
    if read.is_unmapped:
        continue
    if read.is_paired:
        # for proper pairs, print the absolute template (insert) length once per pair
        if read.is_proper_pair and read.is_read1:
            print(abs(read.template_length))
    else:
        # for single-end reads, print the aligned read length instead
        print(len(read.query_alignment_sequence))

bamInputFile.close()

written by Gabriel R.

Would you like me to email it? It's a pretty huge file.

written by badribio

Sure, my Gmail: gabriel [dot here] reno

written by Gabriel R.
Gabriel R. (Danmarks Tekniske Universitet) wrote:

If you try leeHom (https://grenaud.github.io/leeHom/), it can trim adapters and merge overlapping reads using a Bayesian algorithm, so no cutoffs are required. It will produce three types of files to map:

  1. [PREFIX].fq.gz Sequences that were trimmed and merged confidently by leeHom
  2. [PREFIX]_r1.fq.gz Forward reads that were neither trimmed nor merged confidently by leeHom
  3. [PREFIX]_r2.fq.gz Reverse reads that were neither trimmed nor merged confidently by leeHom

Map the .fq.gz as single-end reads and r1/r2 as pairs. To plot the insert sizes, get pysam and run the following script: python3 insertSize.py in.bam

#!/usr/bin/env python3

import sys
import pysam

# open the BAM for reading
bamInputFile = pysam.AlignmentFile(sys.argv[1], "rb")

for read in bamInputFile:
    # skip unmapped reads
    if read.is_unmapped:
        continue
    if read.is_paired:
        # for proper pairs, print the absolute template (insert) length once per pair
        if read.is_proper_pair and read.is_read1:
            print(abs(read.template_length))
    else:
        # for single-end reads, print the aligned read length instead
        print(len(read.query_alignment_sequence))

bamInputFile.close()
written by Gabriel R.
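
A rough end-to-end sketch of the mapping strategy described in this answer, using HISAT2 (which the poster already uses) and placeholder index and file names:

# map the confidently merged reads as single-end and the unmerged pairs as paired-end,
# then combine the coordinate-sorted BAMs and feed the result to the insert-size script
hisat2 -p 10 -x genome_index -U sample_trimmed.fq.gz | samtools sort -o merged_SE.bam
hisat2 -p 10 -x genome_index -1 sample_trimmed_r1.fq.gz -2 sample_trimmed_r2.fq.gz | samtools sort -o unmerged_PE.bam
samtools merge all_reads.bam merged_SE.bam unmerged_PE.bam
python3 insertSize.py all_reads.bam > insert_sizes.txt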
chen (OpenGene) wrote:

This looks like a bug.

Since your data is PE, you can use AfterQC to trim your adapters, which is based on a different algorithm and doesn't require you to input adapter sequences.

AfterQC is available at: https://github.com/OpenGene/AfterQC

written by chen
Len Trigg (New Zealand) wrote:

My guess is that you have read-through occurring, so that while the adaptors are being trimmed off one read, the other read in the pair is reading through into the adaptor. When only a small amount of read-through has occurred (i.e. fragment length just under 129 bp), it is not being identified by the trimming tool, and so during alignment the aligner will put some mismatches at the end of the alignments. When enough read-through has occurred, it is either being trimmed, or the aligner is soft-clipping the mismatching end of the read. Try using a trimming tool that identifies read-through.

written by Len Trigg

Any tools that you would recommend for this?

written by badribio

You can try AfterQC; it trims reads based on overlap analysis and may address your problem.

written by chen
ATpoint wrote:

OK, it seems that you have the TruSeq Ribo adapter. You can double-check with this file from Illumina, which contains the common adapters. In any case, it is not Nextera. Re-trim with the proper adapter sequence and the issue should be solved.

written by ATpoint
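
One way such a re-trim could look, reusing the two adapter sequences already posted in the question in the same read 1 / read 2 assignment the poster used (file names are placeholders):

# hypothetical cutadapt re-trim with the TruSeq-style read 1 / read 2 sequences from the question;
# cutadapt's default minimum overlap of 3 removes shorter adapter remnants than flexbar's --adapter-min-overlap 7 did
cutadapt -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  -A AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  --minimum-length 25 \
  -o retrimmed_R1.fq.gz -p retrimmed_R2.fq.gz sample_R1.fq.gz sample_R2.fq.gz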