paired-end bam clip bases outside insert range
3.3 years ago

Hello!

I have paired-end data with adapters which contain random sequences (UMIs). The problem is the 3' read trimming when the insert size is smaller than the read.

I would like to perform read trimming (or soft-clipping) by using the mapping coordinates of the read mate. (i.e. when Read1 End is greater than Read2 Start trim/soft-clip the 3' end bases exceeding the real insert).

    -------------|-->
<--|-------------
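
The situation in the diagram can be detected directly from the mapping coordinates of the two mates. The following is only a minimal Python sketch, not an existing tool: the function name `overhang_bases` and the 0-based, half-open coordinate convention are assumptions made for illustration. It counts reference positions, so indels inside the overhang are ignored.

```python
def overhang_bases(fwd_start, fwd_end, rev_start, rev_end):
    """Return (fwd_clip, rev_clip) for a properly oriented pair.

    Coordinates are assumed to be 0-based, half-open reference intervals:
    the forward read covers [fwd_start, fwd_end) and the reverse-strand
    mate covers [rev_start, rev_end).
    """
    # The forward read's 3' end must not run past the mate's 5' end,
    # which is the rightmost coordinate of the reverse read.
    fwd_clip = max(0, fwd_end - rev_end)
    # The reverse read's 3' end (its leftmost coordinate) must not start
    # before the forward read's 5' end.
    rev_clip = max(0, fwd_start - rev_start)
    return fwd_clip, rev_clip


# Same layout as the diagram: each read overhangs its mate by 4 bases.
print(overhang_bases(104, 154, 100, 150))  # (4, 4)
```

A pair whose insert is longer than the read length gives `(0, 0)`, so nothing would be clipped.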

I could write code for this but I was wondering if anyone knows an already existing tool which enables trimming reads by coordinate from a bam file.

Thanks in advance,

Pau

bam short insert size trimming by coordinate • 1.6k views

You should be able to use bbmerge.sh, which is part of the BBMap suite, with the following option set to t (it is f, i.e. false, by default). Note that you will end up with a merged representation of the read pair, though.

trimnonoverlapping=f (tno) Trim all non-overlapping portions, leaving only
                     consensus sequence.

Thanks GenoMax!

But I really need both reads clipped separately...


I think this program by @Pierre may fit the bill then: "A: Remove Soft Clipped Bases", or the second answer in the same thread.

You could also take the merged read from bbmerge, compare it against the individual reads in the R1/R2 files, and clip those with custom code (you will need to reverse-complement the R2 read).
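
Because R2 is read from the opposite strand, it has to be reverse-complemented before it can be compared with the merged sequence. A minimal Python sketch of that step follows; the function name is an assumption for illustration, and only A, C, G, T and N (upper or lower case) are handled.

```python
# Translation table for the bases expected in FASTQ reads.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACGTN"))  # NACGT
```

The matching quality string only needs to be reversed, not complemented.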


I don't think my program is suitable for this task (it just removes the clipped bases)


I was wondering if anyone knows an already existing tool which enables trimming reads by coordinate from a bam file.

I was going by this request in the original post.

Then the clipped BAM can be converted back to fastq after using your program?


Pierre's tool removes clipped bases... My bases would first need to be clipped (by insert size/coordinate).


Thanks for your suggestions GenoMax!

Using bbmerge as an intermediate step, and then pulling out the trimmed sequence by comparing sequences with the original fastqs, could be an option. However, having to read and align the data twice (merge step + aligning the sequences to the original fastqs) might not be very efficient.

That's why I thought about using the mapping info against the reference genome... From the bam file I could first subset the small fraction of reads which fall into the problematic scenario (by using insert size). Then, if no tool is available to trim by insert size, I could iterate through this problematic subset and discard the last [read length - insert size] bases of each read... Finally, I could merge the modified subset back into the original bam.
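
The per-read step of that plan, soft-clipping the last [read length - insert size] bases, might look roughly like the Python sketch below. It is not an existing tool. The names `bases_to_clip` and `softclip_3prime`, the representation of a CIGAR as a list of `(op, length)` tuples, and the absence of hard clips are all assumptions made for illustration. It also assumes the insert size passed in is the true fragment length. Reading and writing the BAM records is left out.

```python
def bases_to_clip(read_len, insert_size):
    """Number of 3' bases lying outside the insert (0 if the read fits)."""
    return max(0, read_len - insert_size)


def softclip_3prime(cigar, pos, n_clip, is_reverse):
    """Soft-clip n_clip read bases from the 3' end of an alignment.

    cigar: list of (op, length) tuples in reference order, no hard clips.
    pos:   leftmost mapped reference position.
    Returns (new_cigar, new_pos).
    """
    if n_clip <= 0:
        return list(cigar), pos
    # For a reverse-strand read the 3' end is the left end of the stored
    # alignment, so flip the list and always work on its tail.
    ops = list(reversed(cigar)) if is_reverse else list(cigar)
    clipped = 0      # read bases converted to soft clip
    ref_removed = 0  # reference bases no longer covered by the alignment
    # Consume operations until enough bases are clipped and the alignment
    # ends on a match-type operation again.
    while ops and (clipped < n_clip or ops[-1][0] not in "M=X"):
        op, length = ops.pop()
        if op in "M=X":
            take = min(length, n_clip - clipped)
            if take < length:
                ops.append((op, length - take))
            clipped += take
            ref_removed += take
        elif op in "IS":
            # Insertions and existing soft clips are absorbed whole.
            clipped += length
        elif op in "DN":
            ref_removed += length
    if clipped:
        ops.append(("S", clipped))
    if is_reverse:
        ops.reverse()
        pos += ref_removed  # the leftmost aligned base has moved right
    return ops, pos


n = bases_to_clip(read_len=50, insert_size=46)
print(softclip_3prime([("M", 50)], 100, n, is_reverse=False))
# ([('M', 46), ('S', 4)], 100)
print(softclip_3prime([("M", 50)], 100, n, is_reverse=True))
# ([('S', 4), ('M', 46)], 104)
```

When the position of a reverse read changes, the mate-position and template-length fields of both records would also need updating before the subset is merged back.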

3.3 years ago

I wrote http://lindenb.github.io/jvarkit/Biostar480685.html . The input must be sorted by query name, using samtools sort -n or samtools collate.

samtools collate -O input.bam | java -jar dist/biostar480685.jar

Hi Pierre,

Sorry for answering so late. I've been busy with other things and hadn't seen it until now.

Thanks a lot for your effort!

I'm sorry, but it does not really work as expected. I have now looked at it carefully, and it soft-clips many more bases than expected. In the image below you can see the modified bam, with the original one beneath it...

[Image not preserved: modified bam shown above the original bam]
