Extracting only soft/hard clipped reads from a bam file
4
1
Entering edit mode
18 months ago
jcn ▴ 30

Hello all!

I am working on some data but need a little bit of help with a bit of an unusual task. We are looking at where lentiviral DNA has inserted itself in our host genome, and to do this we need to find the boundaries of the junctions between viral/human DNA. I need to be able to look at only the soft/hard clipped reads within my bam file. Does anyone know a way to do this with awk or some other tool? Thanks! What I have tried:

samtools view -h sample.bam | awk '$6 ~ /H|S/{print}' | samtools view -bS > sample.clipsOnly.bam 

(above solution does not work, and is taken from this post: How to remove reads with hard/soft clipping along with its mate?)

I should also add that filtering for only the hard clipped reads works. This is the error I get when filtering for hard and soft or just soft:

[E::sam_parse1] no SQ lines present in the header

[W::sam_read1] Parse error at line 2

samtools view: error reading file "-"
bam reads NGS clipped data_analysis • 1.4k views
ADD COMMENT
2
Entering edit mode
4 months ago
horeyezonr ▴ 20

Hi

I know it was more than a year, but I'll just leave the answer here maybe it will help someone. With the modified code of yours, the script will only print the alignments that have hardclip or softclip, that means you'll loose the header, which is what samtools complains about after you pipe tha awk result to it.

The solution is to either get the header first, concatenate it with the awk output and export as BAM:

samtools view -H in.bam > header.txt
samtools view in.bam | awk '$6 ~ /H|S/{print} > clipped.txt
cat header.txt clipped.txt | samtools view -bS > clipped.bam

Or if you want to solve it with one command, you can just modify the command to include header as well:

samtools view -h in.sam | awk '$6 ~ /H|S/{print}; $1 ~ /@/{print}' | samtools view -bS - > clipped.bam
ADD COMMENT
1
Entering edit mode
18 months ago
GenoMax 141k

You are missing a - before the > operator that is included in the answer you quoted above.

samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam
ADD COMMENT
0
Entering edit mode
18 months ago
d-cameron ★ 2.9k

Have you considered using a SV breakpoint calling tool (using a reference that includes both host and viral sequence), or a viral integration detection tool? Both of the above approaches should work if your integration site is clonally expanded. If not, there's other software designed for this purpose, although they usually require special sequencing protocol (e.g HIV integration site detection).

To actually answer you question, you can use the ExtractSVReads tool within gridss to do this natively within bam:

java -Xmx4g -cp gridss.jar gridss.ExtractSVReads \
    INPUT=sample.bam
    OUTPUT=clipped.bam
    CLIPPED=true \
    SPLIT=true \
    SINGLE_MAPPED_PAIRED=false \
    DISCORDANT_READ_PAIRS=false \
    UNMAPPED_READS=false \
    INDELS=false \
    MIN_CLIP_LENGTH=1 \
    INCLUDE_DUPLICATES=true \
    REFERENCE_SEQUENCE=reference.fasta \
    TMP_DIR=. \
    ASSUME_SORTED=true
ADD COMMENT
0
Entering edit mode

If you want multithreaded I/O the command-line is a pretty annoying (since you're running a tool within gridss, not the whole pipeline): you need to replace the java -Xmx4g -cp gridss.jar bit with:

java -Xmx4g -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true  -Dsamjdk.buffer_size=4194304 -Dsamjdk.async_io_read_threads=$(nproc) -cp gridss.jar
ADD REPLY
0
Entering edit mode
4 months ago

only soft clipped reads (but should be ok in most cases, as in bwa only secondary reads are hard clipped):

samtools view -O BAM -o out.bam -e 'sclen>0' in.bam

see http://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS

ADD COMMENT

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6