Question

Extracting only soft/hard clipped reads from a bam file

19 months ago

jcn ▴ 30

Hello all!

I am working on some data but need a little bit of help with a bit of an unusual task. We are looking at where lentiviral DNA has inserted itself in our host genome, and to do this we need to find the boundaries of the junctions between viral/human DNA. I need to be able to look at only the soft/hard clipped reads within my bam file. Does anyone know a way to do this with awk or some other tool? Thanks!

What I have tried:

samtools view -h sample.bam | awk '$6 ~ /H|S/{print}' | samtools view -bS > sample.clipsOnly.bam

(above solution does not work, and is taken from this post: How to remove reads with hard/soft clipping along with its mate?)

I should also add that filtering for only the hard clipped reads works. This is the error I get when filtering for hard and soft or just soft:

[E::sam_parse1] no SQ lines present in the header

[W::sam_read1] Parse error at line 2

samtools view: error reading file "-"

bam reads NGS clipped data_analysis • 1.4k views

updated 4 months ago by Pierre Lindenbaum 161k • written 19 months ago by jcn ▴ 30

score 2 · Answer 1 · 2023-11-30

Hi

I know this is more than a year old, but I'll leave an answer here in case it helps someone. Your awk command prints only the alignments that contain a hard clip or a soft clip, which means you lose the header, and that is what samtools complains about when you pipe the awk output into it.

One solution is to extract the header first, concatenate it with the awk output, and convert the result to BAM:

samtools view -H in.bam > header.txt
samtools view in.bam | awk '$6 ~ /H|S/{print}' > clipped.txt
cat header.txt clipped.txt | samtools view -bS - > clipped.bam

Or, if you want to do it in one command, have awk pass the header lines through as well:

samtools view -h in.bam | awk '/^@/ || $6 ~ /H|S/' | samtools view -bS - > clipped.bam
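
To see the header problem and the fix without needing a BAM file, here is a minimal sketch that runs the awk filters on a hand-made SAM file. The file names toy.sam and toy.clipped.sam and the three reads are made up for illustration; only printf and awk are assumed to be available.

```shell
# Build a tiny tab-separated SAM file: one header line and three reads (hypothetical data).
printf '@SQ\tSN:chr1\tLN:1000\n' > toy.sam
printf 'r1\t0\tchr1\t10\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> toy.sam
printf 'r2\t0\tchr1\t20\t60\t4S6M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> toy.sam
printf 'r3\t2048\tchr1\t30\t60\t6M4H\t*\t0\t0\tACGTAC\tIIIIII\n' >> toy.sam

# Filtering on column 6 alone treats header lines like reads: the @SQ line
# has no 6th field, so it is dropped. Count the header lines that survive.
awk -F'\t' '$6 ~ /H|S/' toy.sam | awk '/^@/ {n++} END {print n+0}'

# Pass header lines (starting with @) through unconditionally, and keep
# reads whose CIGAR (column 6) contains an S or H operation.
awk -F'\t' '/^@/ || $6 ~ /[HS]/' toy.sam > toy.clipped.sam

cat toy.clipped.sam
```

The first awk command reports no surviving header lines, while toy.clipped.sam keeps the @SQ line plus reads r2 and r3. On real data the same pattern would sit between the two samtools view calls.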

score 1 · Answer 2 · 2022-09-28

19 months ago

GenoMax 141k

You are missing a - before the > operator that is included in the answer you quoted above.

samtools view -h sample.bam | awk '$6 !~ /H|S/{print}' | samtools view -bS - > sample.noclips.bam

score 0 · Answer 3 · 2022-09-29

Have you considered using an SV breakpoint calling tool (with a reference that includes both host and viral sequence), or a viral integration detection tool? Both approaches should work if your integration site is clonally expanded. If not, there is other software designed for this purpose, although it usually requires a special sequencing protocol (e.g. HIV integration site detection).

To actually answer your question, you can use the ExtractSVReads tool in gridss to do this directly on the BAM file:

java -Xmx4g -cp gridss.jar gridss.ExtractSVReads \
    INPUT=sample.bam \
    OUTPUT=clipped.bam \
    CLIPPED=true \
    SPLIT=true \
    SINGLE_MAPPED_PAIRED=false \
    DISCORDANT_READ_PAIRS=false \
    UNMAPPED_READS=false \
    INDELS=false \
    MIN_CLIP_LENGTH=1 \
    INCLUDE_DUPLICATES=true \
    REFERENCE_SEQUENCE=reference.fasta \
    TMP_DIR=. \
    ASSUME_SORTED=true

score 0 · Answer 4 · 2023-11-30

4 months ago

Pierre Lindenbaum 161k

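This keeps only soft-clipped reads, which should be fine in most cases: by default bwa mem soft-clips primary alignments and hard-clips only supplementary alignments (flagged as secondary when run with -M):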

samtools view -O BAM -o out.bam -e 'sclen>0' in.bam

see http://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS
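
To make explicit what a soft-clip length test such as sclen>0 selects, here is a rough awk sketch that sums the S operations of a CIGAR string. The helper name soft_clip_len is hypothetical and is not part of samtools; samtools computes this value internally, and the sketch only illustrates the idea.

```shell
# Hypothetical helper: print the total length of all S (soft clip)
# operations in the CIGAR string passed as the first argument.
soft_clip_len() {
  printf '%s\n' "$1" | awk '{
    total = 0
    cigar = $0
    # Walk the CIGAR one "<number><operation>" token at a time.
    while (match(cigar, /[0-9]+[MIDNSHP=X]/)) {
      op  = substr(cigar, RSTART + RLENGTH - 1, 1)
      num = substr(cigar, RSTART, RLENGTH - 1)
      if (op == "S") total += num
      cigar = substr(cigar, RSTART + RLENGTH)
    }
    print total
  }'
}

soft_clip_len 4S6M      # soft clip at the start of the read
soft_clip_len 10M       # no clipping
soft_clip_len 3S90M7S   # soft clips at both ends
```

A read passes a soft-clip filter of this kind when the printed value is greater than zero. Changing the tested operation from "S" to "H" would give the analogous count for hard clips.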
