Question: Add soft clipping to a bam file
0
gravatar for Lee Katz
5.6 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

Hi, does anyone have a way of adding soft clipping to a bam file?  I want to see what happens if I add 5bp of soft clipping to each read, to see if it removes false-positive SNPs.  If it works, I want to automate it.

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Lee Katz3.0k
2
gravatar for Pierre Lindenbaum
5.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

GATK ClipReads : https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_readutils_ClipReads.php

 

"This tool provides simple, powerful read clipping capabilities that allow you to remove low quality strings of bases, sections of reads, and reads containing user-provided sequences.

There are three options for clipping (quality, position and sequence), which can be used alone or in combination. In addition, you can also specify a clipping representation, which determines exactly how ClipReads applies clips to the reads (soft clips, writing Q0 base quality scores, etc.). Please note that you MUST specify at least one of the three clipping options, and specifying a clipping representation is not sufficient. If you do not specify a clipping option, the program will run but it will not do anything to your reads."

ADD COMMENTlink written 5.6 years ago by Pierre Lindenbaum131k

Thank you!  I'll see if I can use this since I'm not totally used to GATK (although I should be!)

ADD REPLYlink written 5.6 years ago by Lee Katz3.0k
0
gravatar for Lee Katz
5.6 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

Would this work?  I just made this ad-hoc one-liner to break down the cigar string and then replace the first and last 5bp with "S" for soft padding, and then built up the string again.  I am just not sure if it is the correct way to do it and if it will work in all cases.

samtools view sample1.fastq.gz-reference.sorted.bam | perl -lane 'BEGIN{$soft="S"x5;} $str=""; while($F[5]=~/(\d+)(\D+)/g){$str.=$2 x $1;} $str=$soft . substr($str,5,-5) . $soft; $prevNt=substr($str,0,1); $cigar=""; $count=1; for($i=1;$i<length($str);$i++){$thisNt=substr($str,$i,1); if($thisNt eq $prevNt){$count++}else{$cigar.="$count$prevNt"; $count=1; $prevNt=$thisNt;}} $cigar.="$count$prevNt"; $F[5]=$cigar; print join("\t",@F);' 
ADD COMMENTlink written 5.6 years ago by Lee Katz3.0k
1

Simply modifying the cigar string and clipping the read sequence won't be enough. You will have to also shift the position/positions of the reads and its mate. You may have to also edit other tags such as  "NM:" tag because the clipped alignment may be harboring a mismatch or an indel. The more the number of alignment related tags  the more complicated it would be. Changing cigar string can be easy for simple alignments for example where you don't see insertion or deletion or mismatches in the beginning or end of reads but there may be several alignments for which your code may not work correctly. Downstream variant callers will throw an error if they find an alignment (CIGAR) which is not compatible with values in different alignment related tags in the BAM file. I think you should follow Pierre's advice on using GATK to simple clean such reads. Another approach would be to perform the cleaning at the mpileup level. It would be much easier to clean nucleotides that are present towards the end or beginning of read as they are marked using ^ and $ character in the pileup format. But pileup only marks the first and the last nucleotides for a read. Hope it helps. 

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Ashutosh Pandey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1255 users visited in the last hour