Question: How to extract coordinate of hard and soft clipped long reads ?
0
gravatar for BioGeek
7 months ago by
BioGeek140
BioGeek140 wrote:

I wanted to extract the genome coordinate of all those location where longreads are soft/hard clipped.

Genome
                          ] [
__________________________________________________________________________________
--- ----------------------- --------------------------------- --------------------
--------------------------- -----------------------------------------
---- ---------------------- -------------------- --------------------------------
---------------------------- ------------------------------  -----------
  ------------------------ --------------------------------------------------------
 ------- ------------------ ------------------------- ----------------------------

                           ^

the arrow (^) where all the long reads are soft/hard clipped. Interestingly these region have few bases overlaps ??

bam mapping longreads clipped • 304 views
ADD COMMENTlink modified 7 months ago by Carambakaracho1.1k • written 7 months ago by BioGeek140
0
gravatar for Vitis
7 months ago by
Vitis2.1k
New York
Vitis2.1k wrote:

Probably by reading and analyzing the CIGAR strings, finding all operations before "H" and "S" and calculating the clipping locations based on mapping locations and operation lengths before clipping.

For example, the CIGAR for one read is 120M31S, the mapping position for the read is 12345, then the clipping location might be in the area close to 12345 + 120 = 12465.

ADD COMMENTlink written 7 months ago by Vitis2.1k
0
gravatar for Pierre Lindenbaum
7 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

using Bioalcidae http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html : stream the bam, for each read extract the unclipped/start|end and clipped start/end. If there is a clip, print the BED record(s).

$ java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag()).forEach(R->{int x1=R.getUnclippedStart(),x2=R.getStart();if(x1<x2) println(R.getContig()+"\t"+(x1-1)+"\t"+(x2)+"\t"+R.getReadName());x1=R.getEnd();x2=R.getUnclippedEnd();if(x1<x2) println(R.getContig()+"\t"+(x1-1)+"\t"+(x2)+"\t"+R.getReadName());});'  src/test/resources/S1.bam 

RF01    567 576 RF01_508_1107_4:0:0_0:0:0_af
RF01    1679    1682    RF01_1679_2192_3:0:0_3:0:0_a8
RF01    1886    1889    RF01_1821_2315_3:0:0_2:0:0_98
RF01    1989    1993    RF01_1445_1994_3:0:0_3:0:0_5e
RF02    407 410 RF02_98_411_1:0:0_3:0:0_2e
RF02    446 450 RF02_382_836_3:0:0_0:0:0_82
RF02    1811    1821    RF02_1811_2313_5:0:0_2:0:0_74
RF02    2220    2224    RF02_1737_2289_3:0:0_4:0:0_5e
RF03    739 741 RF03_327_808_1:0:0_3:0:0_1e
RF03    813 822 RF03_309_823_1:0:0_4:0:0_82
(...)
ADD COMMENTlink modified 6 months ago • written 7 months ago by Pierre Lindenbaum120k

RF01 567 576 RF01_508_1107_4:0:0_0:0:0_af

I think these are longreads location! Is this possible to get reference chromosome/scaffold/contig location ?

ADD REPLYlink written 4 months ago by BioGeek140
0
gravatar for Carambakaracho
7 months ago by
Carambakaracho1.1k
Switzerland/Basel
Carambakaracho1.1k wrote:

In case you're really only interested in the start position, this might do the trick

perl -nwe 'next if /^@/;@a=split;if($a[5]~=/^\d*H*(\d+)S/{$a[3]+$1; print "$a[0]\t$a[3]"}' <samtools view my.bam >out.txt
ADD COMMENTlink modified 7 months ago • written 7 months ago by Carambakaracho1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1980 users visited in the last hour