Question: Detecting junctions from reads spanning large indels
0
gravatar for Adrian Pelin
3.1 years ago by
Adrian Pelin2.4k
Canada
Adrian Pelin2.4k wrote:

Hello,

I am mapping my reads with an aligner that allows for large indels (bbmap, other suggestions welcome).

I am than using SamJs to get reads that have large indels >1kb

java -jar ${HOME}/programs/jvarkit/dist/samjs.jar --file get_indel_sam_filter.javascript --samoutputformat SAM -o Fasteris_Copenhagen_PlaquePurified_150903_SND104_B_L007_R1_HXE-3_mapped_Indel-1kb_v2.sam Fasteris_Copenhagen_PlaquePurified_150903_SND104_B_L007_R1_HXE-3_mapped.sort.bam

Java script used, cat get_indel_sam_filter.javascript :

function accept(rec) {
if(rec.getReadUnmappedFlag()) return true; 
var c=rec.getCigar(); if(c==null) return true; 
for(var i=0;i< c.numCigarElements();++i) {
var ce=c.getCigarElement(i);
if(ce.getLength()<1000) continue;
var s=ce.getOperator().name(); 
if( s=="D") return true; 
} return false;
}accept(record)

Now I would like to get a coordinate file with the start and end location of indels. For instance this record:

HWI-D00104:276:C6U65ANXX:7:1207:20052:96123 1:N:0:TTAGGC    0   Copenhagen_M35027   113671  38  89=11532D36=    *   0   0   TTTTTAACAGTTTGAGGTTTTAGATTTTTAGTTACAGAAGTGATATCGAATATTTTATCCAAAAAGAATGAGTAATTAATTGTCTTAGAAACGTCAACAACGAGAAAAAAGATTTAATATTACAG   BBBBBFFFFBFFFFFFFFBFFFFFF/BFF<<FFFB/B/FFFFFF//BFBBBBFFFFFFFFFFBFFFFBB/FBFFFFFFB<FFBFBFF<FF/F/FBFF</B</BFFBFFF/FF/<F////FBB/B7   AM:i:38 NM:i:11532

Has an indel between 113759 and 125292.

I am not sure if there is already a tool that easily finds these. Ideally, the cigar string before and after the indel is an 100% match (like in the example record).

Thanks.

sam dna-seq • 931 views
ADD COMMENTlink modified 3.1 years ago by Pierre Lindenbaum131k • written 3.1 years ago by Adrian Pelin2.4k
2
gravatar for Pierre Lindenbaum
3.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

java -jar dist/bioalcidaejdk -e  'stream().filter(R->!R.getReadUnmappedFlag()).forEach(R->{ int refpos = R.getStart(); for(CigarElement ce:R.getCigar()) { CigarOperator op = ce.getOperator(); int len = ce.getLength(); if(len>=1000 && (op.equals(CigarOperator.N) || op.equals(CigarOperator.D))) { println(R.getContig()+"\t"+(refpos-1)+"\t"+(refpos+len)); } if(op.consumesReferenceBases())  refpos+=len; } }); ' in.bam

later: fixed extra semi-colon after ' if(op.consumesReferenceBases()) ...'

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Pierre Lindenbaum131k

sweet, goes directly to coordinates. Is there any way to get an intermediary sam file to see which reads were counted for generating coordinates?

ADD REPLYlink written 3.1 years ago by Adrian Pelin2.4k
1

using samjdk (faster than samjs)

 java -jar dist/samjdk.jar -e 'if(record.getReadUnmappedFlag()) return false; int refpos = record.getStart(); for(CigarElement ce:record.getCigar()) { CigarOperator op = ce.getOperator(); int len = ce.getLength(); if(len>=1000 && (op.equals(CigarOperator.N) || op.equals(CigarOperator.D))) { return true; } if(op.consumesReferenceBases()) refpos+=len; } return false;' in.bam
ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Pierre Lindenbaum131k

sorry: that was a stupid copy+paste, here is a shorter syntax:

$ java -jar dist/samjdk.jar -e 'return !record.getReadUnmappedFlag() && record.getCigar().getCigarElements().stream().anyMatch(C->C.getLength()>=1000 && (C.getOperator()==CigarOperator.N || C.getOperator()==CigarOperator.D));'  in.bam
ADD REPLYlink written 3.1 years ago by Pierre Lindenbaum131k

oh I see. Both still do the same thing though, correct?

ADD REPLYlink written 3.1 years ago by Adrian Pelin2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2122 users visited in the last hour