Question: Distance to reads ends
0
gravatar for Max Ivon
5.0 years ago by
Max Ivon120
Russian Federation
Max Ivon120 wrote:

Hello everyone!
Im using SAMTOOLS, GATK UG and HC for SNP calling from target sequencing paired-end data. In contrast to all the others, UG produces massive amount of mutations near read starts/ends. Quality trimming actually is not ably to handle this problem completely since there is sometimes aligments errors near read ends. I didnt find any usefull information about root mean square distance to reads ends or something like this that GATK keeps in information fields in VCF. So im interesting if there is any opportunity to skip mutations falling near read start/end using GATK or how can i avoid this problem. And also is there any ready solution to calculate mean distance or root mean square distance to the nearest read end from single base position based on sam/bam file?

Thanks in advance

sequencing snp samtools gatk • 1.5k views
ADD COMMENTlink modified 5.0 years ago by Pierre Lindenbaum129k • written 5.0 years ago by Max Ivon120

not clear to me: one variant in a VCF != one mismatch in one read .
 

ADD REPLYlink written 5.0 years ago by Pierre Lindenbaum129k

It depends on the parameters used for variant calling (dont know for sure if is it possible for GATK to report each mismatch or not, but samtools definitely can). But I understand that generally its true. Im talking about systematic error (a lot of mismatches/alignemt errors) near the read ends (usually first b.p. of the read), which are reported by GATK UG as high quality mutations, but they surely are not. And the main quastion is: is there any ready solution to calculate the mean root square distance to the closest reads ends from selected base pair position based on the sam/bam file. I think this problem is rather sufficient in target sequencing and there should be ready solutions to solve it.

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Max Ivon120
2
gravatar for Pierre Lindenbaum
5.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

using https://github.com/lindenb/jvarkit/wiki/SamFixCigar to get the 'X' in the cigar string, and https://github.com/lindenb/jvarkit/wiki/BioAlcidae to print the closest distance between a 'X' and the ends of the reads.

the javascript part:

while(iter.hasNext())
    {
    var rec = iter.next();
    if(rec.getReadUnmappedFlag()) continue;
    var cigar = rec.getCigar();
    if( cigar == null ) continue;

     var varPos=-1;
    var readpos=0;
    for(var i=0;i< cigar.numCigarElements();++i)
        {
        var ce = cigar.getCigarElement(i);
        var op = ce.getOperator();
        if( ! op.consumesReferenceBases()) continue;
        for(var x=0;x < ce.getLength();++x)
            {
            if( op.name() == "X")
                {
                   var distance = Math.min(readpos, (rec.getReadLength()-1)-readpos);
                   if( varPos ==-1 ||  distance< varPos )
                       {
                       varPos= distance;
                       }
                }
            readpos++;
            }
        }
    if(varPos!=-1)
        {
        out.println(varPos);
        }
    }

invoke:

 java -jar dist-1.133/samfixcigar.jar -r ref.fa input.bam | java -jar dist-1.133/bioalcidae.jar -f script.js -F sam

 

output:

13.0
2.0
12.0
13.0
24.0
18.0
9.0
2.0
17.0
17.0
2.0
11.0
1.0
6.0
7.0
1.0
1.0
9.0
14.0
22.0
0.0
1.0
2.0
0.0

 

 

ADD COMMENTlink written 5.0 years ago by Pierre Lindenbaum129k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 690 users visited in the last hour