Question

Distance to reads ends

0

Entering edit mode

8.7 years ago

Max Ivon ▴ 130

Hello everyone!
Im using SAMTOOLS, GATK UG and HC for SNP calling from target sequencing paired-end data. In contrast to all the others, UG produces massive amount of mutations near read starts/ends. Quality trimming actually is not ably to handle this problem completely since there is sometimes aligments errors near read ends. I didnt find any usefull information about root mean square distance to reads ends or something like this that GATK keeps in information fields in VCF. So im interesting if there is any opportunity to skip mutations falling near read start/end using GATK or how can i avoid this problem. And also is there any ready solution to calculate mean distance or root mean square distance to the nearest read end from single base position based on sam/bam file?

Thanks in advance

samtools GATK sequencing SNP • 2.5k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by Max Ivon ▴ 130

0

Entering edit mode

Not clear to me: one variant in a VCF != one mismatch in one read .

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

It depends on the parameters used for variant calling (dont know for sure if is it possible for GATK to report each mismatch or not, but samtools definitely can). But I understand that generally its true. Im talking about systematic error (a lot of mismatches/alignemt errors) near the read ends (usually first b.p. of the read), which are reported by GATK UG as high quality mutations, but they surely are not. And the main quastion is: is there any ready solution to calculate the mean root square distance to the closest reads ends from selected base pair position based on the sam/bam file. I think this problem is rather sufficient in target sequencing and there should be ready solutions to solve it.

ADD REPLY • link 8.7 years ago by Max Ivon ▴ 130

Ram · Answer 1 · 2015-08-03

Using https://github.com/lindenb/jvarkit/wiki/SamFixCigar to get the 'X' in the cigar string, and https://github.com/lindenb/jvarkit/wiki/BioAlcidae to print the closest distance between a 'X' and the ends of the reads.

The javascript part:

while(iter.hasNext())
    {
    var rec = iter.next();
    if(rec.getReadUnmappedFlag()) continue;
    var cigar = rec.getCigar();
    if( cigar == null ) continue;

     var varPos=-1;
    var readpos=0;
    for(var i=0;i< cigar.numCigarElements();++i)
        {
        var ce = cigar.getCigarElement(i);
        var op = ce.getOperator();
        if( ! op.consumesReferenceBases()) continue;
        for(var x=0;x < ce.getLength();++x)
            {
            if( op.name() == "X")
                {
                   var distance = Math.min(readpos, (rec.getReadLength()-1)-readpos);
                   if( varPos ==-1 ||  distance< varPos )
                       {
                       varPos= distance;
                       }
                }
            readpos++;
            }
        }
    if(varPos!=-1)
        {
        out.println(varPos);
        }
    }

invoke:

java -jar dist-1.133/samfixcigar.jar -r ref.fa input.bam | java -jar dist-1.133/bioalcidae.jar -f script.js -F sam

output:

13.0
2.0
12.0
13.0
24.0
18.0
9.0
2.0
17.0
17.0
2.0
11.0
1.0
6.0
7.0
1.0
1.0
9.0
14.0
22.0
0.0
1.0
2.0
0.0