Question: How to get percent identity from bam file ?
0
gravatar for David
12 months ago by
David150
David150 wrote:

I have aligned raw 1D nanopore reads to a bacterial reference assembly using BWA –MEM and got the sam and bam file.

Is there a quick way to get the of percentage identity for each read from the sam/bam file ???

I would like to generate an histogram with reads count and the percent identity to the reference.

Thanks,

bwa samtools bam • 1.2k views
ADD COMMENTlink modified 12 months ago • written 12 months ago by David150

Here is the output ??

bioalcidaejdk   -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' 1.bam

[DEBUG][BioAlcidaeJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.regex.*;
         4  import java.util.function.*;
         5  import htsjdk.samtools.*;
         6  import htsjdk.samtools.util.*;
         7  import htsjdk.variant.variantcontext.*;
         8  import htsjdk.variant.vcf.*;
         9  import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
        10  import javax.annotation.Generated;
        11  import htsjdk.variant.vcf.*;
        12  /** begin user's packages */
        13  /** end user's packages */
        14  @Generated(value="BioAlcidaeJdk",date="2018-03-22T12:40:58+0100")
        15  public class BioAlcidaeJdkCustom1825748580 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.SAMHandler {
        16    public BioAlcidaeJdkCustom1825748580() {
        17    }
        18    @Override
        19    public void execute() throws Exception {
        20     // user's code starts here 
        21     stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});
        22      //user's code ends here 
        23     }
        24  }

I don´t get anything else ???

ADD REPLYlink written 12 months ago by David150
1

Here is the output, redirecting stderr to /dev/null, using a BAM processed with bwa/samtools: (https://github.com/lindenb/jvarkit/blob/master/src/test/resources/S1.bam )

$ java -jar dist/bioalcidaejdk.jar -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' 2> /dev/null src/test/resources/S1.bam

97  479
98  696
100 476
90  1
91  4
92  28
94  73
95  241
ADD REPLYlink modified 12 months ago • written 12 months ago by Pierre Lindenbaum118k

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment belongs under @Pierre's answer.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax64k
0
gravatar for Pierre Lindenbaum
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

use the edit distance 'NM' ? Here is a solution using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' input.bam
ADD COMMENTlink written 12 months ago by Pierre Lindenbaum118k

My bam file was generated as follows:

minimap2 -ax asm10 -t 8 canuAssembly.contigs.mmi demultiplex/BC01.fastq.gz | samtools sort -@8 -o 1.sorted.bam

Ignoring SAM validation error: ERROR: Record 15966, Read name 1b0bcee6-ba9e-409d-9172-4817a6e6fc4a, CIGAR covers 5190 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 15970, Read name d47ea4a6-ac3a-491d-ad02-fca9ed47a2fd, CIGAR covers 4084 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 15971, Read name 1b0bcee6-ba9e-409d-9172-4817a6e6fc4a, CIGAR covers 5190 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 15973, Read name 80fa6511-2211-43fd-a6cf-fc487c110a54, CIGAR covers 2272 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 16109, Read name 94561bf6-e9dd-4751-acd7-3593ac6f84e5, CIGAR covers 12412 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 16113, Read name 52159ffb-ab8a-409f-add4-142e065e463a, CIGAR covers 5135 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 16114, Read name 530f195d-2f2b-4c06-a2c9-f2e565ae779c, CIGAR covers 19737 bases but the sequence is 0 read bases 0 37748 96 1207

0 37748

96 1207

-2147483648 180

97 1054

98 837

99 266

70 1

74 2

80 1

82 1

83 4

84 1

85 11

86 16

87 73

88 389

89 1427

90 2762

91 2513

92 1589

93 1181

94 1275

95 1331

ADD REPLYlink modified 12 months ago • written 12 months ago by David150

so this is your histogram : 1st column is read-length - NM, second column is occurence. negative number i (2147483648) might be due to some error in your bams (there is a CIGAR string but the SEQ is empty)

ADD REPLYlink written 12 months ago by Pierre Lindenbaum118k

Sorry Pierre but I think the original question asked a read by read assessment of percent identity, rather than histogram. I was looking for that as well. Is there any way to tailor this so that one column is read name and second column is percent identity to ref? Thanks

ADD REPLYlink written 5 weeks ago by Adrian Pelin2.2k
0
gravatar for WouterDeCoster
12 months ago by
Belgium
WouterDeCoster37k wrote:

If you would use google you might find some hits. For me the top hit was... my own blog, containing a python solution based on either the NM or MD tag.

Shameless advertisement: plots like that (similar) are available in NanoPlot.

ADD COMMENTlink modified 12 months ago • written 12 months ago by WouterDeCoster37k

I´ll check Nanoplot. thanks

ADD REPLYlink written 12 months ago by David150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1120 users visited in the last hour