Question: Skipping intronic locations of spliced reads (N CIGAR) in sam2tsv (jvarkit)
0
gravatar for kbiostars
6 weeks ago by
kbiostars0
kbiostars0 wrote:

Hi,

I am using sam2tsv (from jvarkit) to print BAM alignments from HISAT2 as a TAB delimited file, which has been extremely useful to find the position of mutations that do not match the reference.

Since the BAM files are large, I have split them using GATK SplitSamByNumberOfReads to run sam2tsv in each of them in parallel.

However, the time it takes to run sam2tsv is still very long. Since I used a splice-aware aligner for my RNA-seq data, some reads in my BAM file span introns (denoted by the CIGAR operator N). I believe that sam2tsv is taking very long to run due to the printing of the "N" positions of the read (which sometimes comprise thousands of base pairs). For example, this might be the CIGAR information of a read in my BAM file: "56M883N39M". I believe that printing those 883N in sam2tsv is what is slowing down the process. I tried to pipe the output of sam2tsv to awk to get only "M" positions, but that is also slow, as one would expect.

Questions

  1. Is there a way to have sam2tsv skip N positions while it is printing the TAB delimited file? (as opposed to using awk afterwards to filter out rows that have an N as the CIGAR operator).

  2. As alternative, is it possible to strip the "N" nucleotides from the BAM file while keeping the "M" information of the read (i.e. the matched positions)? I am not talking about filtering out the whole spliced read. Rather, only the information about the intron, in order to speed up processing in sam2tsv.

Thank you very much!

sam2tsv bam jvarkit • 106 views
ADD COMMENTlink modified 6 weeks ago by Pierre Lindenbaum129k • written 6 weeks ago by kbiostars0

which has been extremely useful to find the position of mutations that do not match the reference

why don't you use bcftools mpileup ?

ADD REPLYlink written 6 weeks ago by Pierre Lindenbaum129k

I also need the name of the reads that have the mutations, the position of the mutation within the read, and the nucleotide composition of the read with the mutation. I thought that with your sam2tsv you could get all that information with a single tool quickly.

ADD REPLYlink written 6 weeks ago by kbiostars0
3
gravatar for Pierre Lindenbaum
6 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

in https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/sam2tsv/Sam2Tsv.java#L405

replace

             case N :  //cont. -- reference skip

with

             case N: refIndex+=e.getLength(); break;
ADD COMMENTlink written 6 weeks ago by Pierre Lindenbaum129k

Hi! Thank you so much for your quick response! I have just tested and works perfectly fine - there is no output of lines with an "N" cigar operator. I will test how this improves running times.

Thanks for developing these tools and your contributions to the community! Will make sure to include the citation to your tools!

ADD REPLYlink written 6 weeks ago by kbiostars0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 698 users visited in the last hour