Question

Skipping intronic locations of spliced reads (N CIGAR) in sam2tsv (jvarkit)

0

Entering edit mode

3.8 years ago

kbiostars • 0

Hi,

I am using sam2tsv (from jvarkit) to print BAM alignments from HISAT2 as a TAB delimited file, which has been extremely useful to find the position of mutations that do not match the reference.

Since the BAM files are large, I have split them using GATK SplitSamByNumberOfReads to run sam2tsv in each of them in parallel.

However, the time it takes to run sam2tsv is still very long. Since I used a splice-aware aligner for my RNA-seq data, some reads in my BAM file span introns (denoted by the CIGAR operator N). I believe that sam2tsv is taking very long to run due to the printing of the "N" positions of the read (which sometimes comprise thousands of base pairs). For example, this might be the CIGAR information of a read in my BAM file: "56M883N39M". I believe that printing those 883N in sam2tsv is what is slowing down the process. I tried to pipe the output of sam2tsv to awk to get only "M" positions, but that is also slow, as one would expect.

Questions

Is there a way to have sam2tsv skip N positions while it is printing the TAB delimited file? (as opposed to using awk afterwards to filter out rows that have an N as the CIGAR operator).
As alternative, is it possible to strip the "N" nucleotides from the BAM file while keeping the "M" information of the read (i.e. the matched positions)? I am not talking about filtering out the whole spliced read. Rather, only the information about the intron, in order to speed up processing in sam2tsv.

Thank you very much!

jvarkit sam2tsv bam • 1.4k views

ADD COMMENT • link updated 3.8 years ago by Pierre Lindenbaum 161k • written 3.8 years ago by kbiostars • 0

0

Entering edit mode

which has been extremely useful to find the position of mutations that do not match the reference

why don't you use bcftools mpileup ?

ADD REPLY • link 3.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I also need the name of the reads that have the mutations, the position of the mutation within the read, and the nucleotide composition of the read with the mutation. I thought that with your sam2tsv you could get all that information with a single tool quickly.

ADD REPLY • link 3.8 years ago by kbiostars • 0

score 3 · Accepted Answer · 2020-07-04

3

Entering edit mode

3.8 years ago

Pierre Lindenbaum 161k

in https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/sam2tsv/Sam2Tsv.java#L405

replace

             case N :  //cont. -- reference skip

with

             case N: refIndex+=e.getLength(); break;

ADD COMMENT • link 3.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi! Thank you so much for your quick response! I have just tested and works perfectly fine - there is no output of lines with an "N" cigar operator. I will test how this improves running times.

Thanks for developing these tools and your contributions to the community! Will make sure to include the citation to your tools!

ADD REPLY • link 3.8 years ago by kbiostars • 0