Skipping intronic locations of spliced reads (N CIGAR) in sam2tsv (jvarkit)
1
0
Entering edit mode
3.8 years ago
kbiostars • 0

Hi,

I am using sam2tsv (from jvarkit) to print BAM alignments from HISAT2 as a TAB delimited file, which has been extremely useful to find the position of mutations that do not match the reference.

Since the BAM files are large, I have split them using GATK SplitSamByNumberOfReads to run sam2tsv in each of them in parallel.

However, the time it takes to run sam2tsv is still very long. Since I used a splice-aware aligner for my RNA-seq data, some reads in my BAM file span introns (denoted by the CIGAR operator N). I believe that sam2tsv is taking very long to run due to the printing of the "N" positions of the read (which sometimes comprise thousands of base pairs). For example, this might be the CIGAR information of a read in my BAM file: "56M883N39M". I believe that printing those 883N in sam2tsv is what is slowing down the process. I tried to pipe the output of sam2tsv to awk to get only "M" positions, but that is also slow, as one would expect.

Questions

  1. Is there a way to have sam2tsv skip N positions while it is printing the TAB delimited file? (as opposed to using awk afterwards to filter out rows that have an N as the CIGAR operator).

  2. As alternative, is it possible to strip the "N" nucleotides from the BAM file while keeping the "M" information of the read (i.e. the matched positions)? I am not talking about filtering out the whole spliced read. Rather, only the information about the intron, in order to speed up processing in sam2tsv.

Thank you very much!

jvarkit sam2tsv bam • 1.4k views
ADD COMMENT
0
Entering edit mode

which has been extremely useful to find the position of mutations that do not match the reference

why don't you use bcftools mpileup ?

ADD REPLY
0
Entering edit mode

I also need the name of the reads that have the mutations, the position of the mutation within the read, and the nucleotide composition of the read with the mutation. I thought that with your sam2tsv you could get all that information with a single tool quickly.

ADD REPLY
3
Entering edit mode
3.8 years ago

in https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/sam2tsv/Sam2Tsv.java#L405

replace

             case N :  //cont. -- reference skip

with

             case N: refIndex+=e.getLength(); break;
ADD COMMENT
0
Entering edit mode

Hi! Thank you so much for your quick response! I have just tested and works perfectly fine - there is no output of lines with an "N" cigar operator. I will test how this improves running times.

Thanks for developing these tools and your contributions to the community! Will make sure to include the citation to your tools!

ADD REPLY

Login before adding your answer.

Traffic: 2498 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6