Hi,
I am using sam2tsv (from jvarkit) to print BAM alignments from HISAT2 as a TAB delimited file, which has been extremely useful to find the position of mutations that do not match the reference.
Since the BAM files are large, I have split them using GATK SplitSamByNumberOfReads to run sam2tsv in each of them in parallel.
However, the time it takes to run sam2tsv is still very long. Since I used a splice-aware aligner for my RNA-seq data, some reads in my BAM file span introns (denoted by the CIGAR operator N). I believe that sam2tsv is taking very long to run due to the printing of the "N" positions of the read (which sometimes comprise thousands of base pairs). For example, this might be the CIGAR information of a read in my BAM file: "56M883N39M". I believe that printing those 883N in sam2tsv is what is slowing down the process. I tried to pipe the output of sam2tsv to awk to get only "M" positions, but that is also slow, as one would expect.
Questions
Is there a way to have sam2tsv skip N positions while it is printing the TAB delimited file? (as opposed to using awk afterwards to filter out rows that have an N as the CIGAR operator).
As alternative, is it possible to strip the "N" nucleotides from the BAM file while keeping the "M" information of the read (i.e. the matched positions)? I am not talking about filtering out the whole spliced read. Rather, only the information about the intron, in order to speed up processing in sam2tsv.
Thank you very much!
why don't you use
bcftools mpileup
?I also need the name of the reads that have the mutations, the position of the mutation within the read, and the nucleotide composition of the read with the mutation. I thought that with your sam2tsv you could get all that information with a single tool quickly.