Question

Integrity of bam files

0

Entering edit mode

11 weeks ago

SHN ▴ 40

Hello All,

I used the code below to align my sample reads, running 10 parallel jobs and converting the aligned reads directly into sorted BAM files.

ls *_R1_val_1.fq.gz | sed 's/_R1_val_1.fq.gz//g' | sort -u | parallel -j 10 '
hisat2 --summary-file "'"$output_mapped"'"/{1}.hisat2.summary \
--dta -x /home/grch38_tran/genome_tran \
-q -1 {1}_R1_val_1.fq.gz -2 {1}_R2_val_2.fq.gz \
| samtools sort -o "'"$output_mapped"'"/{1}_sorted.bam

'

While reviewing the output for my first sample, I noticed that initially 32 temporary BAM files were generated. However, upon checking the directory again, I found that temporary files 1 to 30 had been removed, and only temporary files 31 to 36 remained. I'm aware that samtools removes intermediate files during the merging process, but I want to ensure that no data was lost during alignment—particularly due to potential memory issues.

Given that no index files were generated, could you please advise how I can verify the integrity and completeness of the resulting BAM files?

Thanks

integrity RAN-seq bam samtools • 2.3k views

ADD COMMENT • link 11 weeks ago by SHN ▴ 40

GenoMax · Answer 1 · 2025-08-13

1

Entering edit mode

11 weeks ago

GenoMax 154k

but I want to ensure that no data was lost during alignment—particularly due to potential memory issues.

Did any samples complete alignment all the way or the very first sample led to this?

Not what you want to hear ... but in a case like this it would be best to redo the analysis than try to salvage anything. If your job got killed/died because of exhausting the resources, reduce the parallel jobs so less resources are needed. You should also reallocate some cores to the sort part to manage the input being ingested from the alignment.

ADD COMMENT • link 11 weeks ago by GenoMax 154k

0

Entering edit mode

I'd even run rather a for loop instead of a prarallel job, but using hisat2's and samtools' multi-threading capabilities.

ADD REPLY • link 11 weeks ago by michael.ante ★ 4.0k

0

Entering edit mode

Thanks for your response. I performed alignment summary

 47714869 reads; of these:\
  47714869 (100.00%) were paired; of these:\
    1681285 (3.52%) aligned concordantly 0 times\
    44920489 (94.14%) aligned concordantly exactly 1 time\
    1113095 (2.33%) aligned concordantly >1 times\
    ----\
    1681285 pairs aligned concordantly 0 times; of these:\
      286095 (17.02%) aligned discordantly 1 time\
    ----\
    1395190 pairs aligned 0 times concordantly or discordantly; of these:\
      2790380 mates make up the pairs; of these:\
        1510540 (54.13%) aligned 0 times\
        1204832 (43.18%) aligned exactly 1 time\
        75008 (2.69%) aligned >1 times\
98.42% overall alignment rate\

I ran this code to check on the stats of the bam file

    samtools flagstat sampl_sorted.bam

102689150 + 0 in total (QC-passed reads + QC-failed reads)\
95429738 + 0 primary\
7259412 + 0 secondary\
0 + 0 supplementary\
0 + 0 duplicates\
0 + 0 primary duplicates\
101178610 + 0 mapped (98.53% : N/A)\
93919198 + 0 primary mapped (98.42% : N/A)\
95429738 + 0 paired in sequencing\
47714869 + 0 read1\
47714869 + 0 read2\
92067168 + 0 properly paired (96.48% : N/A)\
92975884 + 0 with itself and mate mapped\
943314 + 0 singletons (0.99% : N/A)\
758356 + 0 with mate mapped to a different chr\
725555 + 0 with mate mapped to a different chr (mapQ>=5)\

Considering this result, is it possible that I still have truncation?

ADD REPLY • link updated 11 weeks ago by GenoMax 154k • written 11 weeks ago by SHN ▴ 40

1

Entering edit mode

While the read numbers seem to match, I would be weary of using data where the analysis step did not complete normally.

ADD REPLY • link 11 weeks ago by GenoMax 154k

0

Entering edit mode

Thank you for your information, I'll follow along and make justifications in my code.

ADD REPLY • link 11 weeks ago by SHN ▴ 40