Integrity of bam files
1
0
Entering edit mode
4 weeks ago
SHN ▴ 40

Hello All,

I used the code below to align my sample reads, running 10 parallel jobs and converting the aligned reads directly into sorted BAM files.

ls *_R1_val_1.fq.gz | sed 's/_R1_val_1.fq.gz//g' | sort -u | parallel -j 10 '
hisat2 --summary-file "'"$output_mapped"'"/{1}.hisat2.summary \
--dta -x /home/grch38_tran/genome_tran \
-q -1 {1}_R1_val_1.fq.gz -2 {1}_R2_val_2.fq.gz \
| samtools sort -o "'"$output_mapped"'"/{1}_sorted.bam

'

While reviewing the output for my first sample, I noticed that initially 32 temporary BAM files were generated. However, upon checking the directory again, I found that temporary files 1 to 30 had been removed, and only temporary files 31 to 36 remained. I'm aware that samtools removes intermediate files during the merging process, but I want to ensure that no data was lost during alignment—particularly due to potential memory issues.

Given that no index files were generated, could you please advise how I can verify the integrity and completeness of the resulting BAM files?

Thanks

integrity RAN-seq bam samtools • 2.2k views
ADD COMMENT
1
Entering edit mode
4 weeks ago
GenoMax 153k

but I want to ensure that no data was lost during alignment—particularly due to potential memory issues.

Did any samples complete alignment all the way or the very first sample led to this?

Not what you want to hear ... but in a case like this it would be best to redo the analysis than try to salvage anything. If your job got killed/died because of exhausting the resources, reduce the parallel jobs so less resources are needed. You should also reallocate some cores to the sort part to manage the input being ingested from the alignment.

ADD COMMENT
0
Entering edit mode

I'd even run rather a for loop instead of a prarallel job, but using hisat2's and samtools' multi-threading capabilities.

ADD REPLY
0
Entering edit mode

Thanks for your response. I performed alignment summary

 47714869 reads; of these:\
  47714869 (100.00%) were paired; of these:\
    1681285 (3.52%) aligned concordantly 0 times\
    44920489 (94.14%) aligned concordantly exactly 1 time\
    1113095 (2.33%) aligned concordantly >1 times\
    ----\
    1681285 pairs aligned concordantly 0 times; of these:\
      286095 (17.02%) aligned discordantly 1 time\
    ----\
    1395190 pairs aligned 0 times concordantly or discordantly; of these:\
      2790380 mates make up the pairs; of these:\
        1510540 (54.13%) aligned 0 times\
        1204832 (43.18%) aligned exactly 1 time\
        75008 (2.69%) aligned >1 times\
98.42% overall alignment rate\

I ran this code to check on the stats of the bam file

    samtools flagstat sampl_sorted.bam

102689150 + 0 in total (QC-passed reads + QC-failed reads)\
95429738 + 0 primary\
7259412 + 0 secondary\
0 + 0 supplementary\
0 + 0 duplicates\
0 + 0 primary duplicates\
101178610 + 0 mapped (98.53% : N/A)\
93919198 + 0 primary mapped (98.42% : N/A)\
95429738 + 0 paired in sequencing\
47714869 + 0 read1\
47714869 + 0 read2\
92067168 + 0 properly paired (96.48% : N/A)\
92975884 + 0 with itself and mate mapped\
943314 + 0 singletons (0.99% : N/A)\
758356 + 0 with mate mapped to a different chr\
725555 + 0 with mate mapped to a different chr (mapQ>=5)\

Considering this result, is it possible that I still have truncation?

ADD REPLY
1
Entering edit mode

While the read numbers seem to match, I would be weary of using data where the analysis step did not complete normally.

ADD REPLY
0
Entering edit mode

Thank you for your information, I'll follow along and make justifications in my code.

ADD REPLY

Login before adding your answer.

Traffic: 3365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6