Question: Loss of reads during realignment bam
0
gravatar for mb2subi
8 weeks ago by
mb2subi0
mb2subi0 wrote:

I everyone,

I am trying to realign a bam file from hg37 to hg38. The problem is that I am losing half of the reads while the process. The fastq is paired-end.

Here the samtools flagstat for both:

Raw bam (hg37)

2457420710 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
49697790 + 0 supplementary
93027061 + 0 duplicates
2453403628 + 0 mapped (99.84% : N/A)
2407722920 + 0 paired in sequencing
1203861460 + 0 read1
1203861460 + 0 read2
2354059656 + 0 properly paired (97.77% : N/A)
2400927418 + 0 with itself and mate mapped
2778420 + 0 singletons (0.12% : N/A)
33747364 + 0 with mate mapped to a different chr
17954283 + 0 with mate mapped to a different chr (mapQ>=5)

New aligned bam (hg38):

1220323060 + 0 in total (QC-passed reads + QC-failed reads)
1536660 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1198748250 + 0 mapped (98.23% : N/A)
26942266 + 0 paired in sequencing
13471133 + 0 read1
13471133 + 0 read2
4244064 + 0 properly paired (15.75% : N/A)
6058498 + 0 with itself and mate mapped
7556561 + 0 singletons (28.05% : N/A)
1669472 + 0 with mate mapped to a different chr
833174 + 0 with mate mapped to a different chr (mapQ>=5)

For the generation of fastq files from bam I am using samToBam from GATK

java -XX:ParallelGCThreads=${cores} -jar $picard SamToFastq I=$file FASTQ=$file_R1 SECOND_END_FASTQ=$file_R2

And then I made the align using bwa and sam tools for the postprocessing of the bam:

bwa mem -p -M -t $cores -R ${rg_info} -v 1 ${ref} $file_R1 $file_R2 | \
samtools view -@ $cores -Sb - | \
samtools sort -@ $cores - -o $outBam

And for marking duplicates:

java -jar -XX:ParallelGCThreads=${cores} \
    -Djava.io.tmpdir=$(dirname $outBam)/tmp \
    $picard MarkDuplicates \
    I=${outBam} O=${outBamNonExt}_mkdup.bam \
    M=${fileNonExt}_mkdup.log

samtools index -@ $cores ${outBamNonExt}_mkdup.bam

Once created the fastq I analysed using fastQValidator and I got this output:

> ERROR on Line 45: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1416:6937/1 at Lines 41 and 45 ERROR on
> Line 245: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1417:29834/1 at Lines 241 and 245 ERROR
> on Line 1013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1420:64620/1 at Lines 1009 and 1013
> ERROR on Line 1737: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:44831/1 at Lines 1733 and 1737
> ERROR on Line 1893: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:81419/1 at Lines 1889 and 1893
> ERROR on Line 2253: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:73804/1 at Lines 2249 and 2253
> ERROR on Line 2321: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:98182/1 at Lines 2317 and 2321
> ERROR on Line 2969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2969
> ERROR on Line 2973: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2973
> ERROR on Line 3193: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1426:11367/1 at Lines 3189 and 3193
> ERROR on Line 3969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:27238/1 at Lines 3965 and 3969
> ERROR on Line 4153: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:70127/1 at Lines 4149 and 4153
> ERROR on Line 4901: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:61256/1 at Lines 4897 and 4901
> ERROR on Line 4969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:73752/1 at Lines 4965 and 4969
> ERROR on Line 5013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5013
> ERROR on Line 5017: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5017
> ERROR on Line 5105: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:6958/1 at Lines 5101 and 5105
> ERROR on Line 5173: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:22113/1 at Lines 5169 and 5173
> ERROR on Line 5225: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:29783/1 at Lines 5221 and 5225
> ERROR on Line 5517: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:94291/1 at Lines 5513 and 5517
> Finished processing
> /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq
> with 580178304 lines containing 1218786400 sequences. There were a
> total of 14924940 errors. Returning: 1 : FASTQ_INVALID

If I look in the @RG of the original bam using

samtools view -H FI43683.bam | grep '@RG'

The output is as follows:

@RG ID:QCMG:20130118043852808   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_3
@RG ID:QCMG:20130118055238928   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_2
@RG ID:QCMG:20130118062743433   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_5
@RG ID:QCMG:20130118064849885   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_4
@RG ID:QCMG:20130118121216373   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_1
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118043852808\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_3\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_0    PN:bwa  VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118055238928\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_2\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_1    PN:bwa  PP:bamsort_0    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118062743433\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_5\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_2    PN:bwa  PP:bamsort_1    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118064849885\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_4\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_3    PN:bwa  PP:bamsort_2    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118121216373\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_1\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_4    PN:bwa  PP:bamsort_3    VN:0.7.8-r455

Whereas in the new bam (as expected) is like this:

@RG ID:FI43683  SM:FI43683  PL:ILLUMINA
@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem -p -M -t 16 -R @RG\tID:FI43683\tSM:FI43683\tPL:ILLUMINA /imppc/labs/lplab/msubirana/../share/marc/refgen/hg38/hg38.fa /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R2.fq

Any suggestions?

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by mb2subi0
3
gravatar for WouterDeCoster
8 weeks ago by
Belgium
WouterDeCoster39k wrote:

I'd suggest taking a look at BAZAM: https://github.com/ssadedin/bazam

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by WouterDeCoster39k

It worked perfectly thanks!

ADD REPLYlink written 7 weeks ago by mb2subi0
0
gravatar for ATpoint
8 weeks ago by
ATpoint18k
Germany
ATpoint18k wrote:

Make sure your BAM is name-sorted or at least shuffled/grouped (samtools sort or collate) prior to converting to fastq. BWA expects random fastq order. For conversion I have good experience with samtools fastq -n. I think the issue was that your BAM file was indeed not name-sorted. Therefore reads not actually belonging together in terms of being mates were grouped, which is reflected by the poor percentage of properly-paired reads. Repeat everything with the above suggestions and see what comes out. Maybe test with a subset of the original BAM to have things a bit faster.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by ATpoint18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1393 users visited in the last hour