Loss of reads during realignment bam
2
0
Entering edit mode
5.0 years ago
mb2subi ▴ 10

I everyone,

I am trying to realign a bam file from hg37 to hg38. The problem is that I am losing half of the reads while the process. The fastq is paired-end.

Here the samtools flagstat for both:

Raw bam (hg37)

2457420710 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
49697790 + 0 supplementary
93027061 + 0 duplicates
2453403628 + 0 mapped (99.84% : N/A)
2407722920 + 0 paired in sequencing
1203861460 + 0 read1
1203861460 + 0 read2
2354059656 + 0 properly paired (97.77% : N/A)
2400927418 + 0 with itself and mate mapped
2778420 + 0 singletons (0.12% : N/A)
33747364 + 0 with mate mapped to a different chr
17954283 + 0 with mate mapped to a different chr (mapQ>=5)

New aligned bam (hg38):

1220323060 + 0 in total (QC-passed reads + QC-failed reads)
1536660 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1198748250 + 0 mapped (98.23% : N/A)
26942266 + 0 paired in sequencing
13471133 + 0 read1
13471133 + 0 read2
4244064 + 0 properly paired (15.75% : N/A)
6058498 + 0 with itself and mate mapped
7556561 + 0 singletons (28.05% : N/A)
1669472 + 0 with mate mapped to a different chr
833174 + 0 with mate mapped to a different chr (mapQ>=5)

For the generation of fastq files from bam I am using samToBam from GATK

java -XX:ParallelGCThreads=${cores} -jar $picard SamToFastq I=$file FASTQ=$file_R1 SECOND_END_FASTQ=$file_R2

And then I made the align using bwa and sam tools for the postprocessing of the bam:

bwa mem -p -M -t $cores -R ${rg_info} -v 1 ${ref} $file_R1 $file_R2 | \
samtools view -@ $cores -Sb - | \
samtools sort -@ $cores - -o $outBam

And for marking duplicates:

java -jar -XX:ParallelGCThreads=${cores} \
    -Djava.io.tmpdir=$(dirname $outBam)/tmp \
    $picard MarkDuplicates \
    I=${outBam} O=${outBamNonExt}_mkdup.bam \
    M=${fileNonExt}_mkdup.log

samtools index -@ $cores ${outBamNonExt}_mkdup.bam

Once created the fastq I analysed using fastQValidator and I got this output:

> ERROR on Line 45: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1416:6937/1 at Lines 41 and 45 ERROR on
> Line 245: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1417:29834/1 at Lines 241 and 245 ERROR
> on Line 1013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1420:64620/1 at Lines 1009 and 1013
> ERROR on Line 1737: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:44831/1 at Lines 1733 and 1737
> ERROR on Line 1893: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:81419/1 at Lines 1889 and 1893
> ERROR on Line 2253: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:73804/1 at Lines 2249 and 2253
> ERROR on Line 2321: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:98182/1 at Lines 2317 and 2321
> ERROR on Line 2969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2969
> ERROR on Line 2973: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2973
> ERROR on Line 3193: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1426:11367/1 at Lines 3189 and 3193
> ERROR on Line 3969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:27238/1 at Lines 3965 and 3969
> ERROR on Line 4153: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:70127/1 at Lines 4149 and 4153
> ERROR on Line 4901: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:61256/1 at Lines 4897 and 4901
> ERROR on Line 4969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:73752/1 at Lines 4965 and 4969
> ERROR on Line 5013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5013
> ERROR on Line 5017: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5017
> ERROR on Line 5105: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:6958/1 at Lines 5101 and 5105
> ERROR on Line 5173: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:22113/1 at Lines 5169 and 5173
> ERROR on Line 5225: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:29783/1 at Lines 5221 and 5225
> ERROR on Line 5517: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:94291/1 at Lines 5513 and 5517
> Finished processing
> /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq
> with 580178304 lines containing 1218786400 sequences. There were a
> total of 14924940 errors. Returning: 1 : FASTQ_INVALID

If I look in the @RG of the original bam using

samtools view -H FI43683.bam | grep '@RG'

The output is as follows:

@RG ID:QCMG:20130118043852808   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_3
@RG ID:QCMG:20130118055238928   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_2
@RG ID:QCMG:20130118062743433   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_5
@RG ID:QCMG:20130118064849885   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_4
@RG ID:QCMG:20130118121216373   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_1
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118043852808\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_3\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_0    PN:bwa  VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118055238928\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_2\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_1    PN:bwa  PP:bamsort_0    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118062743433\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_5\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_2    PN:bwa  PP:bamsort_1    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118064849885\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_4\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_3    PN:bwa  PP:bamsort_2    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118121216373\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_1\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_4    PN:bwa  PP:bamsort_3    VN:0.7.8-r455

Whereas in the new bam (as expected) is like this:

@RG ID:FI43683  SM:FI43683  PL:ILLUMINA
@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem -p -M -t 16 -R @RG\tID:FI43683\tSM:FI43683\tPL:ILLUMINA /imppc/labs/lplab/msubirana/../share/marc/refgen/hg38/hg38.fa /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R2.fq

Any suggestions?

gatk bwa wgs fastQValidator samtools • 1.9k views
ADD COMMENT
3
Entering edit mode
5.0 years ago

I'd suggest taking a look at BAZAM: https://github.com/ssadedin/bazam

ADD COMMENT
0
Entering edit mode

It worked perfectly thanks!

ADD REPLY
0
Entering edit mode

This might be a good place for me to pose my extreme-newbie question. My hg19 full genome was produced for me by Dante Labs, and my next step is to convert (re-align?) one of the resultant files into 'hg38' for optimal results when I submit it to 'YFull'.

I set my Linux machine to use the Python 'crossmap' utility for this purpose, but I have now been informed that using this tool to re-align my 'BAM' file this way is not the correct path to use -- that I should use my 'FASTQ' files instead. So now I have been seeking out some utilities to re-align my FASTQ files from 'hg19' to 'hg38' instead -- and also found references to 'BAZAM', though none of the examples I have found (so far) specifically mention this type of conversion. Your positive response makes it seem that you were successful. Can you provide some tips? Thanks!

ADD REPLY
0
Entering edit mode

To convert a hg19 bam to hg38 you would indeed use bazam. To align reads in fastq format to GRCh38 you would use bwa mem.

I believe bazam uses bwa mem under the hood, and handles the correct conversion to fastq so you don't have to worry about those steps.

ADD REPLY
0
Entering edit mode

Thanks! Since Dante Labs provided the FASTQ files, I am going to attempt to use 'bwa mem' directly to format to GRCh38. All these large files are giving my HD and CPU a good workout! :)

ADD REPLY
0
Entering edit mode
5.0 years ago
ATpoint 81k

Make sure your BAM is name-sorted or at least shuffled/grouped (samtools sort or collate) prior to converting to fastq. BWA expects random fastq order. For conversion I have good experience with samtools fastq -n. I think the issue was that your BAM file was indeed not name-sorted. Therefore reads not actually belonging together in terms of being mates were grouped, which is reflected by the poor percentage of properly-paired reads. Repeat everything with the above suggestions and see what comes out. Maybe test with a subset of the original BAM to have things a bit faster.

ADD COMMENT

Login before adding your answer.

Traffic: 2813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6