Question: Samtools merge Illumina and PB bam file empty
0
gravatar for talbots
6 weeks ago by
talbots0
talbots0 wrote:

Hello,

I've been attempting to use:

samtools merge -r -b list_of_bams.txt output_bam.bam

I have two samples, with each sample having Illumina Hiseq paired end .bam and PacBio paired end reads .bam and their respective readgroups @RG. The Illumina reads were aligned with bwa mem and the PB reads with minimap2. My attempts at concatenating these two samples and their respective reads seemed successful -- I got a rather large .bam file but the program is not recognizing it. And when I load it into IGV it appears empty... This leads me to believe that there is something I am doing wrong with the samtools merge or just the idea of merging Illumina and PacBio reads into the same tracked bam.

I am pretty sure the .bam format supports this type of merge, but was having difficulty finding specific documentation online. Does anyone have any thoughts on this matter?

Thank you

ADD COMMENTlink modified 6 weeks ago by h.mon31k • written 6 weeks ago by talbots0
1

Can you simply not load these as two tracks? One for illumina other for PB? If you just want to view them in IGV.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by genomax89k

Seconding this. Would be good to know if the files work individually or if there's already a problem without them being merged.

ADD REPLYlink written 6 weeks ago by Friederike6.1k

Was the data aligned to the same exact reference genome?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by genomax89k

Yes I should have specified. Same reference for all reads

ADD REPLYlink written 6 weeks ago by talbots0

but the program is not recognizing it.

Which program are you referring to? Do you get an error message?

And when I load it into IGV it appears empty

Did you zoom into a specific gene? How big is the combined BAM file? Did you generate an index?

ADD REPLYlink written 6 weeks ago by Friederike6.1k

Yes I zoomed into a specific gene but also just browsed in general around different scaffolds that I have. There are no reads to be seen I am not sure where the data is in this bam file...

The file size is some how 40gb. Yes I have an index.

Here is my samtools flagstats output:

527372587 + 0 in total (QC-passed reads + QC-failed reads)   
0 + 0 secondary
13227016 + 0 supplementary
52265489 + 0 duplicates
513578785 + 0 mapped (97.38% : N/A) 
514144484 + 0 paired in sequencing
257072242 + 0 read1
257072242 + 0 read2 
460676136 + 0 properly paired (89.60% : N/A) 
494867084 + 0 with itself and mate mapped 
5483598 + 0 singletons (1.07% : N/A) 
27654316 + 0 with mate mapped to a different chr 
19739915 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLYlink modified 6 weeks ago by h.mon31k • written 6 weeks ago by talbots0

I tried to format the samtools flagstat output, but I may have mangled something. Even so, it is clear your file is not empty.

And the output of samtools view output_bam.bam | head? (However, it may be too big to paste here if there are PacBio reads among the first mapped reads.)

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by h.mon31k

samtools view -H out_bam.bam would also be interesting to see

ADD REPLYlink written 6 weeks ago by Friederike6.1k

I am not sure what information you would gather from the header. I can paste a portion of it after the scaffolds... What would you be looking for in the header? It just says which scaffolds, @SQ , readgroups @RG and @PG. I can't paste the header because it is too large, exceeding over 7000 characters.

ADD REPLYlink written 6 weeks ago by talbots0

Well, the diagnostics are a bit difficult as you can tell by all the comments. I can't tell you what precisely I'd be looking for in the header, but for some clue that may explain the issue.

Have you extracted an individual read from the BAM file and tried to specifically zoom into that area in IGV? Can you take a screenshot of that?

ADD REPLYlink written 6 weeks ago by Friederike6.1k

Yes, before I concatenated all the bam files I checked each one to make sure the individual reads within each bam file were viewable in IGV.

I am not sure how to extract an individual read from my concatenated bam file... would it be a samtools command? I Just checked the header again, it doesn't look like anything specific stands out but I will paste a portion.

@PG     ID:samtools-76F6CF77    PN:samtools     PP:bwa-1C25FDEE VN:1.10 CL:samtools view -bS PB_bwa_mem_414.sam
@PG     ID:samtools.1-7D2A4DBF  PN:samtools     PP:samtools-76F6CF77    VN:1.10 CL:samtools sort Jefferson_414_PB_ALIGNMENT_bwamem.bam
@PG     ID:samtools.2-16A0E06B  PN:samtools     PP:samtools.1-7D2A4DBF  VN:1.10 CL:samtools view -h -F 4 Jefferson_414_PB_ALIGNMENT_bwamem_SORTED.bam
@PG     ID:MarkDuplicates-6AABD8D9      VN:2.21.8-SNAPSHOT      CL:MarkDuplicates INPUT=[Jefferson_414_PB_ALIGNMENT_bwamem_SORTED_MAPPED.bam] OUTPUT=dedup.Jefferson_414_PB_ALIGNMENT_SORTED_MAPPED_bwamem.bam METRICS_FILE=PB_marked_dup_metrics_414_bwamem.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false    PN:MarkDuplicates
@PG     ID:bwa-42C2C91  PN:bwa  VN:0.7.12-r1039 CL:bwa mem -t 8 -R @RG\tID:OSU_252.146\tSM:paired_sample252_1\tLB:lib1_252\tPL:illumina path_edited_out 252146_HiseqPE150_all_R1_cleaned.fq.gz 252146_HiseqPE150_all_R2_cleaned.fq.gz
@PG     ID:MarkDuplicates-AF88DE6       VN:2.21.8-SNAPSHOT      CL:MarkDuplicates INPUT=[Jeff_252_readgroupsaln2_HiSeq_aln.bam] OUTPUT=dedup.Jeff_252_readgroupsaln2_HiSeq_aln.bam METRICS_FILE=marked_dup_metrics_252.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false  PN:MarkDuplicates
@PG     ID:bwa-78A7D7B1 PN:bwa  VN:0.7.12-r1039 CL:bwa mem -t 8 -R @RG\tID:OSU_414.062\tSM:paired_sample414_1\tLB:lib1_414\tPL:illumina path_edited_out 414062_HiseqPE150_all_R1_cleaned.fq.gz 414062_HiseqPE150_all_R2_cleaned.fq.gz
@PG     ID:MarkDuplicates-4C442CB       VN:2.21.8-SNAPSHOT      CL:MarkDuplicates INPUT=[Jeff_414_readgroupsaln2_HiSeq_aln.bam] OUTPUT=dedup.Jeff_414_readgroupsaln2_HiSeq_aln.bam METRICS_FILE=marked_dup_metrics_414.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false  PN:MarkDuplicates
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by talbots0

how to extract an individual read from my concatenated bam file

samtools view BAMFILE.bam | head will yield some to get you started

ADD REPLYlink written 6 weeks ago by Friederike6.1k

I am not sure how to extract an individual read from my concatenated bam file.

I think what @Friederike means is just find reads in your BAM files that are aligned to a chromosome (if there are both illumina and PB reads next to each other great) and then try going to that region by typing the coordinates in IGV search window (e.g. chr5:2345-5000).

ADD REPLYlink written 6 weeks ago by genomax89k

Yes, indeed, that's what I tried to suggest. This problem really would benefit from an interactive session ;)

ADD REPLYlink written 6 weeks ago by Friederike6.1k

What samtools flagstats output_bam.bam show? Or samtools view output_bam.bam | head?

ADD REPLYlink written 6 weeks ago by h.mon31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1605 users visited in the last hour