Question: How to merge/combine previously split bam files?
0
gravatar for ycsm
20 months ago by
ycsm10
ycsm10 wrote:

Another newbie I'm afraid ...... I have been sent my bam file data split into 5 parts (total around 69gb) and wondered how to merge the files back into 1 bam file. I have installed samtools but there are various options in there and not sure which ones I should use if any? Also how can I verify that it has been done correctly? I have my corresponding VCF file so do I make a VCF file from the merged bam file and make sure the new and old are the same? Sorry for the newbie questions!

ADD COMMENTlink modified 20 months ago • written 20 months ago by ycsm10
1

Do you have 5 VCF files or just one? Was the bam split into pieces for ease of delivery/download (i.e. all the analysis done when it was intact)? Finally why do you want to combine the files? For some other analysis? samtools merge would be the simplest way. You may need to sort/index the file after merging depending on what the pieces were like.

ADD REPLYlink modified 20 months ago by h.mon24k • written 20 months ago by genomax63k

Thank you. I have just one VCF file. Yes it was split for ease of delivery with the analysis already done. Yes I want to combine the files for analysis. Do I merge the files as this "samtools merge out.bam in1.bam in2.bam in3.bam" without any options? Was not sure about headers etc. How will I know if I have to sort/index the file afterwards? Thank you.

These are the options I don't know if I need to use .........

OPTIONS:

-1 Use zlib compression level 1 to compress the output.

-b FILE List of input BAM files, one file per line.

-f Force to overwrite the output file if present.

-h FILE Use the lines of FILE as `@' headers to be copied to out.bam, replacing any header lines that would otherwise be copied from in1.bam. (FILE is actually in SAM format, though any alignment records it may contain are ignored.)

-n The input alignments are sorted by read names rather than by chromosomal coordinates

-t TAG The input alignments have been sorted by the value of TAG, then by either position or name (if -n is given).

-R STR Merge files in the specified region indicated by STR [null]

-r Attach an RG tag to each alignment. The tag value is inferred from file names.

-u Uncompressed BAM output

-c When several input files contain @RG headers with the same ID, emit only one of them (namely, the header line from the first file we find that ID in) to the merged output file. Combining these similar headers is usually the right thing to do when the files being merged originated from the same file.

Without -c, all @RG headers appear in the output file, with random suffixes added to their IDs where necessary to differentiate them.

-p Similarly, for each @PG ID in the set of files to merge, use the @PG line of the first file we find that ID in rather than adding a suffix to differentiate similar IDs.

ADD REPLYlink modified 20 months ago • written 20 months ago by ycsm10

Thank you. I have installed Samtools 1.5. Do I merge the files as this "samtools merge out.bam in1.bam in2.bam in3.bam" without any options? Was not sure about headers etc.

ADD REPLYlink written 20 months ago by ycsm10
1

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 20 months ago by genomax63k

Thank you. I have just one VCF file. Yes it was split for ease of delivery with the analysis already done. Yes I want to combine the files for analysis. Do I merge the files as this "samtools merge out.bam in1.bam in2.bam in3.bam" without any options? Was not sure about headers etc. How will I know if I have to sort/index the file afterwards? Thank you.

These are the options I don't know if I need to use .........

OPTIONS:

-1 Use zlib compression level 1 to compress the output.

-b FILE List of input BAM files, one file per line.

-f Force to overwrite the output file if present.

-h FILE Use the lines of FILE as `@' headers to be copied to out.bam, replacing any header lines that would otherwise be copied from in1.bam. (FILE is actually in SAM format, though any alignment records it may contain are ignored.)

-n The input alignments are sorted by read names rather than by chromosomal coordinates

-t TAG The input alignments have been sorted by the value of TAG, then by either position or name (if -n is given).

-R STR Merge files in the specified region indicated by STR [null]

-r Attach an RG tag to each alignment. The tag value is inferred from file names.

-u Uncompressed BAM output

-c When several input files contain @RG headers with the same ID, emit only one of them (namely, the header line from the first file we find that ID in) to the merged output file. Combining these similar headers is usually the right thing to do when the files being merged originated from the same file.

Without -c, all @RG headers appear in the output file, with random suffixes added to their IDs where necessary to differentiate them.

-p Similarly, for each @PG ID in the set of files to merge, use the @PG line of the first file we find that ID in rather than adding a suffix to differentiate similar IDs.

ADD REPLYlink written 20 months ago by ycsm10
1

Start with the simple command you had posted above (only option would be to add is more than 1 threads, if you have access to a multi-core machine and fast file system, otherwise don't). Do you know if the individual BAM have headers (and if they are identical, check by samtools view -H)?

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax63k

Thank you. Yes the 5 bam headers are the same. Does that alter anything? Thank you for helping.

ADD REPLYlink written 20 months ago by ycsm10
1

If the merge command does not add a header then you could add -h file1.bam as the file to use the header from.

ADD REPLYlink written 20 months ago by genomax63k

Thank you. Samtools merged the files changing 5 part files (69gb) to 1 bam file (56gb) and it has a header. How do I know if I need to index and sort? Is my confidence (that I have done it right) gained by converting the finished BAM to VCF and see if it matches the one I have? Thank you so much for all your help.

ADD REPLYlink written 20 months ago by ycsm10
1

Just to be safe sort and index the file (especially if you are planning to do some analysis on it). samtools sort -o sorted_file.bam original.bam should do it. Make sure you have enough RAM/storage available in the directory where you sort since there would be temp files written in the process (that should get removed once sorting is successful).

You can try generating VCF again. Make sure you are using the same software versions/command line options as used originally.

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax63k

Thank you so much for all the information - I'll try that. Will Samtools deconstruct my bam file and return it to its original state before the Human Genome Reference38 was applied to it? Is that converting it to fastq? My thoughts are that I want to leave my 'virgin' bam file with my children and grandchildren for their possible use in the future. Once again thank you for all your help and support. I would still be staring at the screen without your help.

ADD REPLYlink written 20 months ago by ycsm10
1

Will Samtools deconstruct my bam file and return it to its original state before the Human Genome Reference38 was applied to it? Is that converting it to fastq?

You could do that with samtools fastq command. This will allow you to recover the fastq format sequence files.

ADD REPLYlink written 20 months ago by genomax63k

Thank you so much. You've been very helpful. I'll try and do all that. Thank you.

ADD REPLYlink written 20 months ago by ycsm10

With your help, I think I am ready to run fastq now. I ran flagstat on my sorted bam with the results below. Hopefully they are good. Should I be using any of the options that are available with fastq or just run it as samtools fastq? Thank you so much for all your help.

$ samtools flagstat mydata_sorted.bam
839667458 + 0 in total (QC-passed reads + QC-failed reads)
23306152 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
466993833 + 0 mapped (55.62% : N/A)
816361306 + 0 paired in sequencing
408091264 + 0 read1
408270042 + 0 read2
430841954 + 0 properly paired (52.78% : N/A)
431524999 + 0 with itself and mate mapped
12162682 + 0 singletons (1.49% : N/A)
213835 + 0 with mate mapped to a different chr
213835 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLYlink modified 20 months ago by genomax63k • written 20 months ago by ycsm10

Trying to make this easier for you to read .......

$ samtools flagstat mydata_sorted.bam

839667458 + 0 in total (QC-passed reads + QC-failed reads)

23306152 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

466993833 + 0 mapped (55.62% : N/A)

816361306 + 0 paired in sequencing

408091264 + 0 read1

408270042 + 0 read2

430841954 + 0 properly paired (52.78% : N/A)

431524999 + 0 with itself and mate mapped

12162682 + 0 singletons (1.49% : N/A)

213835 + 0 with mate mapped to a different chr

213835 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLYlink written 20 months ago by ycsm10

Why do you want to go back to the fastq files this way? Can't you ask the provider to give them to you?

ADD REPLYlink written 20 months ago by genomax63k

I have asked and they say that unfortunately they are unable to help past what they have provided. Do the flagstat results look good? Hopefully I am only one step away to the fastq version.

ADD REPLYlink written 20 months ago by ycsm10

Fastq data is just raw sequences (where one starts analysis) so you are sort of going backwards here. VCF is a final product of analysis of aligned fastq data (which is in BAM files).

Mapping percentage looks low(er) than what I expect for human whole genome data (which is what I assume this is). You can try Qualimap with you BAM file to get more visual look into your BAM.

ADD REPLYlink written 20 months ago by genomax63k

Thank you. My thoughts were to have a fastq version that I can leave my children and grandchildren. In the future they might want to refer back to it. To me it makes sense to have the raw data (rather than manipulated) so they can run whatever software is available then against it. There appears to be lots of informative software available to the public to run against 123andme data but not so much for raw data but I am sure that will change in years to come? What would the expected 'norm' mapping percentage be. Should I be concerned about it being lower? Can't thank you enough for all your help and advice.

ADD REPLYlink written 20 months ago by ycsm10

I ran the Qualimap that you suggested with the following results. Have you any comments on them? Thank you so much for your your help and advice.

File type - Conventional base calls, Encoding - Sanger / Illumina 1.9, Total Sequences - 839667458, Sequences flagged as poor quality - 0, Sequence length - 35-151, %GC - 44

PASS - Basic Statistics, PASS - Per base sequence quality, PASS - Per sequence quality scores, FAIL - Per base sequence content, WARN - Per sequence GC content, PASS - Per base N content, WARN - Sequence Length Distribution, WARN -Sequence Duplication Levels, PASS - Overrepresented sequences, PASS - Adapter Content, FAIL - Kmer Content

ADD REPLYlink modified 20 months ago • written 20 months ago by ycsm10
1
gravatar for Pierre Lindenbaum
20 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

use samtools merge http://www.htslib.org/doc/samtools.html or picard MergeSamFiles https://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles

ADD COMMENTlink written 20 months ago by Pierre Lindenbaum118k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1611 users visited in the last hour