Question: samtools sort bam file
1
gravatar for mikysyc2016
2.2 years ago by
mikysyc201670
mikysyc201670 wrote:

When I use samtools sort bam file, do i need to use samtool index the sorted bam file? what kind situation i need to use samtools index? Thanks!

chip-seq • 8.4k views
ADD COMMENTlink modified 2.2 years ago by WouterDeCoster44k • written 2.2 years ago by mikysyc201670
10
gravatar for WouterDeCoster
2.2 years ago by
Belgium
WouterDeCoster44k wrote:

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file

Based on your explanation you do, the following, or something similar:

samtools view myfile.sam -o myfile.bam
samtools sort myfile.bam -o myfile_sorted.bam

That's not wrong, but it's also not necessary. Note that you can do the following in one go:

samtools sort myfile.sam -o myfile_sorted.bam

Finally, often you can also have your aligner write directly to samtools sort:

bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

As such you can avoid quite some wasted time and intermediate files.

Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right?

A bam is a binary (compressed) format. For some applications, e.g. visualization in IGV or variant calling the tool needs 'random access' to the bam file, it needs to be able to find reads without iterating through the entire file. That's what the index is for.

We don't know what your "later analysis" is (please try to be as informative as possible when asking questions), but either it will work (no index required) or it will fail (and tell you it needs an index). Index files are generally small and created quickly, so there is no need to worry about them. Just index your bam.

How to define " coordinate sorted"?

A bam is coordinate sorted if the reads are sorted by coordinates. So reads from the beginning of the first chromosome are first in the file. That's what samtools sort is for. The opposite would be 'sorted by name' in which reads are sorted by their read ID. Aligners also generate unsorted bams.

ADD COMMENTlink written 2.2 years ago by WouterDeCoster44k

Thank you for your nice reply. If i am right, once i get the "myfile_sorted.bam", I can filter it to get the mapped bam file? My later analysis is using picard to mark pcr duplicates.

ADD REPLYlink written 2.2 years ago by mikysyc201670

I can filter it to get the mapped bam file?

That would be possible, but that's probably not necessary. I don't know what the aim is of your analysis though.

ADD REPLYlink written 2.2 years ago by WouterDeCoster44k

I do ChIP-seq analysis.Later i need the bam file to do peak calling.

ADD REPLYlink written 2.2 years ago by mikysyc201670
1

You don't need to filter out unmapped reads, they'll be ignored by the peak caller (MACS2 will also ignore duplicates for you, so you don't actually need to mark them).

ADD REPLYlink written 2.2 years ago by Devon Ryan96k

Yes, I prepare to use MACS2 to do peak calling. So your meaning is I just need sorted bam file(as below) to use MACS2?

$ bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

And do not underdstand why MACS2 can remove unmapped reads and duplicates, if i do not remove umapped reads and not mark the dups^^

I have another question. I see MACS2 process in two webpage. as follow : 1. https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/03_peak_calling_macs.html 2. https://github.com/taoliu/MACS/wiki/Advanced:-Call-peaks-using-MACS2-subcommands. do you think the first one can do all the things as the second one(the second has several subcommands and need to change certain parameters)?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by mikysyc201670

Unmapped reads are marked in the file already, almost every program knows to skip them when appropriate. MACS2 keeps track of where reads start and internally tracks duplicates accordingly.

The link (callpeaks) is the only one you should typically use. It is itself going through the process described in the second link, which is something you should only ever do manually if absolutely needed.

Whether you filter blacklisted regions beforehand or not is up to you. You can also filter peaks overlapping blacklisted regions, which tends to be quicker.

ADD REPLYlink written 2.2 years ago by Devon Ryan96k

Do you remove blacklist before you use MACS2?

ADD REPLYlink written 2.2 years ago by mikysyc201670

I don't expect unmapped reads would interfere with that but I don't know the usual procedure of Chip-seq.

ADD REPLYlink written 2.2 years ago by WouterDeCoster44k
0
gravatar for h.mon
2.2 years ago by
h.mon30k
Brazil
h.mon30k wrote:

Please read How To Ask Good Questions On Technical And Scientific Forums. It will help you formulate your questions better, and maybe even it will help you to solve your problems during a more complete elaboration of your questions.

From the samtools man page:

 index     samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]

     Index  a  coordinate-sorted  BAM or CRAM file for fast random
     access.  (Note that this does not work with SAM files even if
     they are bgzip compressed — to index such files, use tabix(1)
     instead.)

     This index is needed when region arguments are used to  limit
     samtools  view  and similar commands to particular regions of
     interest.
ADD COMMENTlink written 2.2 years ago by h.mon30k

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file. Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right? How to define " coordinate sorted"?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by mikysyc201670
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1722 users visited in the last hour