samtools sort bam file
2
1
Entering edit mode
3.5 years ago
mikysyc2016 ▴ 90

When I use samtools sort bam file, do i need to use samtool index the sorted bam file? what kind situation i need to use samtools index? Thanks!

ChIP-Seq • 20k views
12
Entering edit mode
3.5 years ago

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file

Based on your explanation you do, the following, or something similar:

samtools view myfile.sam -o myfile.bam
samtools sort myfile.bam -o myfile_sorted.bam


That's not wrong, but it's also not necessary. Note that you can do the following in one go:

samtools sort myfile.sam -o myfile_sorted.bam


Finally, often you can also have your aligner write directly to samtools sort:

bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam


As such you can avoid quite some wasted time and intermediate files.

Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right?

A bam is a binary (compressed) format. For some applications, e.g. visualization in IGV or variant calling the tool needs 'random access' to the bam file, it needs to be able to find reads without iterating through the entire file. That's what the index is for.

We don't know what your "later analysis" is (please try to be as informative as possible when asking questions), but either it will work (no index required) or it will fail (and tell you it needs an index). Index files are generally small and created quickly, so there is no need to worry about them. Just index your bam.

How to define " coordinate sorted"?

A bam is coordinate sorted if the reads are sorted by coordinates. So reads from the beginning of the first chromosome are first in the file. That's what samtools sort is for. The opposite would be 'sorted by name' in which reads are sorted by their read ID. Aligners also generate unsorted bams.

0
Entering edit mode

Thank you for your nice reply. If i am right, once i get the "myfile_sorted.bam", I can filter it to get the mapped bam file? My later analysis is using picard to mark pcr duplicates.

0
Entering edit mode

I can filter it to get the mapped bam file?

That would be possible, but that's probably not necessary. I don't know what the aim is of your analysis though.

0
Entering edit mode

I do ChIP-seq analysis.Later i need the bam file to do peak calling.

1
Entering edit mode

You don't need to filter out unmapped reads, they'll be ignored by the peak caller (MACS2 will also ignore duplicates for you, so you don't actually need to mark them).

0
Entering edit mode

Yes, I prepare to use MACS2 to do peak calling. So your meaning is I just need sorted bam file(as below) to use MACS2?

\$ bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

And do not underdstand why MACS2 can remove unmapped reads and duplicates, if i do not remove umapped reads and not mark the dups^^

I have another question. I see MACS2 process in two webpage. as follow : 1. https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/03_peak_calling_macs.html 2. https://github.com/taoliu/MACS/wiki/Advanced:-Call-peaks-using-MACS2-subcommands. do you think the first one can do all the things as the second one(the second has several subcommands and need to change certain parameters)?

0
Entering edit mode

Unmapped reads are marked in the file already, almost every program knows to skip them when appropriate. MACS2 keeps track of where reads start and internally tracks duplicates accordingly.

The link (callpeaks) is the only one you should typically use. It is itself going through the process described in the second link, which is something you should only ever do manually if absolutely needed.

Whether you filter blacklisted regions beforehand or not is up to you. You can also filter peaks overlapping blacklisted regions, which tends to be quicker.

0
Entering edit mode

Do you remove blacklist before you use MACS2?

0
Entering edit mode

I don't expect unmapped reads would interfere with that but I don't know the usual procedure of Chip-seq.

0
Entering edit mode
3.5 years ago
h.mon 33k

From the samtools man page:

 index     samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]

Index  a  coordinate-sorted  BAM or CRAM file for fast random
access.  (Note that this does not work with SAM files even if
they are bgzip compressed — to index such files, use tabix(1)

This index is needed when region arguments are used to  limit
samtools  view  and similar commands to particular regions of
interest.

0
Entering edit mode

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file. Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right? How to define " coordinate sorted"?