Question: samtools sort bam file
0
gravatar for mikysyc2016
11 months ago by
mikysyc201640
mikysyc201640 wrote:

When I use samtools sort bam file, do i need to use samtool index the sorted bam file? what kind situation i need to use samtools index? Thanks!

chip-seq • 3.1k views
ADD COMMENTlink modified 11 months ago by WouterDeCoster39k • written 11 months ago by mikysyc201640
5
gravatar for WouterDeCoster
11 months ago by
Belgium
WouterDeCoster39k wrote:

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file

Based on your explanation you do, the following, or something similar:

samtools view myfile.sam -o myfile.bam
samtools sort myfile.bam -o myfile_sorted.bam

That's not wrong, but it's also not necessary. Note that you can do the following in one go:

samtools sort myfile.sam -o myfile_sorted.bam

Finally, often you can also have your aligner write directly to samtools sort:

bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

As such you can avoid quite some wasted time and intermediate files.

Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right?

A bam is a binary (compressed) format. For some applications, e.g. visualization in IGV or variant calling the tool needs 'random access' to the bam file, it needs to be able to find reads without iterating through the entire file. That's what the index is for.

We don't know what your "later analysis" is (please try to be as informative as possible when asking questions), but either it will work (no index required) or it will fail (and tell you it needs an index). Index files are generally small and created quickly, so there is no need to worry about them. Just index your bam.

How to define " coordinate sorted"?

A bam is coordinate sorted if the reads are sorted by coordinates. So reads from the beginning of the first chromosome are first in the file. That's what samtools sort is for. The opposite would be 'sorted by name' in which reads are sorted by their read ID. Aligners also generate unsorted bams.

ADD COMMENTlink written 11 months ago by WouterDeCoster39k

Thank you for your nice reply. If i am right, once i get the "myfile_sorted.bam", I can filter it to get the mapped bam file? My later analysis is using picard to mark pcr duplicates.

ADD REPLYlink written 11 months ago by mikysyc201640

I can filter it to get the mapped bam file?

That would be possible, but that's probably not necessary. I don't know what the aim is of your analysis though.

ADD REPLYlink written 11 months ago by WouterDeCoster39k

I do ChIP-seq analysis.Later i need the bam file to do peak calling.

ADD REPLYlink written 11 months ago by mikysyc201640
1

You don't need to filter out unmapped reads, they'll be ignored by the peak caller (MACS2 will also ignore duplicates for you, so you don't actually need to mark them).

ADD REPLYlink written 11 months ago by Devon Ryan90k

Yes, I prepare to use MACS2 to do peak calling. So your meaning is I just need sorted bam file(as below) to use MACS2?

$ bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

And do not underdstand why MACS2 can remove unmapped reads and duplicates, if i do not remove umapped reads and not mark the dups^^

I have another question. I see MACS2 process in two webpage. as follow : 1. https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/03_peak_calling_macs.html 2. https://github.com/taoliu/MACS/wiki/Advanced:-Call-peaks-using-MACS2-subcommands. do you think the first one can do all the things as the second one(the second has several subcommands and need to change certain parameters)?

ADD REPLYlink modified 11 months ago • written 11 months ago by mikysyc201640

Unmapped reads are marked in the file already, almost every program knows to skip them when appropriate. MACS2 keeps track of where reads start and internally tracks duplicates accordingly.

The link (callpeaks) is the only one you should typically use. It is itself going through the process described in the second link, which is something you should only ever do manually if absolutely needed.

Whether you filter blacklisted regions beforehand or not is up to you. You can also filter peaks overlapping blacklisted regions, which tends to be quicker.

ADD REPLYlink written 11 months ago by Devon Ryan90k

Do you remove blacklist before you use MACS2?

ADD REPLYlink written 11 months ago by mikysyc201640

I don't expect unmapped reads would interfere with that but I don't know the usual procedure of Chip-seq.

ADD REPLYlink written 11 months ago by WouterDeCoster39k
0
gravatar for h.mon
11 months ago by
h.mon25k
Brazil
h.mon25k wrote:

Please read How To Ask Good Questions On Technical And Scientific Forums. It will help you formulate your questions better, and maybe even it will help you to solve your problems during a more complete elaboration of your questions.

From the samtools man page:

 index     samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]

     Index  a  coordinate-sorted  BAM or CRAM file for fast random
     access.  (Note that this does not work with SAM files even if
     they are bgzip compressed — to index such files, use tabix(1)
     instead.)

     This index is needed when region arguments are used to  limit
     samtools  view  and similar commands to particular regions of
     interest.
ADD COMMENTlink written 11 months ago by h.mon25k

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file. Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right? How to define " coordinate sorted"?

ADD REPLYlink modified 11 months ago • written 11 months ago by mikysyc201640
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1104 users visited in the last hour