Question: Samtools And Region List
5
gravatar for win
7.3 years ago by
win810
India
win810 wrote:

Hi all, Hope someone can help.

We can pass a region to samtools view such as 1:1000-1010.

Is it possible to pass a list of regions to samtools view and have the output into a single sam file i.e. (something like)

$ samtools view -h .bam regionlist.txt > mysam.sam

what would the syntax for this be?

also, in what format does the region list file should be in?

thanks, a

list samtools • 20k views
ADD COMMENTlink modified 3.8 years ago by Tom Koch110 • written 7.3 years ago by win810

Hello! Thanks everyone for help with a similar problem! I am now only struggling to keep the name of the region with the fragment (I had a bam file of fragments and chose the ones which overlap with the regions of interest (bed file with positions and names of regions) but I need the final file to remember which fragment belongs to which region name). Is it somehow possible?

Thank you! Lucie

ADD REPLYlink written 2.2 years ago by lucik.s.120
15
gravatar for Pablo Marin-Garcia
5.8 years ago by
Spain
Pablo Marin-Garcia1.8k wrote:

Samtoosl have the -L option

samtools view -b -L ROI.bed  file.bam > ROI_file.bam

BUT -L does not use the samtools index so the search is slooow depending on how large is your bam file. In my benchmarks queriying for 10, 100 or 1000 sequences take exactly the same time so seem that bed size does not matter too much. You need to do a bit of benchmarking if you prefer to do a -L or to do a loop (for a bam with 23 million reads take the same time the -L that writting all the stuff for looping ;-)) YMMV

$ time samtools view -b -L 100_ROI.bed file.bam > ROI.bam
real    0m38.831s
user    0m37.666s
sys    0m0.556s


$ time (samtools view -H file.bam > roi_xargs.sam; \
      cat 100_ROI.bed | perl -lane 'print "$F[0]:$F[1]-$F[2]"' | xargs -n1 -t -I{} samtools view file.bam {} >> roi_xargs.sam; \
      samtools view -bSh roi_xargs.sam > roi_xargs.bam \
  )
real   0m7.188s
user   0m5.080s
sys    0m1.304s

And if you feel adventurous and your disks are using isilon or lustre you can use gnu-parallel and forget about same file concurrency and do it in parallel in a breeze. The only issue is that you would need to do the bam merge afterwards.

ADD COMMENTlink written 5.8 years ago by Pablo Marin-Garcia1.8k

Hi Pablo,

 

I want to extract reads from bam file using Samtools view from multiple regions.The regions I have are present in the VCF file,which I want to use.So there are not a small number of regions.

So,I can input VCF file instead of BED file or I have to do some conversions?

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Ron980
3
gravatar for Malachi Griffith
7.3 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith18k wrote:

From the samtools documentation:

     -L FILE  output alignments overlapping the input BED FILE [null]

samtools view -L regionlist.bed -h in.bam > out.sam

The region list file would be in BED format.

ADD COMMENTlink written 7.3 years ago by Malachi Griffith18k
1

The -L option is much slower than directly specifying the region in the command line. I am wondering if anyone find the way to run -L faster.

ADD REPLYlink written 6.5 years ago by yekaixiong10

I also noticed -L is really slow with samtools 0.1.19-44428cd, considering a bed file with only, say, 100 entries. Any ideas why?

ADD REPLYlink written 5.8 years ago by 141341254653464453.5k
1

That is because the -L does not use the index so scan the full file.

ADD REPLYlink written 5.8 years ago by Pablo Marin-Garcia1.8k

I'm having trouble find that in the documentation:

http://samtools.sourceforge.net/samtools.shtml

it might be renamed as -l (lowercase L)

ADD REPLYlink written 7.3 years ago by cl.parallel140

The documentation I posted was taken from: Version: 0.1.18 (r982:295)

ADD REPLYlink written 7.3 years ago by Malachi Griffith18k

I see it now. I wish I could delete my answer.

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by cl.parallel140

I think you can, but it's still a nice answer.

ADD REPLYlink written 7.3 years ago by Matt Shirley9.2k

The online documentation for samtools does not seem to be completely up to date. But if you have it installed you can run it without options and get an accurate list of current parameters and options

ADD REPLYlink written 7.3 years ago by Malachi Griffith18k
2
gravatar for Matt Shirley
7.3 years ago by
Matt Shirley9.2k
Cambridge, MA
Matt Shirley9.2k wrote:

You can also use tabix, which is an index and search tool developed by the Samtools author. Here is an example to fit your needs:

Your SAM file should be coordinate sorted. If it is not, then use gnu sort.

sort -k 3,3 -k 4,4n mysam.sam > mysam.sorted.sam

Then, tabix needs a bgzip compressed file.

bgzip mysam.sorted.sam

Then you need tabix to index your file. This will create a file ending in .tbi

tabix -p sam mysam.sorted.sam.gz

Lastly, you may retrieve your regions in this format:

tabix mysam.sorted.sam.gz chr1:1000-1010 chr2:2000-2020 ... > mysam.sorted.regions.sam
ADD COMMENTlink written 7.3 years ago by Matt Shirley9.2k
1
gravatar for cl.parallel
7.3 years ago by
cl.parallel140
Seattle, WA
cl.parallel140 wrote:

You could use a for loop in a shell script to do this.

# ouput header only, once
samtools view -H aln.bam > mysam.sam
while read line;do
    samtools view aln.bam $line >> mysam.sam
done < regions.txt

where regions.txt is line-based, with each line being:

ref1:start-end

or xargs:

samtools view -H aln.bam > mysam.sam
xargs -a regions.txt -I {} samtools view aln.bam {} >> mysam.sam
ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by cl.parallel140

thanks, this is great. need one more slight detail.

when you say ref1:start-end, i just want to make sure ref1 refers to chromosome number, right?

ADD REPLYlink written 7.3 years ago by win810

yep, chrName:startIndex-endIndex

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by cl.parallel140

Also, you may want to use the -b flag to keep it in binary format for speed and space:

samtools view -bH aln.bam >> mysam.bam
...
samtools view -b aln.bam $line >> mysam.bam
ADD REPLYlink written 7.3 years ago by cl.parallel140

This is a nice, simple solution, but I would urge you to consider tabix if you plan to grab many regions since it should give you much better performance than a shell loop.

ADD REPLYlink written 7.3 years ago by Matt Shirley9.2k

Yes, and for another similar post see:

http://biostars.org/post/show/48963/use-tabix-with-a-list-of-regions/#49244

ADD REPLYlink written 7.3 years ago by cl.parallel140
1
gravatar for Tom Koch
3.8 years ago by
Tom Koch110
Pine Biotech, New Orleans, LA, USA
Tom Koch110 wrote:

Sorry for the bump (it's for future Google visitors), I used cl.parallel's answer as slight guidance but I wanted to get the same region across a set of reads, so I wrote a script. Uses a BED file as input.

for i in $( ls *.bam ); do
    samtools view $i -H > OUTPUTPREFIX${i%.bam}.sam 
    while read -r -a myArray; do 
        samtools view $i ${myArray[0]}:${myArray[1]}-${myArray[2]} >> OUTPUTPREFIX${i%.bam}.sam
     done < INPUTFILE.bed 
done
ADD COMMENTlink written 3.8 years ago by Tom Koch110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1769 users visited in the last hour