Question: Separate Bam File
1
gravatar for filipzembol
3.1 years ago by
filipzembol90
filipzembol90 wrote:

Dear all, I have one problem I need to count number of rows and GC content in bam file for individual segments, but I do not know, how to separate it. the size of segment is 60kb. Than for reads, which has start coordinates to 60000 in chr 1, I will receive the count of reads (rows) and GC content. Please could you help me. I have the source code for count NR in awk and GC content, but it is only for all reads in bam file. And I have some source code to separate bed file to each 60kb segments if it help you.

Count of GC

awk '{ n=length($10); print $10, gsub(/[GCCgcs]/,"",$10)/n;}'   your.sam

Count of reads (rows)

awk 'END {print NR}'

Separate bed file:

$ awk \
     '{ \
        regionChromosome = $1; \
         regionStart = $2; \
         regionStop = $3; \
         val = 1; \
          for ( baseStart = regionStart; baseStart < regionStop; baseStart += 60000 ) { \
            baseStop = baseStart + 60000; \
            print regionChromosome"\t"baseStart"\t"baseStop"\t"val++; \
          } \
      }' hg19.extents.bed \
    | bedmap --echo --echo-map-id --delim "\t" 1.bed - > answer.1.bed

Thank you so much.

perl bam coordinates awk • 1.3k views
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by filipzembol90
3
gravatar for Sean Davis
3.1 years ago by
Sean Davis23k
National Institutes of Health, Bethesda, MD
Sean Davis23k wrote:

You might try using samtools:

samtools view my.bam chr1:60000-120000 | your_awk_command

This will pipe all reads that overlap with a region on chromosome 1 between 60kb and 120kb.

ADD COMMENTlink written 3.1 years ago by Sean Davis23k

Yes I understand you but iI need to do it for all bam fie. It is mean, I separate bam file to each bam files, separate by chromosome (I receive 25 files) and every file I have to separate to each part. Than from 0-60kb; 60kb-120kb,120kb-180kb to the end and for this partitions count GC and fragments.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by filipzembol90
1

You can specify the chromosome you want:

samtools view my.bam chr1:60000-120000 | your_awk_command
samtools view my.bam chr1:120000-180000 | your_awk_command
samtools view my.bam chr1:180000-240000 | your_awk_command
...
samtools view my.bam chr2:60000-120000 | your_awk_command
samtools view my.bam chr2:120000-180000 | your_awk_command
samtools view my.bam chr2:180000-240000 | your_awk_command

etc.

Of course, I'd recommend creating a small script to do it, but you don't need to pull apart the bam file to operate on different chromosomes.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Fwip470

Ok, i understand of this, but I separate becasue for cycle will be for me easier than if I have all chrom together, i am totally beginner. I do not have any other idea to do this. Because if I will write all segments by hand, it is unpossible.

ADD REPLYlink written 3.1 years ago by filipzembol90
1

This is a great opportunity to learn some programming and scripting. I would suggest identifying a local "mentor" who can work with you to do so; doing so will be much more efficient than using biostars. If this is not something you feel comfortable with, then I think you might consider identifying a collaborator who can help you with your data, as data analysis is absolutely FULL of small tasks like this that require scripting.

ADD REPLYlink written 3.1 years ago by Sean Davis23k
0
gravatar for filipzembol
3.1 years ago by
filipzembol90
filipzembol90 wrote:

Ok I create a code like this, which solve my problem, but still time I have one problem and I cannot finish it. I would like to create many outputs (files), which has name same like variable $h. Than I want one think, if the for cycle for "h" finished, the code create a text fie with same name like $h.txt . I tried this, but it doesnt work:

 #!/bin/bash 
 for h in {1..22..1} 
 do 
 for i in {0..249480000..60000}
 do 
 u=$i let "u +=60000"

 echo $i"-"$u | samtools view /home/filip/Desktop/Posledni\ NIFTY/019/odfiltrovany019.bam  chr$h:$i-$u | awk '{ n=length($10);print gsub(/[GCCgcs]/,"",$10)/n;}'| awk '{s+=$1}END{print NR,s/NR}' > ${h%}.txt

 done  
 done

Thank you for your help

ADD COMMENTlink written 3.1 years ago by filipzembol90

Best to ask a new question. You'll want to include your code as well as desired output and any errors you receive.

ADD REPLYlink written 3.1 years ago by Sean Davis23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 968 users visited in the last hour