Question

Counting summary info (count of reads and average GC) for segment with specific length

0

Entering edit mode

8.6 years ago

Korsocius ▴ 270

Dear all, I would like to create table from bam file. Where will be information about GC and for read_count in lenght range.

INPUT: tsv file

CHROM           Start   stop   length  GC
chr1             56971  57065   94  0.287234
chr1            565460  565601  141 0.411348
chr1            754342  754488  146 0.520548
chr1            754392  754544  152 0.532895
chr1            754392  754544  152 0.532895
chr1            767020  767159  139 0.345324

Output:

chrom       region        0-10 lenght          10-20   .................190-200 
chr1        0-60000      Read count,mean GC

So, output will be some summarization in 60kb region. where will be information about count of reads and average GC for this count from 5th column and it will be for each length.

Now I have for binarization:

  for z in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y


do

export $z

for i in {0..249480000..60000}
    do

u=$i
let "u +=60000"

export $i 
export $u 

samtools view *.bam  chr$z:$i-$u | 

done 
done

Thank you Fedik

next-gen genome • 3.0k views

ADD COMMENT • link updated 8.6 years ago by venu 7.1k • written 8.6 years ago by Korsocius ▴ 270

score 0 · Answer 1 · 2016-12-14

0

Entering edit mode

8.6 years ago

venu 7.1k

I would do something like following. You can loop this over a set of regions or write a small script

paste <(samtools view file.bam 1:0-60000 | wc -l) <(samtools view file.bam 1:0-60000 | awk -F'\t' '{print $10}' | tr -d '\n' | grep -o "GC" | wc -l) <(samtools view file.bam 1:0-60000 | awk -F'\t' '{print $10}' | tr -d '\n' | awk '{print length($1)}') | awk '{print "1:0-60000\t"$1"\t"$2/$3*100}' | sed '1s/^/region\treadCount\tGCper\n/'

Result

region  readCount       GCper
1:0-60000       18237   5.27966

ADD COMMENT • link 8.6 years ago by venu 7.1k

0

Entering edit mode

I ve got this. But I need separate by length in region.... This is biggest problem for me. And for you the easiest way :

for z in *.bam
do

bedtools map -a /home/fil/Desktop/binary_NIFTY.bed -b $z -c 10,10 -o count,concat | awk -v OFS="\t" 'BEGIN {print "CHR\tSTART\tSTOP\tCount_READS\tGCcontent"} {n=length($5); gc=gsub("[gcGC]", "", $5); print $1,$2,$3,$4,gc/n}' > "$z".tsv

done

ADD REPLY • link 8.6 years ago by Korsocius ▴ 270

0

Entering edit mode

its great if your problem is solved. I'm experimenting a lot with process substitutions. This solution came as a result of one of experiments. Anyways the problem is solved :)

ADD REPLY • link 8.6 years ago by venu 7.1k

0

Entering edit mode

No it is not :-( I need summary for length distribution. F.e. : for chr1:0-60000 GC and Read count for length 0-10 and GC and Read count for length 10-20 etc to 190-200bp...and for another region...