Question: Counting summary info (count of reads and average GC) for segment with specific length
0
gravatar for Korsocius
2.3 years ago by
Korsocius110
Korsocius110 wrote:

Dear all, I would like to create table from bam file. Where will be information about GC and for read_count in lenght range.

INPUT: tsv file

CHROM           Start   stop   length  GC
chr1             56971  57065   94  0.287234
chr1            565460  565601  141 0.411348
chr1            754342  754488  146 0.520548
chr1            754392  754544  152 0.532895
chr1            754392  754544  152 0.532895
chr1            767020  767159  139 0.345324

Output:

chrom       region        0-10 lenght          10-20   .................190-200 
chr1        0-60000      Read count,mean GC

So, output will be some summarization in 60kb region. where will be information about count of reads and average GC for this count from 5th column and it will be for each length.

Now I have for binarization:

  for z in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y


do

export $z

for i in {0..249480000..60000}
    do

u=$i
let "u +=60000"

export $i 
export $u 

samtools view *.bam  chr$z:$i-$u | 

done 
done

Thank you Fedik

next-gen genome • 1.0k views
ADD COMMENTlink modified 2.3 years ago by venu6.1k • written 2.3 years ago by Korsocius110
0
gravatar for venu
2.3 years ago by
venu6.1k
Germany
venu6.1k wrote:

I would do something like following. You can loop this over a set of regions or write a small script

paste <(samtools view file.bam 1:0-60000 | wc -l) <(samtools view file.bam 1:0-60000 | awk -F'\t' '{print $10}' | tr -d '\n' | grep -o "GC" | wc -l) <(samtools view file.bam 1:0-60000 | awk -F'\t' '{print $10}' | tr -d '\n' | awk '{print length($1)}') | awk '{print "1:0-60000\t"$1"\t"$2/$3*100}' | sed '1s/^/region\treadCount\tGCper\n/'

Result

region  readCount       GCper
1:0-60000       18237   5.27966
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by venu6.1k

I ve got this. But I need separate by length in region.... This is biggest problem for me. And for you the easiest way :

for z in *.bam
do

bedtools map -a /home/fil/Desktop/binary_NIFTY.bed -b $z -c 10,10 -o count,concat | awk -v OFS="\t" 'BEGIN {print "CHR\tSTART\tSTOP\tCount_READS\tGCcontent"} {n=length($5); gc=gsub("[gcGC]", "", $5); print $1,$2,$3,$4,gc/n}' > "$z".tsv

done
ADD REPLYlink written 2.3 years ago by Korsocius110

its great if your problem is solved. I'm experimenting a lot with process substitutions. This solution came as a result of one of experiments. Anyways the problem is solved :)

ADD REPLYlink written 2.3 years ago by venu6.1k

No it is not :-( I need summary for length distribution. F.e. : for chr1:0-60000 GC and Read count for length 0-10 and GC and Read count for length 10-20 etc to 190-200bp...and for another region...

ADD REPLYlink written 2.3 years ago by Korsocius110

Change regions and follow above mentioned method. It will work.

ADD REPLYlink written 2.3 years ago by venu6.1k

I am really lost in space with this :-(

ADD REPLYlink written 2.3 years ago by Korsocius110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 681 users visited in the last hour