I'm trying to calculate one thing and I'm sure someone else has also done this. So maybe someone have a script ready!
I have a draft de novo genome, and augustus predicted too many genes (54 thousand), so I've aligned my RNA-seq to the exons predicted to discharge the exons with bad (5 or less) bases covered by rna-seq reads.
So, I have submitted my bam.sorted file to genomeCoverageBed script with the flag -bga, so I can have the info of 0 coverage.
My output is like this:
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 0 62 0
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 89 90 0
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 224 225 0
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 252 253 0
Which means, for example for line 1, the exon 'tr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1' has zero coverage in the interval 0 to 62 bases. Next line shows another interval for the same exon...
I also have a file with the exon IDs and their total length. Like this:
So, the info I would like to extract is: to select the exons that have 70% or more of its total sequence covered by 0 bases.. So I can take them out of my genome prediction.
Does anyone has done that and can help me?
Thank you so much!