Question: Calculating Coverage From Pileup File To Find Gene Duplication Events
7.4 years ago
United States
thecuriousbiologist480 wrote:

Hello,

I have a pileup file like the one below:

```
seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23  ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23  ,.$....,,.,.,...,,,.,...    7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23  ,$....,,.,.,...,,,.,...^l.  <+;9*<<<<<<<<<=<<:;<<<<
```

I need to compute the coverage of each gene from this pileup file. If a gene's coverage is above a certain threshold, I want to treat that as a gene duplication event.

How can I go about solving this problem?

The only file that I have is the pileup file. I don't have a BAM file for this.

7.4 years ago
Scotland, UK
Joseph Hughes wrote:

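The 5th column lists the read bases at that position. The characters . and , are matches to the reference on the forward and reverse strand respectively, and A, C, G, T (upper case for the forward strand, lower case for the reverse strand) are mismatches, i.e. alternate alleles. A deleted base is represented by *, $ marks the end of a read, and ^ marks the start of a read. The single character after ^ encodes the mapping quality of that read, not a base, so it must be skipped. Insertions and deletions relative to the reference appear as + or - followed by a length and the inserted or deleted sequence, and these should be skipped too. So all you need to do in your favourite scripting language is count the . and , characters and the ACGTN/acgtn letters in column 5, after skipping those markers, and that will give you the coverage at that particular position.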
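The counting described above can be sketched in Python. This is a minimal illustration rather than a reference parser: the function name `count_bases` and the `count_deletions` switch are hypothetical, and whether a `*` placeholder should count towards coverage is a choice you have to make for your own analysis.

```python
def count_bases(bases: str, count_deletions: bool = False) -> int:
    """Count reads that contribute a base call in a pileup column-5 string.

    Hypothetical helper: skips read-start markers (and the mapping-quality
    character that follows them), read-end markers, and indel descriptions.
    """
    count = 0
    i = 0
    n = len(bases)
    while i < n:
        c = bases[i]
        if c == "^":
            # '^' marks a read start; the next character is the read's
            # mapping quality, not a base, so skip both.
            i += 2
        elif c == "$":
            # '$' marks a read end and carries no base of its own.
            i += 1
        elif c in "+-":
            # Indel: '+' or '-', a decimal length, then that many bases.
            j = i + 1
            while j < n and bases[j].isdigit():
                j += 1
            if j == i + 1:
                # No length follows; treat as a stray character.
                i += 1
            else:
                i = j + int(bases[i + 1:j])
        else:
            if c in ".,ACGTNacgtn" or (count_deletions and c == "*"):
                count += 1
            i += 1
    return count


# First row of the example pileup: the count agrees with column 4.
print(count_bases(",.$.....,,.,.,...,,,.,..^+."))  # 24
```

For the example rows in the question, this count matches the depth already given in the 4th column.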

Hope that helps, Joseph

Thanks. Can I just use the 4th column directly to find the mean for specific regions, rather than parsing the 5th column?

Let's say I have a gene that covers the 2nd, 3rd and 4th positions in the above example (273 to 275). Can I not just add 23+23+23 and divide by 3? That would mean I have 23X coverage for this gene, is that correct?

Yes, you can simply use the 4th column, and the average coverage is correct.
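As a rough sketch of that averaging, the Python below sums the depth column over a gene's coordinates and compares the mean to a cutoff. The names `mean_depth` and `is_candidate_duplication` are hypothetical, gene coordinates are assumed to be 1-based and inclusive, and the threshold is something you would need to choose yourself. One assumption worth stating: the sum is divided by the gene length rather than by the number of lines seen, because pileup files often leave out positions with no reads, and those should count as zero.

```python
from typing import Iterable


def mean_depth(pileup_lines: Iterable[str], chrom: str, start: int, end: int) -> float:
    """Mean of pileup column 4 over chrom:start-end (1-based, inclusive).

    Positions missing from the pileup are treated as zero depth,
    since the total is divided by the region length.
    """
    total = 0
    for line in pileup_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            total += int(fields[3])
    return total / (end - start + 1)


def is_candidate_duplication(depth: float, threshold: float) -> bool:
    """Flag a gene whose mean depth exceeds a user-chosen threshold."""
    return depth > threshold
```

With the four example rows, `mean_depth(lines, "seq1", 273, 275)` gives (23 + 23 + 23) / 3 = 23.0. In practice you would pass an open file handle, e.g. `with open(path) as fh: mean_depth(fh, ...)`, where `path` points to your pileup file.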