How to obtain the non-redundant total size of the GRCh38 genome?
3
0
Entering edit mode
7.2 years ago

I am trying to obtain the overall size of the exome, i.e. every single position that has coverage. As a control I need to find the theoretical size from the reference genome. I have a few questions:

  • Is there an official size for the exome from the GRCh38 genome?
  • if not, which is there efficient way to calculate it? A script to merge/coalesce all the exon from the reference genome?

Thanks,
Salvo

exome • 2.2k views
ADD COMMENT
1
Entering edit mode

It should be possible to convert the gene annotation into a GRange object (like here). Once this is done, computing exome size is easy if you know R (see here for instance).

ADD REPLY
0
Entering edit mode

Hi Salvo,

You'll get all covered bases of your alignment, whether these fall into exons or not. I think the best way to get your statistics is to generate/download a exon bed file (Biomart would be the easiest way). With this you can compute directly the sum of all features, and intersect it with your coverage file.

You have to take care of alternative splicing within an exon, alternative polyadenylation site usage, etc.. Also you should think of including/excluding the alternative scaffold/patches (e.g. GL000194.1 , KI270726.1 ,...).

Cheers, Michael

[Update] This comment was written as a response to a given awk-script. Which is removed from the original post.

ADD REPLY
1
Entering edit mode
7.2 years ago

The easiest way is probably to use BEDtools 'genomecov' with the genome and exonic GFFs.

ADD COMMENT
1
Entering edit mode
7.2 years ago
venu 7.1k

This has been answered a while ago under - How to obtain the length of coding regions for the list of genes?

All you need is

  • GTF file
  • BedTools
ADD COMMENT
0
Entering edit mode

Thanks Venu for your helps, I used your command line:

cat gencode.v22.annotation.gtf | awk '{if($3=="exon") print $10"\t"$5-$4}' | sed -e 's/"//g' -e 's/;//' | bedtools groupby -i - -g 1 -c 2 -o sum > Exon_lengths.txt

and I obtained 290,590,453 (GRCh38 used in GDC data common for WXS).

ADD REPLY
2
Entering edit mode

Just a couple of comments on this script...

  • Shouldn't feature length be $5-$4+1? Plus one because GTF starts from 1 not 0.

  • print $10 prints the first attribute of the of the attribute field (actually the string between the 9th and 10th blank character). This may be ok but it may be a bit brittle.

  • To compute the total length you cannot just sum Exon_lengths.txt. I think you should merge the coordinates of the exons after you extract them from the GTF. If exons overlap or are shared among transcripts then you double count their length if you don't merge.

ADD REPLY
0
Entering edit mode
7.2 years ago

To obtain the same thing from Bam coming from WXS, I used this strategy:

I made the pileup with using the following command line:

samtools mpileup 0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam > 0017ba4c33a07ba807b29140b0662cb1_gdc_realn.pileup

later I used this command line:

awk '{if($4 > 10)SUM+=1} END{print SUM}' 0017ba4c33a07ba807b29140b0662cb1_gdc_realn.pileup

the coverage is $4, and I obtained 95,972,925.

ADD COMMENT

Login before adding your answer.

Traffic: 2506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6