Question: SNP Count from telomeres in sliding window
0
gravatar for hellbio
22 months ago by
hellbio420
hellbio420 wrote:

Hi,

I have a list of SNP genomic coordinates:

a.txt
chr1    12345    12345
chr1    32424    32424    
chr4    23234    23234
....
....
chr22    332324    332324

I would like to find the count of SNPs i.e. from the start of the chromosomes in the genome with increasing length of the chromosomes to plot the proportion of SNPs Y-axis and Distance from the telomeres on X-axis. Could someone hint if it is possible by existing tools for eg: bedtools etc..

snp genome • 660 views
ADD COMMENTlink modified 22 months ago by Alex Reynolds31k • written 22 months ago by hellbio420

Be aware that the edges of chromosomes are often low-complexity regions known for accumulating all kinds of false-positives alignments and false variant calls. Read this paper.

ADD REPLYlink written 22 months ago by ATpoint42k
0
gravatar for Alex Reynolds
22 months ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

One way is via the BEDOPS and UCSC Kent utilities toolkits.

First, generate chromosome bounds:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" '{ print $1, "0", $2; }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.nuc.bed

Second, get SNPs (or use your own — replace common_20180418.starch with a sort-bed-sorted copy of a.txt, if you wanted, for instance):

$ VCF=ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/common_all_20180418.vcf.gz
$ wget -qO- ${VCF}
    | gunzip -c -
    | convert2bed --input=vcf --sort-tmpdir=${PWD} -
    | awk '{ print "chr"$0; }' -
    | starch --omit-signature -
    > common_20180418.starch

Finally, map SNPs to windows over the bounds. In the following example, we chop up the chromosome into sliding windows that are 100kb wide and staggered every 20kb. These windows are piped to bedmap, which counts the number of SNPs that overlap each window:

$ bedops --chop 100000 --stagger 20000 hg38.nuc.bed \
    | bedmap --echo --count --delim '\t' - common_20180418.starch \
    > windows_with_counts.bed

The count of SNPs over the window is in the last column of windows_with_counts.bed.

ADD COMMENTlink written 22 months ago by Alex Reynolds31k

Thank you. Does the count table reports the count of SNPs from both ends of the telomeres in each chromosome? Ideally, it is only required to get the count of SNPs for example upto 20Mb from either ends of the chromosomes. The middle regions i.e. SNPs in the centromeres of the chromosomes are not required to be reported.

ADD REPLYlink modified 22 months ago • written 22 months ago by hellbio420

I admit I don't know much about telomeres. Is 20Mb correct? The UCSC Genome Browser gap positions table labels a 10kb region from the tips as telomeres:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/gap.txt.gz | gunzip -c | awk -vFS="\t" -vOFS="\t" '($8 == "telomere"){ print $2, $3, $4 }' | sort-bed - > hg38.telomeres.bed

If you believe this data, then you could window this file, albeit with smaller parameters for the sliding window:

$ bedops --chop 1000 --stagger 500 hg38.telomeres.bed | bedmap ... > windows_with_counts.bed
ADD REPLYlink written 22 months ago by Alex Reynolds31k

Here the issue is that we are working on model organism for which the telomeres are not defined and available in UCSC. Therefore, we work with an assumption of 20Mb or try 10Mb later to define the telomeres. I wonder how to do this with this assumption.

ADD REPLYlink written 22 months ago by hellbio420
1

You could modify the way in which you manipulate bounds:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" -vbounds=20000000 '{ printf("%s\t0\t%d\n%s\t%d\t%d\n", $1, bounds, $1, ($2 - bounds - 1), $2; }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.custom_telomeres.nuc.bed

Replace hg38 with your build name, for the model organism you are working with.

Then use hg38.custom_telomeres.nuc.bed (or whatever it is called) with bedops --chop --stagger to make sliding windows.

ADD REPLYlink written 22 months ago by Alex Reynolds31k

It gives a syntax error at the semicolon after $2 :

awk: { printf("%s\t0\t%d\n%s\t%d\t%d\n", $1, bounds, $1, ($2 - bounds - 1), $2; } ^ syntax error

ADD REPLYlink modified 22 months ago • written 22 months ago by hellbio420

I maybe forgot a trailing parenthesis:

$ fetchChromSizes hg38 \
    | awk -vFS="\t" -vOFS="\t" -vbounds=20000000 '{ printf("%s\t0\t%d\n%s\t%d\t%d\n", $1, bounds, $1, ($2 - bounds - 1), $2); }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.custom_telomeres.nuc.bed
ADD REPLYlink modified 22 months ago • written 22 months ago by Alex Reynolds31k

It still fails with the below error:

INFO: trying MySQL /usr/bin/mysql for database hg38 Non-numeric start coordinate. See line 50 in -. (remember that chromosome names should not contain spaces.)

However, if i have the chromosomes sizes in the below format is there a way without using fetchChromSizes, as my genome build is not available in mysql.

chr1    122678785
chr2    85426708
chr3    91889043
chr4    88276631
ADD REPLYlink written 22 months ago by hellbio420

Just run awk on that file directly, instead of piping in coordinates from fetchChromSizes.

ADD REPLYlink written 22 months ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1117 users visited in the last hour