Question: Getting the number of SNPs in some ranges
0
gravatar for A
10 months ago by
A3.7k
A3.7k wrote:

Hi,

I have called copy number and I have 2 files (I have shared link of my files ); One contains some ranges

> head(cndata[,c(2,3,4)])
     Chromosome   Start      End
4470          1   51479   817980
4471          1  818499  1136753
4472          1 1138735  2558308
4473          1 2558740  5724264
4474          1 5724940  5749083
4475          1 5749226 12529544
>

[REDACTED DROPBOX LINK]

I have another file like below

              Chromosome         Position
rs62635286       1                 51479   
rs75454623       1                 114930
rs806731         1                 30923

[REDACTED DROPBOX LINK]

How I can count the number of SNP in each range? For example based on the second table how many SNP are in range of 51479 to 817980?

linux cnv R wgs • 641 views
ADD COMMENTlink modified 9 months ago by RamRS25k • written 10 months ago by A3.7k

I'm not so comfortable with R, but I would probably do this with either bedtools (if you have bed files of ranges and positions) or with bcftools (if you have a vcf file for your SNPs and a bed file or ranges).

ADD REPLYlink written 10 months ago by WouterDeCoster43k

I don't have bed file :(

ADD REPLYlink written 10 months ago by A3.7k

There are a few filetypes in bioinformatics which are used A LOT, such as bed, vcf, sam/bam, gff, fasta/fastq. Always try to work with these formats, since there are so many options and tools which you can use to get your stuff done. Rolling your own solution in a scripting language like R/Python is definitely possible and you'll learn some coding while you're doing that, but is unlikely the fastest thing you can do to get your problem solved.

ADD REPLYlink written 10 months ago by WouterDeCoster43k

We need to "merge with overlap/range" then "group by count", searching these terms should help you solve the problem.

ADD REPLYlink written 10 months ago by zx87549.0k

Hello F!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8531/extracting-the-number-snp-in-each-range

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 10 months ago by RamRS25k

Instead of tracing my activities in different forums, you guys please help me :(

ADD REPLYlink written 10 months ago by A3.7k
3

You do not help your self. I have lost count of the number of questions you have asked in recent days/weeks which show absolutely no attempt at the task, nor willingess to learn.

We are not here to write code for you.

ADD REPLYlink written 10 months ago by Joe16k
1

We did help you. There are solutions in this thread. What's wrong with those?

ADD REPLYlink written 10 months ago by WouterDeCoster43k

My brain not working nowadays :( too complicated solutions

ADD REPLYlink written 10 months ago by A3.7k
3

We're glad to put you on the track to a solution, but we're not going to do your work. I estimate that the solution described here would take you 15 minutes of concentration, maximally. By the way, come say hi on slack if you have the time.

ADD REPLYlink written 10 months ago by WouterDeCoster43k
1

Soooo...you just expect others to do your work for you?

ADD REPLYlink written 10 months ago by Joe16k
1
gravatar for A
10 months ago by
A3.7k
A3.7k wrote:
snp_table <- read.table('WTSI-OESO_121_1pre.copynumber.txt')
genomic_ranges <- read.table('WTSI-OESO_121_1pre.copynumber.caveman.txt.csv', sep = ',', col.names = c('Chromosome', 'Start', 'End'), stringsAsFactors = F)

how_many_in_range <- function(coords){
    coords = as.vector(coords)
    sum(snp_table$Chromosome == coords[1] & snp_table$Position > as.numeric(coords[2]) & snp_table$Position < as.numeric(coords[3]))
}

genomic_ranges$number_of_snps <- apply(genomic_ranges, 1, how_many_in_range)
ADD COMMENTlink modified 10 months ago by RamRS25k • written 10 months ago by A3.7k
3

Thanks for posting a home-grown solution that worked for you! Do note that the other replies suggest solutions that employ widely accepted and optimized tools for exactly the kind of question you were addressing. It's usually more efficient in the long-run to familiarize oneself with standard tools of the field rather than asking other to re-invent a script, as small as it may be.

ADD REPLYlink modified 10 months ago • written 10 months ago by Friederike5.3k
1

Try to verify the other solutions and accept ones that work. Also, it is good practice to link to the original source when you copy-paste code in open forums.

This solution is from Bioinfo SE: https://bioinformatics.stackexchange.com/a/8532/650

ADD REPLYlink modified 10 months ago • written 10 months ago by RamRS25k
5
gravatar for Damian Kao
10 months ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You have a .csv file with chromosome, start, end coordinates. You can change that into a .bed file pretty easily. Same with your SNP csv file.

Convert those csv files to bed and then use bedtools intersect.

ADD COMMENTlink written 10 months ago by Damian Kao15k

Just in case you're not sure this is what BED looks like. All you need to do is reorder your columns. You can do it in Excel.

ADD REPLYlink written 10 months ago by Emily_Ensembl20k

Just to be complete, the definition of the BED file format simply ask for exactly those three columns, separated by tabs. To turn your second file into a BED file, you would obviously need to add a start column. Note that BED files' coordinates are zero-based, half-open coordinates (explained, for example, here), so it may help precision if you found out which convention was used for the files you have to make the conversion exact.

ADD REPLYlink written 10 months ago by Friederike5.3k
4
gravatar for ATpoint
10 months ago by
ATpoint29k
Germany
ATpoint29k wrote:

Save GRanges as BED ( How To Write Data In A Granges Object To A Bed File. ) and then use BEDtools intersect. Have a look at the -c and -wa -wb options. This should not be a problem.

ADD COMMENTlink modified 10 months ago • written 10 months ago by ATpoint29k
3
gravatar for Alex Reynolds
10 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Convert your SNPs from VCF to BED with vcf2bed, and then map them with bedmap --count to count SNPs that overlap the ad-hoc region:

$ vcf2bed < snps.vcf > snps.bed
$ echo -e '1\t51479\t817980' | bedmap --echo --count --delim '\t' - snps.bed > answer.bed

The count will be in the fourth column of answer.bed.

ADD COMMENTlink modified 10 months ago • written 10 months ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1637 users visited in the last hour