Question: Getting the number of SNPs in some ranges
0
gravatar for F
4 weeks ago by
F3.4k
Iran
F3.4k wrote:

Hi,

I have called copy number and I have 2 files (I have shared link of my files ); One contains some ranges

> head(cndata[,c(2,3,4)])
     Chromosome   Start      End
4470          1   51479   817980
4471          1  818499  1136753
4472          1 1138735  2558308
4473          1 2558740  5724264
4474          1 5724940  5749083
4475          1 5749226 12529544
>

[REDACTED DROPBOX LINK]

I have another file like below

              Chromosome         Position
rs62635286       1                 51479   
rs75454623       1                 114930
rs806731         1                 30923

[REDACTED DROPBOX LINK]

How I can count the number of SNP in each range? For example based on the second table how many SNP are in range of 51479 to 817980?

linux cnv R wgs • 387 views
ADD COMMENTlink modified 21 days ago by RamRS21k • written 4 weeks ago by F3.4k

I'm not so comfortable with R, but I would probably do this with either bedtools (if you have bed files of ranges and positions) or with bcftools (if you have a vcf file for your SNPs and a bed file or ranges).

ADD REPLYlink written 4 weeks ago by WouterDeCoster39k

I don't have bed file :(

ADD REPLYlink written 4 weeks ago by F3.4k

There are a few filetypes in bioinformatics which are used A LOT, such as bed, vcf, sam/bam, gff, fasta/fastq. Always try to work with these formats, since there are so many options and tools which you can use to get your stuff done. Rolling your own solution in a scripting language like R/Python is definitely possible and you'll learn some coding while you're doing that, but is unlikely the fastest thing you can do to get your problem solved.

ADD REPLYlink written 4 weeks ago by WouterDeCoster39k

We need to "merge with overlap/range" then "group by count", searching these terms should help you solve the problem.

ADD REPLYlink written 28 days ago by zx87547.3k

Hello F!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8531/extracting-the-number-snp-in-each-range

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 28 days ago by RamRS21k

Instead of tracing my activities in different forums, you guys please help me :(

ADD REPLYlink written 28 days ago by F3.4k
3

You do not help your self. I have lost count of the number of questions you have asked in recent days/weeks which show absolutely no attempt at the task, nor willingess to learn.

We are not here to write code for you.

ADD REPLYlink written 28 days ago by jrj.healey12k
1

We did help you. There are solutions in this thread. What's wrong with those?

ADD REPLYlink written 28 days ago by WouterDeCoster39k

My brain not working nowadays :( too complicated solutions

ADD REPLYlink written 28 days ago by F3.4k
3

We're glad to put you on the track to a solution, but we're not going to do your work. I estimate that the solution described here would take you 15 minutes of concentration, maximally. By the way, come say hi on slack if you have the time.

ADD REPLYlink written 28 days ago by WouterDeCoster39k
1

Soooo...you just expect others to do your work for you?

ADD REPLYlink written 28 days ago by jrj.healey12k
1
gravatar for F
28 days ago by
F3.4k
Iran
F3.4k wrote:
snp_table <- read.table('WTSI-OESO_121_1pre.copynumber.txt')
genomic_ranges <- read.table('WTSI-OESO_121_1pre.copynumber.caveman.txt.csv', sep = ',', col.names = c('Chromosome', 'Start', 'End'), stringsAsFactors = F)

how_many_in_range <- function(coords){
    coords = as.vector(coords)
    sum(snp_table$Chromosome == coords[1] & snp_table$Position > as.numeric(coords[2]) & snp_table$Position < as.numeric(coords[3]))
}

genomic_ranges$number_of_snps <- apply(genomic_ranges, 1, how_many_in_range)
ADD COMMENTlink modified 28 days ago by RamRS21k • written 28 days ago by F3.4k
3

Thanks for posting a home-grown solution that worked for you! Do note that the other replies suggest solutions that employ widely accepted and optimized tools for exactly the kind of question you were addressing. It's usually more efficient in the long-run to familiarize oneself with standard tools of the field rather than asking other to re-invent a script, as small as it may be.

ADD REPLYlink modified 27 days ago • written 27 days ago by Friederike4.2k
1

Try to verify the other solutions and accept ones that work. Also, it is good practice to link to the original source when you copy-paste code in open forums.

This solution is from Bioinfo SE: https://bioinformatics.stackexchange.com/a/8532/650

ADD REPLYlink modified 28 days ago • written 28 days ago by RamRS21k
5
gravatar for Damian Kao
4 weeks ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You have a .csv file with chromosome, start, end coordinates. You can change that into a .bed file pretty easily. Same with your SNP csv file.

Convert those csv files to bed and then use bedtools intersect.

ADD COMMENTlink written 4 weeks ago by Damian Kao15k

Just in case you're not sure this is what BED looks like. All you need to do is reorder your columns. You can do it in Excel.

ADD REPLYlink written 28 days ago by Emily_Ensembl18k

Just to be complete, the definition of the BED file format simply ask for exactly those three columns, separated by tabs. To turn your second file into a BED file, you would obviously need to add a start column. Note that BED files' coordinates are zero-based, half-open coordinates (explained, for example, here), so it may help precision if you found out which convention was used for the files you have to make the conversion exact.

ADD REPLYlink written 28 days ago by Friederike4.2k
4
gravatar for ATpoint
4 weeks ago by
ATpoint16k
Germany
ATpoint16k wrote:

Save GRanges as BED ( How To Write Data In A Granges Object To A Bed File. ) and then use BEDtools intersect. Have a look at the -c and -wa -wb options. This should not be a problem.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by ATpoint16k
3
gravatar for Alex Reynolds
27 days ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Convert your SNPs from VCF to BED with vcf2bed, and then map them with bedmap --count to count SNPs that overlap the ad-hoc region:

$ vcf2bed < snps.vcf > snps.bed
$ echo -e '1\t51479\t817980' | bedmap --echo --count --delim '\t' - snps.bed > answer.bed

The count will be in the fourth column of answer.bed.

ADD COMMENTlink modified 27 days ago • written 27 days ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1141 users visited in the last hour