Question

Increasing widths of the peaks in a bed file from ChIP-seq

0

8.4 years ago

morovatunc ▴ 560

Hello Biostars community,

I have generated a BED file from a ChIP-seq experiment and would like to keep only the mutations that fall within its regions. However, I found fewer mutations than I expected.

My initial idea is to widen each peak by 500 bp on both sides, but I have a concern before doing so. I don't know whether subtracting 500 from the start and adding 500 to the end will work, because a peak might sit at the edge of a chromosome. Could someone explain how positions on the chromosomes are numbered?

1) Does the GRCh37 assembly start at 0 and run continuously to the end of the mitochondrial genome? Or 2) does each chromosome start at 0 and end somewhere, with the next chromosome starting again from the beginning?

I know this is a very fundamental question, but sometimes we do complicated work and forget the basics.

Thank you for the help,

Best,

Tunc.

ChIP-Seq vcftools mutation calling • 1.8k views

ADD COMMENT • link updated 8.4 years ago by Devon Ryan 104k • written 8.4 years ago by morovatunc ▴ 560

1

Each chromosome starts at 1 and runs until the end of the chromosome. The next chromosome will begin again at 1.

For example:

chr1: 1 - 197,195,432
chr2: 1 - 181,748,087

You can get chromosome sizes from UCSC. The human ones are here

ADD REPLY • link 8.4 years ago by jotan ★ 1.3k

0

Thank you for the answer.

ADD REPLY • link 8.4 years ago by morovatunc ▴ 560

score 2 · Accepted Answer · 2016-05-31

2

8.4 years ago

Devon Ryan 104k

bedtools slop can properly handle extending your regions at the end of chromosomes/contigs. All chromosomes and contigs start at 0 (or 1, depending on how you count).
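
To illustrate what that clamping amounts to, here is a minimal Python sketch of the same idea. It is not bedtools' actual code, only an approximation of the behaviour of a command such as `bedtools slop -i peaks.bed -g genome.txt -b 500`, where `peaks.bed` and `genome.txt` are placeholder file names and the genome file lists chromosome names and lengths. The function name `slop` and the example peaks are made up for illustration; the two chromosome lengths are the hg19/GRCh37 values.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (chrom, start, end) in BED convention: 0-based start, end not included
Interval = Tuple[str, int, int]


def slop(intervals: Iterable[Interval],
         chrom_sizes: Dict[str, int],
         pad: int = 500) -> Iterator[Interval]:
    """Extend each interval by `pad` bp on both sides, clamped to [0, chromosome length]."""
    for chrom, start, end in intervals:
        # A KeyError here means the chromosome names do not match the sizes table
        size = chrom_sizes[chrom]
        yield chrom, max(0, start - pad), min(size, end + pad)


# hg19/GRCh37 lengths for two chromosomes; in practice, read them from a genome file
chrom_sizes = {"chr1": 249250621, "chr21": 48129895}

# Hypothetical peaks: one near the start of chr1, one near the end of chr21
peaks = [("chr1", 200, 900), ("chr21", 48129500, 48129800)]

padded = list(slop(peaks, chrom_sizes))
for chrom, start, end in padded:
    print(chrom, start, end, sep="\t")
```

The first peak cannot start before 0 and the second cannot end past the chromosome length, so neither is extended by the full 500 bp on that side.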

ADD COMMENT • link 8.4 years ago by Devon Ryan 104k

0

Thank you for the answer. I assume this counting differs between hg19 and GRCh37, right? I mean whether positions start from 0 or 1.

ADD REPLY • link 8.4 years ago by morovatunc ▴ 560

1

It only differs between file types (e.g., BED files start at 0 and GTF files at 1). Base 1 in a fasta file is always the first one, regardless of whether it's hg19 or GRCh37, which are essentially the same thing (there might be a difference in the mitochondrial sequence, and at least UCSC's version of hg19 merged a bunch of contigs together into *_random supercontigs).
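
Since the original goal was to filter mutations by peak regions, a small sketch of the two conventions may help. It assumes BED intervals are 0-based and half-open, and that variant positions are 1-based (as in VCF); the helper names are made up for illustration.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """BED (0-based, end excluded) -> 1-based, end included (GTF-style)."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based, end included (GTF-style) -> BED (0-based, end excluded)."""
    return start - 1, end


def pos_in_bed_interval(pos: int, bed_start: int, bed_end: int) -> bool:
    """True if a 1-based position (e.g. a VCF POS) lies inside a BED interval."""
    return bed_start < pos <= bed_end


# The first 100 bases of a chromosome are 0-100 in BED and 1-100 in GTF
print(bed_to_one_based(0, 100))
print(one_based_to_bed(1, 100))
print(pos_in_bed_interval(100, 0, 100), pos_in_bed_interval(101, 0, 100))
```

An off-by-one in this comparison only affects mutations sitting exactly on a peak boundary, so it is unlikely to explain a large shortfall on its own.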

ADD REPLY • link 8.4 years ago by Devon Ryan 104k

0

Thank you for the response.

ADD REPLY • link 8.4 years ago by morovatunc ▴ 560

0

Dear Devon,

I am really confused by the differences between the annotations, even though I read the following post: Hg19 Versus Grch37

My data is based on:

Genome Build: GRCh37
Reference Name: hs37d5

For example, bedtools has a MySQL connection to hg19 data. But when I checked the differences between the regions in my files, the only differences seem to be the "chr" prefix on chromosome names and the naming of unplaced contigs (e.g. "GL0190" versus 'chr19_gl190'; I just made these names up). Other than that, the start and end positions are the same.

Could you point me to some resources so that I can see the whole picture?

Best,

Tunc.

ADD REPLY • link 8.4 years ago by morovatunc ▴ 560

0

There's nothing really to read on this. Different sources use different chromosome names; as far as the sequence goes, that's the only big difference. If you need to know how the names relate, I have the mappings here. hg19 is UCSC, GRCh37 is Ensembl/Gencode.
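
As a rough illustration of how the two naming schemes relate, here is a Python sketch that converts UCSC-style hg19 names to GRCh37-style names. The function `ucsc_to_grch37` is hypothetical, and the contig rule (taking the `gl000191` part and writing it as `GL000191.1`) is an assumption based on the usual naming pattern, so check the result against a proper mapping table before relying on it.

```python
import re
from typing import Optional

# Unplaced/unlocalized contigs, e.g. chrUn_gl000220 or chr1_gl000191_random
_CONTIG = re.compile(r"^chr(?:Un|[0-9XY]+)_gl(\d{6})(?:_random)?$")
# Primary chromosomes chr1..chr22, chrX, chrY
_PRIMARY = re.compile(r"^chr(?:[1-9]|1[0-9]|2[0-2]|X|Y)$")


def ucsc_to_grch37(name: str) -> Optional[str]:
    """Map a UCSC hg19 sequence name to a GRCh37-style name, or None if unknown."""
    if name == "chrM":
        # Name mapping only: the mitochondrial sequences themselves may differ
        return "MT"
    match = _CONTIG.match(name)
    if match:
        # Assumption: GRCh37 contig accessions all carry a ".1" version suffix
        return f"GL{match.group(1)}.1"
    if _PRIMARY.match(name):
        return name[3:]  # drop the "chr" prefix
    return None  # e.g. haplotype contigs, which this sketch does not handle


for ucsc_name in ["chr1", "chrX", "chrM", "chr1_gl000191_random", "chrUn_gl000220"]:
    print(ucsc_name, ucsc_to_grch37(ucsc_name), sep="\t")
```

Returning `None` for unrecognised names makes it easy to spot regions that would otherwise be dropped silently when the two files use different names.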

ADD REPLY • link 8.4 years ago by Devon Ryan 104k

0

BTW, you're using a modified version of hg19 from the 1000 genomes project. I think Heng Li has a blog post or two on the decoy sequences in hs37d5.

ADD REPLY • link 8.4 years ago by Devon Ryan 104k