Question: HOWTO extract regions from GRanges object
0
gravatar for Assa Yeroslaviz
22 months ago by
Assa Yeroslaviz1.1k
Munich
Assa Yeroslaviz1.1k wrote:

Hi,

I have a small set of bam files from centromere regions of S. cerevisiae. Those I have extracted from the complete bam files using the samtools view option based on a list of centromere positions as shown below. I have used the centromere middle point to go forward and reverse 30kb and extract the regions.

chromosome    CEN-[30kb]    CENmitte    CEN+[30kb]
I    121500    151500    181500
II    208300    238300    268300
III    84500    114500    144500
IV    419700    449700    479700 ...

I would like to see how many reads are mapped to each of the positions in these regions. This i would like to count into 50b bins of genome regions. For that I have used the tileGenome command to create a virtual GRanges object as shown below (this is why it has no actual information yet).

bins1 <- tileGenome(seqinfo(Scerevisiae), tilewidth=500, cut.last.tile.in.chrom=TRUE)
GRanges object with 24322 ranges and 0 metadata columns:
          seqnames         ranges strand
             <Rle>      <IRanges>  <Rle>
      [1]     chrI   [   1,  500]      *
      [2]     chrI   [ 501, 1000]      *
      [3]     chrI   [1001, 1500]      *
      [4]     chrI   [1501, 2000]      *
      [5]     chrI   [2001, 2500]      *
      ...      ...            ...    ...
  [24318]     chrM [83501, 84000]      *
  [24319]     chrM [84001, 84500]      *
  [24320]     chrM [84501, 85000]      *
  [24321]     chrM [85001, 85500]      *
  [24322]     chrM [85501, 85779]      *
  -------
  seqinfo: 17 sequences (1 circular) from sacCer3 genome

using the summarizeOverlaps command I can now summarize exactly how many reads are mapped into each of my bins in the genome

bamFiles <- list.files(path = ~/centromereBamFIles.withchrIII/", pattern = ".bam$", full.names = TRUE)
olapTable <- summarizeOverlaps(bins1, bamFiles, inter.feature=FALSE)

But the problem is that I still have the complete genome in the bins1 GRanges object. But I would like it to contain only the regions I am using for the centromeres.

Hence my question, is there a way to extract multiple regions from the GRanges object? I would like to extract the centromere regions (± a few 20kb) of each chromosome. e.g. these regions:

chrI:101500-201500
chrII:188300-288300
chrIII:64500-164500

But I can't find a way to do it?

I know I can remove all the rows with a sum of 0, but with that I will also remove the rows from the centromere regions where no reads were mapped. These I need to keep in my GRanges object.. So I will try to work it out with restrict or findOverlaps before I run the summarizeOverlaps with the bam files

Is there a more direct way to do it?

thanks

Assa

ADD COMMENTlink modified 22 months ago by geek_y8.1k • written 22 months ago by Assa Yeroslaviz1.1k

Do you have centromere coordinates ? What do you mean by "I would like to extract the centromere regions (± a few 20kb) of each chromosome"

ADD REPLYlink written 22 months ago by geek_y8.1k
2
gravatar for geek_y
22 months ago by
geek_y8.1k
Barcelona/London
geek_y8.1k wrote:

It should be simple with bedtools.

regions.bed

chrI    101500    201500
chrII    188300    288300
chrIII    64500    164500

Try:

bedtools makewindows -b regions.bed  -w 500 | less

To get the SAF format for featureCounts:

bedtools makewindows -b regions.bed  -w 500 | awk -v OFS="\t" '{print $1":"$2"-"$3,$1,$2,$3,"+"}' > regions.saf

You don't need to first generate all the bins and then subset regions. You can use the regions.saf file with featureCounts to get the count matrix.

ADD COMMENTlink modified 22 months ago • written 22 months ago by geek_y8.1k

thanks. I haven't thought of doing it like that.

 

ADD REPLYlink written 22 months ago by Assa Yeroslaviz1.1k
1
gravatar for Michael Dondrup
22 months ago by
Bergen, Norway
Michael Dondrup43k wrote:

It is not clear why you would like to do this. Your GRanges do not contain any additional information, so there won't be anything to extract except the bins themselves. However there is no sequence information for the centromere itself I guess. So let's assume it's a toy example. There are two ways to approach this:

  • findOverlaps: find overlaps between the GRanges and the centromeric regions, also converted to GRanges
  • restrict: restrict the GRanges  to the centromeric regions, will also truncate some ranges potentially

See the documentation of those functions. 

ADD COMMENTlink written 22 months ago by Michael Dondrup43k

sorry yes I can see that the question might be a bit unclear. i have worked it out and made it more detailed, so that it will be easier to understand.

sorry about that.

ADD REPLYlink written 22 months ago by Assa Yeroslaviz1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1448 users visited in the last hour