Question: Simple clustering of start/stop nt positions in a text file column?
1
gravatar for oliver.bayfield
3.9 years ago by
United States
oliver.bayfield110 wrote:

I have a text file with 4 columns: gi, evalue, start, stop.

There's several duplicates of each gi but they're not identical in coverage, so I want to extract from the text file only certain rows: those HSPs or hits (this a tblastn output) which have the greatest coverage for that gi.
HOWEVER, gi isn't enough of a determinant, because there is often more than one 'top' hsp per genome/gi. That is, there is more than one copy of the gene in a genome. Evalue doesn't seem to be enough a decider here in any scenario so column 2 can be ignored. Once the hits for a particular gi are distinguished based on their genome positions, I will simply take the hit which has the longest coverage (stop-start).

What I would like to do is in this intermediate step is, per gi entry (e.g. below), identify separate groups of hits/hsps based on their start/stop positions, or rather employ some sort of hierarchal clustering to set the rows in groups which could for isntance be denoted in a 5th "ID" column, values 1,2,3 etc.

I don't yet know how to go about it (most of the clustering literature out there seems to go beyond what I need) but for instance, if the first row in a particular gi group was assigned ID 1, if the other rows in that group had start and stop values that differed by more than 600 nt say to those of this one, they would belong to a separate group (i.e. impose a cut-off 600). And so on, with hits not belonging to one used to nucleate other subgrooup (2,3 etc.).

I started to employ numpy and generated a difference array for all start values with all other start values, and the same for stop, but the scripting to go through each row and call on this array was going to get very concoluted.
I know scipy has some clutering algorithms but I can't tell if that's applicable to my problem.

If there is a simpler route you can think of please let me know.

Below is an extract of the data, where column gi is an extract for all values for that particular gi entry. In this entry there is an additional column showing sequence lengths.

 

99036121    0.0    1392057    1390123 1934
99036121    0.0    1392099    1390123 1976
99036121    0.0    1392111    1390123 1988
99036121    0.0    1392123    1390123 2000
99036121    0.0    1392123    1390543 1580
99036121    0.0    1730139    1728823 1316
99036121    0.0    1730139    1728829 1310
99036121    0.0    1768983    1767775 1208
99036121    1e-133    1768950    1767778 1172
99036121    1e-69    1768983    1768216 767
99036121    1e-77    1390509    1390123 386
99036121    1e-83    1768983    1767787 1196
99036121    2e-117    1768563    1767775 788
99036121    2e-58    1768983    1768279 704
99036121    3e-121    1768950    1767775 1175
99036121    3e-123    1768950    1767775 1175
99036121    3e-133    1768983    1767775 1208
99036121    3e-93    1768428    1767775 653
99036121    4e-101    1768950    1767775 1175
99036121    4e-135    1768983    1767775 1208
99036121    4e-136    1768983    1767775 1208
99036121    5e-133    1768983    1767787 1196
99036121    5e-136    1768983    1767775 1208
99036121    6e-112    1768542    1767775 767
99036121    6e-138    1768983    1767775 1208
99036121    7e-96    1768533    1767775 758
99036121    8e-136    1768983    1767775 1208

python clustering blast assembly • 1.2k views
ADD COMMENTlink modified 3.9 years ago by Sean Davis25k • written 3.9 years ago by oliver.bayfield110
2
gravatar for Sean Davis
3.9 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

This is straightforward with Bioconductor GRanges.  The process is roughly:

  • Load data for each GI
  • Create GRanges for each HSP
  • Use reduce() to collapse all overlapping hits into contiguous blocks

The return value, gr1, contains the three contiguous blocks of hits.

> z = read.table(textConnection("99036121    0.0    1392057    1390123 1934
99036121    0.0    1392099    1390123 1976
                              99036121    0.0    1392111    1390123 1988
                              99036121    0.0    1392123    1390123 2000
                              99036121    0.0    1392123    1390543 1580
                              99036121    0.0    1730139    1728823 1316
                              99036121    0.0    1730139    1728829 1310
                              99036121    0.0    1768983    1767775 1208
                              99036121    1e-133    1768950    1767778 1172
                              99036121    1e-69    1768983    1768216 767
                              99036121    1e-77    1390509    1390123 386
                              99036121    1e-83    1768983    1767787 1196
                              99036121    2e-117    1768563    1767775 788
                              99036121    2e-58    1768983    1768279 704
                              99036121    3e-121    1768950    1767775 1175
                              99036121    3e-123    1768950    1767775 1175
                              99036121    3e-133    1768983    1767775 1208
                              99036121    3e-93    1768428    1767775 653
                              99036121    4e-101    1768950    1767775 1175
                              99036121    4e-135    1768983    1767775 1208
                              99036121    4e-136    1768983    1767775 1208
                              99036121    5e-133    1768983    1767787 1196
                              99036121    5e-136    1768983    1767775 1208
                              99036121    6e-112    1768542    1767775 767
                              99036121    6e-138    1768983    1767775 1208
                              99036121    7e-96    1768533    1767775 758
                              99036121    8e-136    1768983    1767775 1208
                              "),header=FALSE)
> library(GenomicRanges)
> # Note that start and end are switched in these example data
> # so you may need to be careful in the next line if this is 
> # not generally true.
> gr = GRanges(seqnames=z[,1],ranges=IRanges(start=z[,4],end=z[,3]))
> gr1 = reduce(gr)
> gr1
GRanges object with 3 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1] 99036121 [1390123, 1392123]      *
  [2] 99036121 [1728823, 1730139]      *
  [3] 99036121 [1767775, 1768983]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Sean Davis25k

That's amazing, thank you. Your comment about start and end, its because this example is on the negative strand, will it be able to account for HSPs on the either strand? Is this is the install?
http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

ADD REPLYlink written 3.9 years ago by oliver.bayfield110
1

I ignored strand, but you can use strand in the constructor.  You'll need to make sure that start is <= end, though, even in the case of reverse strand.

ADD REPLYlink written 3.9 years ago by Sean Davis25k

I see examples where you can manually define strand=() but how do you get IRanges to look in z[,3] or z[,4] for start and end depending on which strand each row represents? I can introduce an extra field [,6] of +/- based on the difference between [,3] and [,4]. Is it possible to put some 'if' statements in?
Alternatively I can readjust my input to make sure [,3] is always less than [,4], put a another column in with +/- to preserve the strand data, and then swaps things back once I've got the output.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110

Your latter approach is the way to go.  I'd suggest a small function that takes your data.frame as read from disk and returns a GRanges object and a second function that takes a GRanges object and returns a data.frame that conforms to whatever you need for output back to disk.  For a technical detail, take a look at the ifelse() function.

ADD REPLYlink written 3.9 years ago by Sean Davis25k

Thanks, I went with the latter approach and it did the job.

ADD REPLYlink written 3.9 years ago by oliver.bayfield110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1408 users visited in the last hour