How to visualize a range of positions vs a k-mer?
7.7 years ago
manetsus ▴ 40

I have some k-mers. For each k-mer I have a range of positions in the genome. I want to visualize these to see which ranges are messier, how the ranges are scattered, and so on.

So, I have to plot k-mer vs. range of positions.

My Thinking:

I would convert each k-mer to its corresponding integer, using 2 bits per nucleotide.

My data would then be in the following format (CSV):
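A minimal sketch of that 2-bit encoding, assuming the mapping A=0, C=1, G=2, T=3. The mapping and the function names `encode_kmer` and `decode_kmer` are illustrative assumptions, not something fixed by the data:

```python
# Assumed mapping (not specified in the post): A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per nucleotide."""
    value = 0
    for nt in kmer.upper():
        value = (value << 2) | CODE[nt]
    return value

def decode_kmer(value, k):
    """Recover the k-mer string from its integer encoding."""
    out = []
    for _ in range(k):
        out.append(BASES[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(out))
```

With k = 15 the largest possible value is 2^30 - 1, so every encoded k-mer fits in 30 bits.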

corresponding integer of a k-mer, starting of the range, ending of the range

What I have tried:

I have tried to plot them using Python. But because the positions and the mapped integers are all large numbers, it could not plot even a single point.

Data Range:

The value of k is 15, so each k-mer takes 30 bits in binary.

The positions are of the order of 10^6.

I have 392938 records in my file.

Could you please suggest a tool or code to visualize or plot this?

Note that:

More specifically, I want to see which minimizer covers which range. It is possible that a minimizer is covering a small portion. It is also possible that a minimizer is covering a large range.

visualization • 2.2k views
7.7 years ago
Asaf 10k

Try to give some more motivation - it's not clear what you are looking for. Some thoughts though:

  • You don't have to keep or plot end positions
  • You won't be able to interpret much by plotting so many points
  • Try splitting the genome into chunks, applying statistical methods like chi-square or ANOVA, and plotting their results
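As a rough sketch of the chunking idea, the following groups ranges by the genome chunk their start falls in and reports a count and mean range length per chunk. The function name `summarize_by_chunk` and the choice to bin by start position are assumptions made for illustration; the per-chunk numbers could then feed a statistical test or a single summary plot.

```python
from collections import defaultdict

def summarize_by_chunk(ranges, chunk_size):
    """Return {chunk_index: (number_of_ranges, mean_range_length)}.

    ranges: iterable of (start, end) pairs; each range is assigned to the
    chunk containing its start position (an illustrative choice).
    """
    totals = defaultdict(lambda: [0, 0])  # chunk -> [count, summed length]
    for start, end in ranges:
        chunk = start // chunk_size
        totals[chunk][0] += 1
        totals[chunk][1] += end - start
    return {c: (n, total / n) for c, (n, total) in sorted(totals.items())}
```

This reduces the number of plotted values from one per range to one per chunk.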

Splitting is a good idea, but in that case I would have to plot a lot of figures.

7.7 years ago
John 13k

If you want to visualise the LOCATION of k-mers, I wouldn't bother with saving anything other than the k-mer start position. If all k-mers are 15bp then that's all you need/want :)

There are a million ways to display such positional data - I'd probably use existing packages that work for SNPs/indels.


The k-mer location is not important; the range is what matters here. More specifically, I want to see which minimizer covers which range. It is possible that a minimizer covers only a small portion. It is also possible that a minimizer covers a large range.


EDIT: I think I get it now!

I have some k-mers. For each k-mer I have a range of positions in the genome. I have to visualize it to analyze which range is more messy, how these range are scattered etc.

I would separate each k-mer list into its own file of start/stop sites (BED format). No need for 2-bit encoded DNA, since within each file the k-mer is always the same.

I would then pile up the intervals to give a WIG or BigWig format file, that is, start/stop/depth records.

I would then, finally, look at the signal distribution: the counts of bases covered by 1 interval, 2 intervals, 3 intervals, etc.

Assuming you have 10 or so k-mers, I would plot all of these distributions on the same axis.
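The pile-up and depth-distribution steps can be sketched in plain Python with a sweep over interval endpoints. This is only an illustration under assumptions: intervals are half-open (start, end) pairs as in BED, and `depth_histogram` is a hypothetical helper name, not part of any existing package.

```python
from collections import Counter

def depth_histogram(intervals):
    """Return a Counter mapping depth -> number of bases covered at that depth.

    intervals: list of (start, end) pairs, half-open as in BED.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # an interval opens
        events.append((end, -1))    # an interval closes
    events.sort()

    hist = Counter()
    depth = 0
    prev = None
    for pos, delta in events:
        # Credit the stretch since the previous event to the current depth.
        if prev is not None and depth > 0 and pos > prev:
            hist[depth] += pos - prev
        depth += delta
        prev = pos
    return hist
```

Running this once per k-mer file gives one small histogram per k-mer, which is what would be overlaid on a shared axis.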

Finally, minimization seems kind of stupid to me. Choosing the lowest-ASCII-sort k-mer from a read is a pretty good way to find the k-mer with the lowest complexity (AAAAAAAAAAAAACGT), which I appreciate does have benefits for memory/storage requirements, but it's a perversion of the data. They should use the most complex k-mer from the read, which gives the same benefit of adjacent reads sharing a minimizer but with enough complexity to actually do something with it. Non-adjacent reads would also be less likely to share a minimizer, which is a good thing.
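To make the point concrete, here is a small sketch of picking the lexicographically smallest k-mer from a sequence. It assumes plain string ordering and ignores reverse complements; `lexicographic_minimizer` is an illustrative name, and real minimizer implementations may order k-mers differently.

```python
def lexicographic_minimizer(seq, k):
    """Return the smallest k-mer of seq under plain string (ASCII) ordering."""
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, with `seq = "GATTACAAAAACGT"` and `k = 4`, the selected k-mer is the homopolymer `AAAA`, which shows how this ordering gravitates toward A-rich, low-complexity k-mers.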
