How to visualize a range of positions vs a k-mer?
8.0 years ago
manetsus ▴ 40

I have some k-mers. For each k-mer I have a range of positions in the genome. I need to visualize these ranges to analyze which ranges are messier, how the ranges are scattered, etc.

So, I have to plot each k-mer against its range of positions.

### My Thinking:

I would convert each k-mer to its corresponding integer, using 2 bits per nucleotide.
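
A minimal sketch of that 2-bit encoding, in case it helps to make the idea concrete. The function names and the A/C/G/T to 0/1/2/3 mapping are assumptions for illustration; any fixed mapping would work the same way.

```python
# Assumed mapping: A=0, C=1, G=2, T=3 (any fixed 2-bit code would do).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per nucleotide.

    The first base ends up in the most significant bits.
    """
    value = 0
    for base in kmer.upper():
        value = (value << 2) | CODE[base]
    return value


def decode_kmer(value, k):
    """Inverse of encode_kmer for a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))
```

With k = 15 the largest possible value is `encode_kmer("T" * 15)`, which is 2^30 - 1, so every k-mer fits in 30 bits.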

Then I would have data in the following CSV format:

corresponding integer of the k-mer, start of the range, end of the range


### What I have tried:

I have tried to plot them using Python. But because the positions and the mapped integers are all large numbers, it could not plot even a single point.

### Data Range:

The value of k is 15, so each k-mer takes 30 bits in binary.

The positions are of the order of 10^6.

I have 392938 records in my file.

Could you please suggest any tool or code to visualize or plot this?

### Note that:

More specifically, I want to see which minimizer covers which range. It is possible that a minimizer is covering a small portion. It is also possible that a minimizer is covering a large range.
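
One possible way to prepare such data for a "which minimizer covers which range" plot is sketched below. It assumes a headerless CSV with the three columns described above and treats the length of a range as `end - start`; the function names are hypothetical. Each distinct k-mer is given a small row index, so the y-axis runs from 0 to n - 1 instead of over 30-bit integers.

```python
import csv
import io


def load_ranges(handle):
    """Read (kmer_int, start, end) rows from a headerless CSV file-like object."""
    rows = []
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        kmer, start, end = (int(field) for field in record[:3])
        rows.append((kmer, start, end))
    return rows


def layout_segments(rows):
    """Sort ranges by start and give each distinct k-mer a dense row index.

    Returns (row_index, start, end, length) tuples, where length is
    end - start (an assumption about how the range is defined).
    """
    row_of = {}
    segments = []
    for kmer, start, end in sorted(rows, key=lambda r: (r[1], r[2])):
        row = row_of.setdefault(kmer, len(row_of))
        segments.append((row, start, end, end - start))
    return segments


# Tiny made-up example, not real data.
example = io.StringIO("1000000000,500,900\n27,100,2100\n27,5000,5200\n")
segments = layout_segments(load_ranges(example))
```

Each tuple can then be drawn as one horizontal line, for example with matplotlib's `hlines(y, xmin, xmax)`, and the length column can be sorted or histogrammed to see which minimizers cover small or large ranges.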

visualization • 2.3k views
2
8.0 years ago
Asaf 10k

Try to give some more motivation - it's not clear what you are looking for. Some thoughts though:

• You don't have to keep or plot end positions
• You won't be able to interpret a lot by plotting so many points
• Try to split the genome into chunks and use statistical methods like chi-square or ANOVA and plot their results
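
A rough sketch of the chunking idea, assuming 0-based start positions, fixed-size chunks, and a uniform expectation across chunks (the last chunk may be shorter than the others, which this sketch ignores). The function names are hypothetical, and only the chi-square statistic is computed, not a p-value.

```python
def chunk_counts(starts, genome_length, chunk_size):
    """Count how many start positions fall into each fixed-size chunk."""
    n_chunks = -(-genome_length // chunk_size)  # ceiling division
    counts = [0] * n_chunks
    for pos in starts:
        counts[pos // chunk_size] += 1
    return counts


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts per chunk."""
    total = sum(counts)
    if total == 0:
        return 0.0
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)
```

A large statistic suggests the k-mer positions are unevenly spread over the chunks; plotting one statistic per k-mer gives a single figure instead of one figure per chunk.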
0

Splitting is a good idea, but in this case, I would have to plot so many figures.

2
8.0 years ago
John 13k

If you want to visualise the LOCATION of k-mers, I wouldn't bother with saving anything other than the k-mer start position. If all k-mers are 15bp then that's all you need/want :)

There are a million ways to display such positional data - I'd probably use existing packages that work for SNPs/indels.

0

The k-mer location is not important; the range is what matters here. More specifically, I want to see which minimizer covers which range. It is possible that a minimizer is covering a small portion. It is also possible that a minimizer is covering a large range.

0

EDIT: I think I get it now!

> I have some k-mers. For each k-mer I have a range of positions in the genome. I have to visualize it to analyze which range is more messy, how these range are scattered etc.

I would separate each kmer-list into its own file of start/stop sites (BED format). No need for 2-bit-encoded DNA, since it's all the same k-mer within a file.

I would then pile up the intervals to give a WIG or BigWig format file, i.e. start/stop/depth.

I would then, finally, look at the signal distribution: the counts of bases covered by 1 interval, 2 intervals, 3 intervals, etc.

Assuming you have 10 or so kmers, I would plot all of these distributions on the same axis.
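
The pile-up and depth distribution can be sketched in plain Python for small inputs. This assumes intervals are 0-based and half-open as in BED; the function name is hypothetical, and for real data a dedicated tool would likely be faster.

```python
from collections import Counter


def depth_histogram(intervals):
    """Count how many bases are covered by exactly d intervals, for each d >= 1.

    Intervals are (start, end) pairs, assumed 0-based and half-open.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # coverage goes up at the start
        events.append((end, -1))    # and down again at the end
    events.sort()

    histogram = Counter()
    depth = 0
    previous = None
    for position, delta in events:
        # Bases between the previous event and this one share the same depth.
        if previous is not None and depth > 0 and position > previous:
            histogram[depth] += position - previous
        depth += delta
        previous = position
    return dict(histogram)
```

Running this once per k-mer file gives one depth distribution per k-mer, which is what would go on the shared axis.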

Finally, minimization seems kind of stupid to me. Choosing the lowest-ASCII-sort k-mer from a read is a pretty good way to find the k-mer with the lowest complexity (AAAAAAAAAAAAACGT), which I appreciate does have benefits for memory-storage requirements, but it's a perversion of the data. They should use the most complex k-mer from the read instead. That would give the same benefit of adjacent reads sharing a minimizer, but with enough complexity to actually do something with it. Non-adjacent reads would also be less likely to share a minimizer, which is a good thing.
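
To make the comparison concrete, here is a small sketch of both choices over the k-mers of a single read. Using Shannon entropy of the base composition as the "complexity" measure is an assumption for illustration, not something any particular tool is known to do, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def lexicographic_minimizer(read, k):
    """The k-mer that sorts first in ASCII order."""
    return min(kmers(read, k))


def shannon_entropy(kmer):
    """Entropy (in bits) of the base composition; 0 for a homopolymer."""
    n = len(kmer)
    return -sum((c / n) * math.log2(c / n) for c in Counter(kmer).values())


def most_complex_kmer(read, k):
    """The k-mer with the highest entropy; ties are broken in ASCII order."""
    return min(kmers(read, k), key=lambda s: (-shannon_entropy(s), s))
```

For the made-up read `GGCATAAAAACGTT` with k = 4, the lexicographic choice is the homopolymer `AAAA`, while the entropy-based choice is `ACGT`.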