Question

What Are Some Less Known Yet Simple And Powerful Bioinformatics Data Analysis Steps That You Commonly Use?

13

12.5 years ago

Istvan Albert 100k

Here is something I do but I don't think it is well known.

To find the actual DNA fragment size in a single-end ChIP-Seq experiment, you can successively shift the positions of the reads mapped to one strand and, for each shift, count how many of them exactly match a read position on the other strand. The count reaches a maximum at the actual fragment size.

In the plot below, the nucleosome reads had already been corrected for a 146 bp fragment (so we expected the peak at 0), but the actual fragments appear to be about 15 bp longer, so the correction will need to be reapplied. But there is more: you can see the repeating nature of the nucleosomes (at large shifts you start hitting the next nucleosome), and thus you can read off the typical nucleosome+linker length of 170 bp or so. I have found this plot to be the best judge of whether a nucleosome digestion/isolation experiment was successful.

[Figure: number of exact matches versus shift]
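The shifting procedure described in this answer can be sketched roughly as follows. This is a minimal illustration, not the author's script: the function names and the toy positions are invented, and it assumes each read is represented by the coordinate of its 5' end (the leftmost base for forward-strand reads, the rightmost base for reverse-strand reads) on a single chromosome.

```python
from collections import Counter


def shift_match_counts(forward_pos, reverse_pos, max_shift):
    """For each shift, count forward/reverse read pairs whose positions
    coincide exactly once the forward position is moved by that shift."""
    forward = Counter(forward_pos)
    reverse = Counter(reverse_pos)
    counts = {}
    for shift in range(max_shift + 1):
        # Counter returns 0 for positions that have no reverse-strand read.
        counts[shift] = sum(n * reverse[pos + shift] for pos, n in forward.items())
    return counts


def estimate_fragment_size(counts):
    """Return the shift with the highest match count."""
    return max(counts, key=counts.get)


# Toy data (invented for illustration): fragments of about 160 bp.
forward_pos = [100, 300, 500, 500]
reverse_pos = [260, 460, 660, 661]

counts = shift_match_counts(forward_pos, reverse_pos, max_shift=200)
print(estimate_fragment_size(counts))  # prints 160
```

On real data you would plot `counts` against the shift rather than only take the maximum, since the secondary peaks are what reveal the nucleosome repeat length.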

ADD COMMENT • link updated 12.1 years ago by brentp 24k • written 12.5 years ago by Istvan Albert 100k

1

could you go into a bit more detail on how shifting a mapped read on one strand to see how many match(?) up on other strand gives insight into fragment length? what do you mean by matching (overlap?)

ADD REPLY • link 12.3 years ago by Ying W ★ 4.2k

1

look here: https://github.com/ialbert/bioawk-tools/blob/master/chipfrag.awk

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by Istvan Albert 100k

0

a colleague asked if this was real data. he said it looked "too smooth"

ADD REPLY • link 12.5 years ago by Jeremy Leipzig 22k

0

It is real data (though I did pick one of the nicest ;-) ). The line is a loess fit but the points are real. Note that there are 300,000 to 400,000 counts per shift, so the positioning errors get averaged out.

ADD REPLY • link 12.5 years ago by Istvan Albert 100k

0

Ha, you know, I had to check: I wrote this a while ago and nowadays I just use it ;-). It is actually a loess fit that is shown here; indeed, it is too smooth to be the original data.

ADD REPLY • link 12.5 years ago by Istvan Albert 100k

0

thanks, I've used this successfully after reading it here.

ADD REPLY • link 12.3 years ago by brentp 24k

score 3 · Answer 1 · 2011-10-07

For a long while I was interested in larger, genome-wide organization - like that of chromosomal elements. Almost on a daily basis I would plot dots and draw loops - tools were any of several dot plotter programs and Miropeats. It was this attention to detail that led to two Cell papers on a centromere-like region on the short arm of Arabidopsis chromosome 4. Fig 4 of one of those papers shows results of both of these tools.

One thing that was nice about this type of work was the range of view - from single base pairs (to define begin and end of a repeat or other element) to the wide view of genome/chromosome organization.

score 2 · Answer 2 · 2012-01-10

2

12.3 years ago

brentp 24k

Given a set of samples with males and females, take only the data from the Y chromosome and do a PCA plot.

If everything is OK, there should be nice, distinct groups for males and females.

If samples are mislabeled, males will appear in the female cluster, or vice-versa.

You can also spot outliers, which should likely be removed from the analysis, such as the green outlier in the figure below:

[Figure: PCA plot]
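As a rough sketch of the check described above, the following computes the first two principal component scores for a samples-by-probes matrix restricted to Y-chromosome probes. Everything here is an assumption for illustration: the intensity values are invented, the function names are hypothetical, and the PCA is done with a plain power iteration using only the standard library so that it runs anywhere. In practice you would more likely use an SVD or PCA routine from a numerical library.

```python
import math


def center_columns(rows):
    """Subtract each probe's (column's) mean across samples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def top_eigenpair(mat, iters=300):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    # Non-constant start: a constant vector lies in the null space of a
    # Gram matrix built from column-centered data.
    vec = [float(i + 1) for i in range(len(mat))]
    for _ in range(iters):
        new = matvec(mat, vec)
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            return 0.0, [0.0] * len(mat)
        vec = [x / norm for x in new]
    eigval = sum(v * w for v, w in zip(vec, matvec(mat, vec)))
    return eigval, vec


def pca_scores(rows, n_components=2):
    """Return one list of principal component scores per sample."""
    centered = center_columns(rows)
    # Sample-by-sample Gram matrix; its eigenvectors give the sample scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    components = []
    for _ in range(n_components):
        eigval, vec = top_eigenpair(gram)
        scale = math.sqrt(max(eigval, 0.0))
        components.append([scale * v for v in vec])
        # Deflate: remove the component just found before looking for the next.
        gram = [[g - eigval * vi * vj for g, vj in zip(row, vec)]
                for row, vi in zip(gram, vec)]
    return [list(sample) for sample in zip(*components)]


# Toy Y-chromosome probe intensities (invented): 3 males, then 3 females.
intensities = [
    [10.0, 9.5, 10.2, 9.8], [9.7, 10.1, 9.9, 10.3], [10.4, 9.9, 10.0, 9.6],
    [0.3, 0.1, 0.2, 0.4], [0.2, 0.4, 0.1, 0.3], [0.1, 0.2, 0.3, 0.2],
]
labels = ["M", "M", "M", "F", "F", "F"]

scores = pca_scores(intensities)
for label, (pc1, pc2) in zip(labels, scores):
    print(label, round(pc1, 2), round(pc2, 2))
```

The sign of each component is arbitrary, so what matters is that the two sexes fall on opposite sides of the first component. A sample whose label disagrees with the side it lands on is a candidate for mislabeling.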

ADD COMMENT • link 12.3 years ago by brentp 24k

0

Hi Brent, we recalled this interesting post of yours here: http://www.biostars.org/post/show/51503/ - but then it occurred to us that we are not sure what is actually being plotted. Would you care to comment?

ADD REPLY • link 11.7 years ago by Istvan Albert 100k

0

It's a PCA plot--but only on probes from the sex chromosomes. Each point is a sample. The X-axis is the 1st principal component, the Y-axis is the 2nd principal component. I'll add a comment in the linked discussion as well.

ADD REPLY • link 11.7 years ago by brentp 24k