Clojure for Data Science
Visualizing the dwell times

We can plot a histogram of dwell times by simply extracting the :dwell-time column with i/$:

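;; Assumes incanter.core is aliased as i and incanter.charts as c,
;; and that load-data is defined elsewhere in the project.
(defn ex-2-2 []
  ;; Extract the :dwell-time column, bin it into 50 bars, and display the chart
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 50)
      (i/view)))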

The preceding code generates the following histogram:

[Figure: histogram of dwell times]

This is clearly not normally distributed data, nor even a heavily skewed normal distribution. There is no tail to the left of the peak (a visitor clearly can't be on our site for less than zero seconds). While the data tails off steeply to the right at first, it extends much further along the x axis than we would expect from normally distributed data.

When confronted with distributions like this, where values are mostly small but occasionally extreme, it can be useful to plot the y axis on a log scale. Log scales are used to represent quantities that cover a very large range. Chart axes are ordinarily linear: they partition a range into equally sized steps, like the "number line" we learned at school. Log scales partition the range into steps that get larger and larger the further they are from the origin.

Some systems of measurement for natural phenomena that cover a very large range are represented on a log scale. For example, the Richter magnitude scale for earthquakes is a base-10 logarithmic scale, which means that an earthquake measuring 5 on the Richter scale has 10 times the shaking amplitude of an earthquake measuring 4. The decibel scale is also a logarithmic scale with a different base: a sound of 30 decibels has 10 times the intensity of a sound of 20 decibels. In each case, the principle is the same: the use of a log scale allows a very large range of values to be compressed into a much smaller range.
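To make the compression concrete, here is a minimal sketch in plain Clojure (no Incanter required). The name log10-positions is hypothetical and chosen only for illustration; it maps each value to the position it would occupy on a base-10 log axis.

```clojure
;; Illustrative sketch: where values fall on a base-10 log axis.
;; `log10-positions` is a hypothetical helper, not part of Incanter.
(defn log10-positions
  "Returns the base-10 logarithm of each positive number in xs."
  [xs]
  (map #(Math/log10 (double %)) xs))

;; Values spanning six orders of magnitude...
(def raw-values [1 10 100 1000 10000 100000 1000000])

;; ...land on evenly spaced positions from 0 to 6 on the log axis.
(def positions (log10-positions raw-values))
```

A range from one to one million is squeezed into the interval from 0 to 6, with each factor of 10 taking up the same amount of space.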

Plotting the y axis on a log scale is simple in Incanter using c/set-axis together with c/log-axis:

(defn ex-2-3 []
  (-> (i/$ :dwell-time (load-data "dwell-times.tsv"))
      (c/histogram :x-label "Dwell time (s)"
                   :nbins 20)
      ;; Replace the default linear y axis with a logarithmic one
      (c/set-axis :y (c/log-axis :label "Log Frequency"))
      (i/view)))

By default Incanter will use a base-10 log scale, meaning that each tick on the axis represents a range that is 10 times the previous step. A chart like this—where only one axis is shown on a log scale—is called log-linear. Unsurprisingly, a chart showing two log axes is called a log-log chart.
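To see what a base-10 log axis does to the bar heights, the following sketch bins some made-up values and takes the base-10 logarithm of each bin count. It uses plain Clojure rather than Incanter; bin-counts, log-frequencies, and sample-times are hypothetical names, and the data is invented for illustration rather than taken from dwell-times.tsv.

```clojure
;; Sketch only: made-up data, not the real dwell times.
(defn bin-counts
  "Counts how many integer values fall into each bin of the given width.
   Returns a sorted map of bin start -> count."
  [width xs]
  (into (sorted-map)
        (frequencies (map #(* width (quot % width)) xs))))

(defn log-frequencies
  "Replaces each bin count with its base-10 logarithm."
  [counts-by-bin]
  (into (sorted-map)
        (map (fn [[bin n]] [bin (Math/log10 (double n))]) counts-by-bin)))

;; 100 short visits, 10 medium visits, and 1 long visit
(def sample-times (concat (repeat 100 3) (repeat 10 14) (repeat 1 27)))

(def counts-by-bin (bin-counts 10 sample-times))
(def log-counts (log-frequencies counts-by-bin))
```

Counts of 100, 10, and 1 become heights of 2, 1, and 0 on the log axis, so bars that differ by a factor of 10 end up one tick apart.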

[Figure: histogram of dwell times with a log-scale y axis]

Plotting dwell times on a log-linear plot shows hidden consistency in the data—there is a linear relationship between the dwell time and the logarithm of the frequency. The clarity of the relationship breaks down to the right of the plot where there are fewer than 10 visitors but, aside from this, the relationship is remarkably consistent.

A straight line on a log-linear plot is a clear indicator of an exponential distribution.
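The reasoning is short: if frequencies shrink by a constant factor for each equal step along the x axis (exponential decay), then their logarithms shrink by a constant amount, which is the definition of a straight line. The sketch below checks this numerically in plain Clojure. The names decay-counts and log-steps are hypothetical, and the counts are invented for illustration rather than taken from the dwell-time data.

```clojure
;; Sketch only: invented counts that halve with every bin.
(defn decay-counts
  "Returns n bin counts starting at `start`, each `ratio` times the previous."
  [start ratio n]
  (take n (iterate #(* ratio %) start)))

(defn log-steps
  "Differences between successive base-10 logs of the counts."
  [counts]
  (let [logs (map #(Math/log10 (double %)) counts)]
    (map - (rest logs) logs)))

;; 1024, 512, 256, 128, 64, 32
(def counts (decay-counts 1024 0.5 6))

;; Each step is approximately log10(0.5), about -0.301,
;; so the log counts fall along a straight line.
(def steps (log-steps counts))
```

Because every step between neighbouring log counts is the same, the points would sit on a line with a constant negative slope when plotted against bin position.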