The exponential distribution
The exponential distribution occurs frequently in situations where there are many small positive quantities and far fewer large ones. Given what we have learned about the Richter scale, it won't be a surprise to learn that the magnitude of earthquakes follows an exponential distribution.
The distribution also frequently occurs in waiting times: the time until the next earthquake of any magnitude roughly follows an exponential distribution as well. It is often used to model failures, where the quantity of interest is the waiting time until a machine breaks down. Our exponential distribution models a process similar to failure: the waiting time until a visitor gets bored and leaves our site.
The exponential distribution has a number of interesting properties. One relates to the mean and standard deviation:
(defn ex-2-4 []
  (let [dwell-times (->> (load-data "dwell-times.tsv")
                         (i/$ :dwell-time))]
    (println "Mean:  " (s/mean dwell-times))
    (println "Median:" (s/median dwell-times))
    (println "SD:    " (s/sd dwell-times))))

;; Mean:   93.2014074074074
;; Median: 64.0
;; SD:     93.96972402519796
The mean and standard deviation are very similar. In fact, for an ideal exponential distribution, they are exactly the same. This property holds for every exponential distribution: as the mean increases, the standard deviation increases with it.
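One way to see this property is to simulate an ideal exponential distribution and compare its statistics with the ones printed above. The following is a minimal sketch in plain Clojure rather than Incanter. The helper names (exponential-sample, mean, sd, median), the seed, and the sample size are illustrative assumptions, not part of the book's code. Values are drawn by inverse-transform sampling with the mean set to 93.2 seconds.

```clojure
;; Sketch: simulate an ideal exponential distribution and compare its
;; mean, standard deviation and median. All names here are hypothetical.

(defn exponential-sample
  "Draws n values from an exponential distribution with mean mu using
  inverse-transform sampling: x = -mu * ln(1 - u), with u uniform on [0, 1)."
  [n mu seed]
  (let [rng (java.util.Random. (long seed))]
    (vec (repeatedly n #(* (- mu) (Math/log (- 1.0 (.nextDouble rng))))))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sd
  "Sample standard deviation (n - 1 denominator)."
  [xs]
  (let [m  (mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

(defn median [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted mid) (nth sorted (dec mid))) 2.0))))

(let [xs (exponential-sample 100000 93.2 42)]
  (println "Mean:       " (mean xs))
  (println "SD:         " (sd xs))
  (println "Median:     " (median xs))
  (println "Mean * ln 2:" (* (mean xs) (Math/log 2))))
```

With a sample this large, the mean and SD should both come out close to 93. The median should be close to the mean multiplied by ln 2, which is about 64.6 for a mean of 93.2 and is consistent with the 64 seconds observed in our data.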
A second property of the exponential distribution is that it is memoryless. This is a counterintuitive property best illustrated by an example. We might expect that as a visitor continues to browse our site, the probability of them getting bored and leaving increases. Since the mean dwell time is 93 seconds, it might appear that beyond 93 seconds they are less and less likely to continue browsing.
The memoryless property of exponential distributions tells us that the probability of a visitor staying on our site for another 93 seconds is exactly the same whether they have already been browsing the site for 93 seconds, 5 minutes, an hour, or they have just arrived.
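The property can be stated precisely. In the sketch below, T is the dwell time and λ is the rate of an ideal exponential distribution (the reciprocal of the mean). These symbols are introduced here for illustration and do not appear in the original text; s is the time already spent on the site and t is the additional time.

```latex
P(T > t) = e^{-\lambda t}

P(T > s + t \mid T > s)
  = \frac{P(T > s + t)}{P(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t)
```

The elapsed time s cancels out. Assuming a mean of 93 seconds, λ = 1/93, so the probability of a visitor staying another 93 seconds is e^{-1}, or roughly 0.37, whatever the value of s.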
The memoryless property of exponential distributions goes some way towards explaining why it is so difficult to predict when an earthquake will occur next. We must rely on other evidence (such as a disturbance in geomagnetism) rather than the elapsed time.
Since the median dwell time is 64 seconds, about half of our visitors are staying on the site for only around a minute. A mean of 93 seconds shows that some visitors are staying much longer than that. These statistics have been calculated on all the visitors over the last 6 months. It might be interesting to see how these statistics vary per day. Let's calculate this now.
The distribution of daily means
The file provided by the web team includes the timestamp of the visit. In order to aggregate by day, it's necessary to remove the time portion from the date. While we could do this with string manipulation, a more flexible approach would be to use a date and time library such as clj-time (https://github.com/clj-time/clj-time) to parse the string. This will allow us to not only remove the time, but also perform arbitrarily complex filters (such as filtering to particular days of the week or the first or last day of the month, for example).
The clj-time.predicates namespace contains a variety of useful predicates and the clj-time.format namespace contains parsing functions that will attempt to convert the string to a date-time object using predefined standard formats. If our timestamp weren't already in a standard format, we could use the same namespace to build a custom formatter. Consult the clj-time documentation for more information and many usage examples. The following functions parse the date, keep only weekdays, and calculate the mean dwell time for each day:
;; Assumed namespace aliases (the ns form is not shown in this section):
;; i -> incanter.core, f -> clj-time.format,
;; tc -> clj-time.coerce, p -> clj-time.predicates

(defn with-parsed-date [data]
  (i/transform-col data :date
                   (comp tc/to-local-date f/parse)))

(defn filter-weekdays [data]
  (i/$where {:date {:$fn p/weekday?}} data))

(defn mean-dwell-times-by-date [data]
  (i/$rollup :mean :dwell-time :date data))

(defn daily-mean-dwell-times [data]
  (->> (with-parsed-date data)
       (filter-weekdays)
       (mean-dwell-times-by-date)))
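As an aside, the same three steps (drop the time portion, keep weekdays, average per day) can be sketched without any dependencies. This version deliberately swaps in java.time for clj-time and plain maps for an Incanter dataset, so it is not the book's implementation. The rows, the function names, and the ISO-8601 timestamp format are all assumptions made for illustration; the real format of dwell-times.tsv is not shown in the text.

```clojure
(import '(java.time LocalDateTime LocalDate DayOfWeek))

;; Hypothetical rows standing in for dwell-times.tsv.
(def visits
  [{:date "2015-01-02T09:15:00" :dwell-time 30}    ; Friday
   {:date "2015-01-02T17:40:00" :dwell-time 90}    ; Friday
   {:date "2015-01-03T11:00:00" :dwell-time 500}   ; Saturday, filtered out
   {:date "2015-01-05T08:05:00" :dwell-time 60}    ; Monday
   {:date "2015-01-05T23:59:00" :dwell-time 120}]) ; Monday

(defn with-local-date
  "Replaces the :date string with a LocalDate, discarding the time."
  [row]
  (update row :date (fn [^String s]
                      (.toLocalDate (LocalDateTime/parse s)))))

(defn weekday?
  "True for Monday to Friday."
  [^LocalDate d]
  (not (contains? #{DayOfWeek/SATURDAY DayOfWeek/SUNDAY}
                  (.getDayOfWeek d))))

(defn daily-means
  "Returns a sorted map of LocalDate -> mean dwell time, weekdays only."
  [rows]
  (->> rows
       (map with-local-date)
       (filter (comp weekday? :date))
       (group-by :date)
       (map (fn [[date day-rows]]
              [date (/ (reduce + (map :dwell-time day-rows))
                       (double (count day-rows)))]))
       (into (sorted-map))))

;; Two weekdays remain, with means 60.0 (Friday) and 90.0 (Monday).
(println (vals (daily-means visits)))
```

Grouping on a date value rather than a string prefix is what makes filters such as weekday? straightforward, which is the flexibility the text attributes to using a date and time library.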
The daily-mean-dwell-times function combines these steps, which lets us calculate the mean, median, and standard deviation of the daily mean dwell times:
(defn ex-2-5 []
  (let [means (->> (load-data "dwell-times.tsv")
                   (daily-mean-dwell-times)
                   (i/$ :dwell-time))]
    (println "Mean:   " (s/mean means))
    (println "Median: " (s/median means))
    (println "SD:     " (s/sd means))))

;; Mean:   90.210428650562
;; Median: 90.13661202185791
;; SD:     3.722342905320035
The mean value of our daily means is 90.2 seconds. This is close to the mean value we calculated previously on the whole dataset, including weekends. The standard deviation is much lower though, just 3.7 seconds. In other words, the distribution of daily means has a much lower standard deviation than the entire dataset. Let's plot the daily mean dwell times on a chart:
(defn ex-2-6 []
  (let [means (->> (load-data "dwell-times.tsv")
                   (daily-mean-dwell-times)
                   (i/$ :dwell-time))]
    (-> (c/histogram means
                     :x-label "Daily mean dwell time (s)"
                     :nbins 20)
        (i/view))))
This code generates the following histogram:
[Figure: The distribution of daily means]
The sample means are distributed symmetrically around the overall grand mean of about 90 seconds, with a standard deviation of 3.7 seconds. Unlike the exponential distribution from which these means were sampled, the distribution of sample means is approximately normal.
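This behavior can be reproduced with a small simulation. The sketch below, in plain Clojure, draws many simulated days of exponential dwell times and looks at the spread of their means. The function names, the seed, and the figure of 640 visits per day are assumptions: 640 was chosen only because 93.2 divided by the square root of 640 lands near 3.7, and the text does not state the site's real daily traffic.

```clojure
;; Sketch: means of exponential samples cluster around the population mean
;; with a standard deviation close to sigma / sqrt(n). Names are hypothetical.

(defn exponential-draw
  "One draw from an exponential distribution with mean mu,
  by inverse-transform sampling."
  [^java.util.Random rng mu]
  (* (- mu) (Math/log (- 1.0 (.nextDouble rng)))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sd
  "Sample standard deviation (n - 1 denominator)."
  [xs]
  (let [m  (mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

(defn simulated-day-mean
  "Mean dwell time of one simulated day with n visits."
  [rng n mu]
  (mean (repeatedly n #(exponential-draw rng mu))))

(defn simulated-daily-means
  "Means of `days` simulated days, each with n exponential dwell times."
  [days n mu seed]
  (let [rng (java.util.Random. (long seed))]
    (vec (repeatedly days #(simulated-day-mean rng n mu)))))

(let [mu    93.2
      n     640   ; hypothetical visits per day, not taken from the text
      means (simulated-daily-means 500 n mu 1)]
  (println "Mean of daily means:" (mean means))
  (println "SD of daily means:  " (sd means))
  (println "sigma / sqrt(n):    " (/ mu (Math/sqrt n))))
```

Although each individual dwell time comes from a heavily skewed distribution, the simulated daily means should sit close to 93 with a standard deviation of a few seconds, much like the daily means computed from the real data.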