Friday 6 April 2007

Wikimedia traffic patterns

Daniel Tobias recently posted to the English Wikipedia mailing list about Alexa's traffic statistics for Wikipedia, suggesting, among other things, that page views seem to peak on Sundays. However, it has often been noted that Alexa's data is problematic when used in certain ways (for starters, it only samples people who have the Alexa toolbar installed). I much prefer to look at our own statistics: the request graphs and traffic graphs hosted on the toolserver by Leon Weber (although it seems the script was written by someone else).

Let's take a look at a weekly graph (you can see the current weekly graph here):



All of these graphs are based on UTC. The black vertical line represents the beginning of a new week, which starts on a Monday. As you can see, the quietest days across this sample period are Saturday and Sunday, and the busiest are Monday to Thursday. It seems safe to infer that more people use Wikimedia projects during the working week than on the weekend.

Now let's look at a daily graph (you can see the current daily graph here):



The black vertical lines on this graph represent midnight UTC. The daily peak across all clusters occurs around 14:00 to 21:00 UTC, with a fairly steep decline on either side of that time period.

But that's the overall data. Things start to get interesting when you break it down into clusters.

Looking at the pmtpa and images clusters (the blue part of the graph), there's a fairly sustained high level from around 14:00 UTC to around 04:00 UTC the next day. It's a little hard to tell with the stacking graph, but the knams and knams-img clusters (green) both seem to have a sustained high level from around 08:00 to 22:00 UTC. Finally, the yaseo and yaseo-img clusters (yellow) seem to get the most traffic between 02:00 and 16:00 UTC.

Why the different times? Well, the pmtpa and images clusters are in Tampa, Florida (the Power Medium datacenter) and so the peak there, from 14:00 to 04:00 UTC, corresponds to the period from 10 am to midnight on the US East Coast, and 7 am to 9 pm on the US West Coast. So the peak for the Tampa cluster essentially corresponds to waking hours in the US. The knams and knams-img clusters are in Amsterdam, the Netherlands (hosted by Kennisnet) and the local peak there corresponds to waking hours across Europe. Finally, yaseo and yaseo-img are in Seoul, South Korea (hosted by Yahoo!) and their local peak, not surprisingly, corresponds to waking hours in East Asia.
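For anyone who wants to check the arithmetic, here is a minimal Python sketch of the UTC-to-local conversion for the Tampa peak. It assumes fixed offsets of UTC-4 (Eastern) and UTC-7 (Pacific), which is what applies in early April while US daylight saving time is in effect; the function names and the sample date are just illustrative choices, not anything taken from the graphs themselves.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for early April 2007, when US daylight saving time
# was in effect (an assumption of this sketch).
EASTERN = timezone(timedelta(hours=-4), "EDT")
PACIFIC = timezone(timedelta(hours=-7), "PDT")


def utc_window_to_local(start_hour, end_hour, tz, day=datetime(2007, 4, 5)):
    """Convert a UTC peak window (whole hours) to local wall-clock times.

    If end_hour is not later than start_hour, the window is taken to
    wrap past midnight UTC into the following day.
    """
    start = day.replace(hour=start_hour, tzinfo=timezone.utc)
    end = day.replace(hour=end_hour, tzinfo=timezone.utc)
    if end <= start:
        end += timedelta(days=1)
    return start.astimezone(tz), end.astimezone(tz)


def format_window(start_hour, end_hour, tz):
    """Render a UTC window as 'HH:MM to HH:MM ZONE' in the given zone."""
    start, end = utc_window_to_local(start_hour, end_hour, tz)
    return f"{start:%H:%M} to {end:%H:%M} {tz.tzname(None)}"


# The Tampa peak read off the graph: 14:00 to 04:00 UTC.
print(format_window(14, 4, EASTERN))  # 10:00 to 00:00 EDT
print(format_window(14, 4, PACIFIC))  # 07:00 to 21:00 PDT
```

The same function can be pointed at the Amsterdam and Seoul windows by passing a different fixed offset, bearing in mind that Europe and East Asia each span several time zones, so any single offset is only a rough stand-in.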

Another interesting observation is that for the Tampa clusters there is a clear drop in requests about two-thirds of the way through the high period... which more or less corresponds to tea time. The latter part of the peak (the local evening) is still high, but not as high as during daylight.

So to tie this all together, most people use Wikimedia projects between 8 am and 10 pm local time, no matter where they are in the world. People also use Wikimedia much more during the working week than they do on weekends. Finally, people use the projects less when they are eating dinner, and are less likely to return to browsing on a full stomach.

Now there's some food for thought.

Monday 2 April 2007

Interesting exercise

With the debate about the Attribution policy merger still going strong, I got to reading the position papers prepared by some of the prominent proponents on each side (broad agreement and broad disagreement) of the debate. While considering those, and some of the responses in the ongoing poll, it struck me how differently people understand Wikipedia's fundamental content policies.

I was particularly intrigued by some of the comments on both sides of the debate which have discussed the ways in which Wikipedia policies have evolved over time: the people supporting Attribution argue that policies have always changed, and some of the people opposing it argue that policies have drifted away from their original meaning. This got me thinking about how much these long-standing policies have actually changed.

I've always thought that the core policies in particular were essentially well-understood concepts that haven't really changed much, and that the development of policy pages over time has merely been a refinement of the expression of the central idea, and an adaptation to meet changing circumstances. I decided to test whether this was really the case by, quite simply, looking at old versions of policy pages.

Here's how the core content policies looked on my first day of editing (8 October 2004):


For both verifiability and no original research, the version that existed when I first edited was within the first fifty revisions of the page. Indeed, verifiability had only been edited by a dozen different users by the time of this version. NPOV had something of a longer history, having been around since the beginning of 2002 (and longer than that as an idea).

There are a few interesting nuggets in these old versions. Most surprising to me is that the old version of no original research posits Wikipedia as either a secondary source or a tertiary source, whereas I've always considered Wikipedia to be only a tertiary source, as a necessary consequence of having the NPOV policy. You'll see that the old version excludes original ideas, but permits analysis, evaluation, interpretation and synthesis as legitimate techniques in writing articles. This would, I am sure, come as a surprise to many (my homework for today: find when the prohibition on synthesis was introduced).

It's also interesting to observe that, contrary to what some people assert, the verifiability policy even back then was all about checking that sources have been used accurately and correctly (i.e., not misrepresenting the sources), and not about only including content that can be proved to be true. The old version of verifiability also included a section about reliable sources, and offered a classic formulation which I still regard as eminently valid:

"Sometimes a particular statement can only be verified at a place of dubious reliability, such as a weblog or tabloid newspaper. If the statement is relatively unimportant, then just remove it - don't waste words on statements of limited interest and dubious truth. However, if you must keep it, then attribute it to the source in question. For example: 'According to the weblog Simply Relative, the average American has 3.8 cousins and 7.4 nephews and nieces.' "


I'm sure that there is more to be learned from these old versions of policy. If you are interested in this subject, I would encourage you to check out a history page for yourself: perhaps you'd like to view the policies as they existed when you first edited, or perhaps you'd like to delve even deeper into the past than that.

The historical development of policy can offer great insights into how it can be developed in the future.