Those of us who live other than under rocks will no doubt be aware of the latest controversy over Wikipedia's approach to biographies of living persons (BLPs): the deletion last week of a large number of BLPs that had been tagged as unsourced and had not been edited for more than six months. The deletions sparked a giant administrators' noticeboard discussion, a request for arbitration and now a request for comments on how to proceed from here.
At the crux of the dispute is how seriously the project is to take the modified standards that it has adopted with respect to biographies of living persons.
Debates of this sort are usually run along inclusionist/deletionist lines, but really the more important philosophical dichotomy when it comes to BLPs is between eventualists and immediatists. Wikipedia on the whole favours an eventualist perspective - facilitated by the almost immeasurably large potential pool of labour out there - but the BLP policy is essentially a localised switch to immediatism: unsourced material needs to be sourced post-haste, or else removed.
Conceptually it's an elegant and attractive approach. But a major flaw with it is our attraction to eventualism. We just can't shake it off.
This category, and its many subcategories, tracks BLP articles that have been tagged as not having any sources. At the time of writing there are over 47,000 of them, some having been tagged as long ago as December 2006. Evidently any sense of urgency has passed those by. The backlogs mount until they approach the point where individual editors have difficulty comprehending the problem, let alone working to address it. Frustration builds at the inevitable inertia, until something radical happens, like these mass deletions.
Is this view accurate? Is the problem of unsourced BLPs really out of hand? We can try to answer these questions by looking at the way the backlog has been managed.
Unfortunately, the data available for this purpose is somewhat limited. Database dumps older than the 20 September 2009 dump are currently not available due to maintenance. However, that September dump, along with dumps from 28 November 2009 and 16 January this year (shortly before the deletions started), does offer three data points with which to commence.
The monthly subcategories from October 2006 to August 2009 inclusive were common to all three dumps. The total number of articles in these categories declined from 50,715 in September to 43,655 earlier this month, a 13.9% fall. However, over the same period, the total in all subcategories through December 2009 rose from 50,715 to 51,301, a 1.2% increase. At least over this period, new additions outweighed articles being removed from these categories.
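The two percentages above can be checked with a few lines of arithmetic. This is only a minimal sketch using the totals quoted in the post; the function name `percent_change` is an illustrative choice, not part of any tool used for the original analysis.

```python
def percent_change(old: int, new: int) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


# Totals quoted in the post (September 2009 dump vs. January 2010 dump).
common_sept, common_jan = 50_715, 43_655  # subcategories Oct 2006 to Aug 2009
all_sept, all_jan = 50_715, 51_301        # all subcategories through Dec 2009

# A negative value is a fall, a positive value is a rise.
print(f"Common subcategories: {percent_change(common_sept, common_jan):+.1f}%")
print(f"All subcategories:    {percent_change(all_sept, all_jan):+.1f}%")
```

The first figure comes out negative (the 13.9% fall) and the second positive (the 1.2% rise), which is the sense in which new additions outweighed removals.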
It should be noted that some of these additions are due to articles that had been tagged, but were unsorted, being added into the monthly subcategories. In fact, ten of the thirty-five subcategories common to all three dumps saw increases in numbers since September. The following graph shows the change in the monthly category totals over the roughly four months between the September and January dumps:
Without analysing the actual changes in the lists of articles in these subcategories, it isn't possible to tell whether the sorting process is merely outweighing the normal reductions from articles being referenced or deleted or, as I suspect, whether there are genuinely fewer reductions in subcategories that are no longer recent but not yet the oldest. This can be the subject of further inquiry.
What we can say is that the total number of unreferenced BLPs is now showing a real decline for the first time in at least four months, possibly longer. It seems to have been the shock of mass deletions that has spurred people into action, either to fix or to delete these articles. Hopefully the shock will last long enough for a significant reduction to be achieved.
Saturday, 30 January 2010
What happens to unreferenced BLPs?
Posted by Stephen at 1:45 pm
Labels: biographies of living persons, deletion, statistics
Sunday, 17 May 2009
New tools
A couple of new tools I've put together that people might find some use for:
- Admin activity statistics: shows some statistics on how many admins have used their tools at all over various timeframes, and on how many actions are taken by each active admin over various timeframes. Works on any Wikimedia project.
- Per-page contributions: like [[Special:Contributions]], but shows contributions just to a particular page. Works on any Wikimedia project. I've already found it quite useful in several arbitration cases, especially for users who have made a large number of edits, or for pages which have been edited many times.
The image below is one of the graphs produced by the admin activity tool. It shows how many admins have performed at least one administrative action over various timeframes on the English Wikipedia:
Posted by Stephen at 12:42 pm
Labels: statistics, tools
Monday, 20 April 2009
More bug statistics
Last November I put together some simple charts with the information from the weekly bug statistics that are automatically generated for the wikitech-l mailing list. There's now thirty-two weeks of data available, so here are some updated charts.
The distribution of resolution types seems to have stayed more or less the same over time, continuing the pattern seen in the original charts:
However, there are some changes in the other graph, which is based on information about the number of bugs each week. It shows the number of new, reopened, assigned and resolved bugs each week (using the scale on the left) and the total number of open bugs (in blue, using the scale on the right):
While there is still the same rough correlation between the number of new bugs and the number of bugs resolved each week, there is also a steady trend upwards in the total number of open bugs. Indeed, the total has risen nearly 20% since October last year.
So what are the consequences of so many bugs being opened but not dealt with? The following chart, generated by Bugzilla directly, shows the distribution of the "severity" parameter of all currently open bugs:
It shows that three-fifths of open bugs have severity given as "enhancement", essentially meaning that they're feature requests, entered into Bugzilla for tracking purposes, rather than being true bugs. A further 13% are marked "trivial" or "minor", and nearly a quarter "normal"; only 3% are "major".
So while the number of unresolved bugs is steadily rising, most of these are either feature requests or only minor bugs. Still, the backlog is fairly steadily getting worse, a reminder that it's constantly necessary for new volunteer developers to become involved with improving MediaWiki.
Posted by Stephen at 11:55 pm
Labels: bugs, statistics
Tuesday, 18 November 2008
Bug statistics
Since the beginning of September, the bug tracker for MediaWiki has been sending weekly updates to the wikitech-l mailing list, with stats on how many bugs were opened and resolved, the type of resolution, and the top five resolvers for that week. With eleven weeks of data so far, some observations can be made.
The following graph shows the number of new, resolved, reopened and assigned bugs per week (dates given are the starting date for the week). The total number of bugs open that week is shown in blue, and uses the scale to the right of the graph:
The total number of open bugs has been trending upwards, but only marginally, over the past couple of months. It will be interesting to see, with further weekly data, where this trend goes.
It also seems that the number of bugs resolved in any given week tends to go up and down in tandem with the number of new bugs reported in that week. Although there is no data currently available on how quickly bugs are resolved, I would speculate that most of the "urgent" bugs are resolved within the week that they are reported, which would explain the correlation.
Note also the spike in activity in the week beginning 6th October; this was probably the result of the first Bug Monday.
The second graph shows the breakdown of types of bug resolutions:
The distribution seems fairly similar week on week, with most resolutions being fixes. It's interesting to note that regularly around 25% to 35% of bug reports are problematic in some way, whether duplicates or bugs that cannot be reproduced by testers.
The weekly reports are just a taste of the information available about current bugs; see the reports and charts page for much more statistic-y goodness. And kudos to the developers who steadily work away each week to handle bugs!
Posted by Stephen at 7:08 pm
Labels: bugs, MediaWiki, statistics
Monday, 2 June 2008
Rambot redux
FritzpollBot is a name you're likely to be hearing and seeing more of: it's a new bot designed to create an article on every single town or village in the world that currently lacks one, of which there are something like two million. The bot gained approval to operate last week, but there's currently a village pump discussion underway about it.
FritzpollBot has naturally elicited comparisons with rambot, one of the earliest bots to edit Wikipedia. Operated by Ram-Man, first under his own account and then under a dedicated account, rambot created stubs on tens of thousands of cities and towns in the United States starting in late 2002.
It's hard now to get a sense of what rambot did, but its effects can still be seen. All told, rambot's work represented something close to a doubling of Wikipedia's size in a short space of time (the bulk of the work, more than 30,000 articles, being done over a week or so in October 2002). The noticeable bump that it produced in the total article count is still visible in present graphs of Wikipedia's size. Back then the difference was huge. I didn't join the project until two years after rambot first operated, but even then around one in ten articles had been started by rambot, and one would run into them all the time.
During its peak, rambot was adding articles so fast that the growth rate per day achieved in October 2002 has never been outstripped, as can be seen from the graph below (courtesy Seattle Skier at Commons):
There was some concern about rambot's work at the time: see this discussion about rambot stubs clogging up the Special:Random system, for example. There were also many debates about the quality and content of the stubs, many of which contained very little information other than the name and location of the town.
The same arguments that were made against rambot at the time, mainly to do with the project's ability to maintain so many new articles all at once, are being made again with respect to FritzpollBot. In the long run, the concerns about rambot proved to be ill-founded, as the project didn't collapse, and most (if not all) of the articles have now been absorbed into the general corpus of articles. The value of its work was ultimately acknowledged, and now there are many bots performing similar tasks.
In addition to the literal value of rambot's contributions, there's a case to argue that the critical mass of content that rambot added kickstarted the long period of roughly exponential growth that Wikipedia enjoyed, lasting until around mid-2006. I don't think it's unreasonable to suggest that having articles on every city or town in the United States, even if many were just stubs, was a significant boon for attracting contributors. From late 2002 on, every American typing their hometown or their local area into their favourite search engine would start to turn up Wikipedia articles among the results, undoubtedly helping to attract new contributors. The stubs served as a base for redlinks, which in turn helped build the web and generate an imperative to create content. Repeating the process for the rest of the world, as FritzpollBot promises to do, would thus be an incredibly valuable step.
Furthermore, as David Gerard observes, when rambot finished its task the project had taken its first significant step towards completeness on a given topic. Rambot helped the project make its way out of infancy; now in adolescence, systemic bias is one of the major challenges it faces, and hopefully FritzpollBot can help existing efforts in this regard. Achieving global completeness across a topic area as significant as the very places that humans live would be a massive accomplishment for the project.
Let's see those Ws really cover the planet.
Posted by Stephen at 2:26 am
Labels: bots, statistics, Wikipedia
Saturday, 29 March 2008
Wikipedia's downstream traffic
We've been hearing for a while about where Wikipedia's traffic comes from, but here are some new stats from Heather Hopkins at Hitwise on where traffic goes to after visiting Wikipedia. Hopkins had produced some similar stats back in October 2006, and it's interesting to compare the results.
Wikipedia gets plenty of traffic from Google (consistently around half) and indeed other search engines, but what's interesting is that nearly one in ten users go back to Google after visiting Wikipedia, making it the number one downstream destination. Yahoo! is also a popular post-Wikipedia destination.
It was nice to see that Wiktionary and the Wikimedia Commons both make it into the top twenty sites visited by users leaving Wikipedia.
Hopkins also presents a graph illustrating destinations broken down by Hitwise's categories. More than a third of outbound traffic is to sites in the "computers and internet" category, and around a fifth to sites in the "entertainment" category, which probably ties in with the demographics of Wikipedia readers, and the general popularity of pop culture, internet and computing articles on Wikipedia.
Hopkins makes another interesting point on the categories, that large portions of the traffic in each category are to "authority" sites:
"Among Entertainment websites, IMDB and YouTube are authorities. Among Shopping and Classifieds it's Amazon and eBay. Among Music websites it's All Music Guide. For Sports it's ESPN. For Finance it's Yahoo! Finance. For Health & Medical it's WebMD and United States National Library of Medicine."
Similarly, Doug Caverly at WebProNews states that the substantial proportion of traffic returning to search engines after visiting Wikipedia "probably indicates that folks are continuing their research elsewhere", and this ties in well with Hopkins' observation about the strong representation of reference sites.
All of this suggests that Wikipedia is being used the way that it is really meant to be used: as a first reference, as a starting point for further research.
Posted by Stephen at 12:13 am
Labels: statistics, Wikipedia
Monday, 17 March 2008
Protection and pageviews
Henrik's traffic statistics viewer, a visual interface to the raw data gathered by Domas Mituzas' wikistats page view counter, has generated plenty of interest among the Wikimedia community recently. Last week Kelly Martin, discussing the list of most viewed pages, wondered how many page views are of protected content; that thought piqued my interest, so I decided to dust off the old database and calculator and try to put a number to that question.
The data comes from the most viewed articles list covering the period from 1 February 2008 to 23 February 2008. I've used that data, and data on protection histories from the English Wikipedia site, to come up with some stats on page protection and page views. There are some limitations: I don't have gigabytes of bandwidth available, so some of the stats (on page views in particular) are estimates, and protection logs turn out to be pretty difficult to parse, so I've focused on collecting duration information rather than information on the type of protection (full protection, semi-protection etc). Maybe that could be the focus of a future study.
There were 9956 pages in the most viewed list for February 1 to February 23 2008. Excluding special pages, images and non-content pages, there were 9674 content pages (articles and portals) in the list. Interestingly, only 3617 of these pages have ever been protected, although each page that has been protected at least once has, on average, been under protection nearly three times.
Protection statistics
Only 1223 (12.6%, about an eighth) of the pages were edit protected at some point during the sample period, 902 of those for the entire period (a further 92 were move protected only at some point, 69 of those for the entire period). Each page that had some period of protection was protected for, on average, 82.9% of the time (just over 19 days), though if the pages protected for the whole period are excluded, the average period spent protected was only 34.8% of the time (just over eight days).
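The relationship between the two averages can be reconstructed from the counts alone. The sketch below assumes a 23-day sample period (1 to 23 February) and that pages protected throughout count as protected 100% of the time; the function name is purely illustrative.

```python
SAMPLE_DAYS = 23  # 1 February to 23 February 2008, inclusive


def average_excluding_full(total_pages: int, full_pages: int,
                           overall_avg: float) -> float:
    """Average protected fraction for pages NOT protected the whole period.

    overall_avg is the mean protected fraction over all protected pages;
    each fully protected page contributes a fraction of exactly 1.0.
    """
    partial_pages = total_pages - full_pages
    summed_fractions = overall_avg * total_pages
    return (summed_fractions - full_pages) / partial_pages


partial_avg = average_excluding_full(1223, 902, 0.829)

print(f"Partial-period pages: {partial_avg:.1%} of the sample period")
print(f"Overall average in days: {0.829 * SAMPLE_DAYS:.1f}")
print(f"Partial-period average in days: {partial_avg * SAMPLE_DAYS:.1f}")
```

Working backwards from the 82.9% overall figure gives roughly 34.8% for the 321 pages protected for only part of the period, which agrees with the figure quoted above.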
The following graph shows the distribution of the portion of the sample period that pages spent protected, rounded down to the nearest ten percent:
The shortest period of protection during the period was for Vicki Iseman, protected on 21 February by Stifle, who thought better of it and unprotected just 38 seconds later.
Among the most viewed list for February, the page that has been protected the longest is Swastika, which has been move protected continuously since 1 May 2005 (more than 1050 days). The page that has been edit protected the longest is Marilyn Manson (band), which has been semi-protected since 5 January 2006 (more than 800 days).
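The day counts quoted here are simple date differences. As a minimal sketch, assuming the counts are measured up to the date of this post (17 March 2008), they can be reproduced with Python's standard `datetime` module; the helper name is hypothetical.

```python
from datetime import date

# Assumption: durations are measured up to the date this post was published.
AS_OF = date(2008, 3, 17)


def days_protected(since: date, as_of: date = AS_OF) -> int:
    """Number of days between the start of protection and as_of."""
    return (as_of - since).days


swastika_move = days_protected(date(2005, 5, 1))      # move protection
manson_semi = days_protected(date(2006, 1, 5))        # semi-protection

print(f"Swastika: {swastika_move} days")               # 1051
print(f"Marilyn Manson (band): {manson_semi} days")    # 802
```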
Interestingly, the average length of a period of edit protection across these articles (through their entire history) is around 46 days and 16 hours, whereas the average length of a period of move protection is lower, at 41 days 14 hours. I had expected the average bout of move protection to last longer, although almost all edit protections do include move protections.
The next graph shows the distribution of protection lengths across the history of these pages, for periods of protection up to 100 days in length (the full graph goes up to just over 800 days):
Note the large spikes in the distribution at seven and fourteen days, the smaller spike at twenty-one days and the bump from twenty-eight to thirty-one days, corresponding to protections of four weeks or one month duration (MediaWiki uses calendar months, so one month's protection starting January will be 31 days long, whereas one month's protection starting September will be 30 days long).
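To illustrate why "one month" spreads over several day counts, here is a small sketch of calendar-month arithmetic. It is not MediaWiki's own expiry code: the helper `add_one_month` and its choice to clamp to the last day of a shorter month are assumptions made purely for illustration.

```python
import calendar
from datetime import date


def add_one_month(start: date) -> date:
    """Return the same day of the following calendar month.

    Assumption for this sketch: if the next month is shorter, clamp to
    its last day (MediaWiki's own handling may differ).
    """
    year = start.year + (start.month // 12)
    month = start.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def protection_days(start: date) -> int:
    """Length in days of a one-calendar-month protection from start."""
    return (add_one_month(start) - start).days


print(protection_days(date(2008, 1, 15)))  # January start: 31
print(protection_days(date(2008, 9, 15)))  # September start: 30
```

A February start gives 28 or 29 days under the same arithmetic, so a "one month" setting can land anywhere from 28 to 31 days, which fits the bump seen in the distribution.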
The final graph shows the average length of protection periods (orange) and the number of protection periods applied (green) in each month, over the last four years:
At least on these generally popular articles, protection got really popular towards the end of 2006 into the beginning of 2007, and again a year later. However, it seems that protection lengths peaked around the middle of 2007 and have been in decline since then.
Protection and pageviews
What really matters here though is the pageviews. The 9674 content pages in the most viewed list were viewed a total of 805,569,269 times over the relevant period. The 1223 pages that were edit protected for at least part of the period were viewed a total of 270,057,550 times (33.5%), with approximately 247 million of these pageviews coming while the pages were protected.
This is a really substantial number of pageviews. However, it includes the Main Page, which alone accounts for more than 114 million of them. Leaving the Main Page out of the equation gives a healthier figure of around 133 million views to protected pages during the relevant period (and remember, this is only counting pages on the most viewed list).
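The share and the Main Page adjustment are straightforward to recompute. The sketch below uses the exact totals given in the post together with the two rounded figures (247 million and 114 million), so the final result is approximate by construction; all names are illustrative.

```python
TOTAL_VIEWS = 805_569_269            # all 9674 content pages
PROTECTED_PAGE_VIEWS = 270_057_550   # the 1223 pages protected at some point

# Rounded figures from the post, so results derived from them are estimates.
VIEWS_WHILE_PROTECTED = 247_000_000
MAIN_PAGE_VIEWS = 114_000_000


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return part / whole * 100


def excluding_main_page(protected_views: int, main_page_views: int) -> int:
    """Views of protected pages once the Main Page is left out."""
    return protected_views - main_page_views


print(f"{share_percent(PROTECTED_PAGE_VIEWS, TOTAL_VIEWS):.1f}%")
print(f"{excluding_main_page(VIEWS_WHILE_PROTECTED, MAIN_PAGE_VIEWS):,}")
```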
Conclusions
Although only one in eight of the pages in the most viewed list were protected at some point during the relevant period, they tended to be higher-profile ones, accounting for one third of the page views. The pages that were protected at some point tended to be protected a lot of the time, three-quarters of them for the entire sample period. This certainly fits with what many people have already suspected: that a small pool of high-profile articles attracts plenty of attention in the form of both page protection and page views.
It will be interesting to do some more analysis on the history of page protection. Based on just this small sample, it seems that average protection lengths are trending downwards, which could well be something to do with the advent of timed protection. Hopefully I'll have some more insights to come.
Posted by Stephen at 1:13 am
Labels: pageviews, protection, statistics, Wikipedia
Friday, 6 April 2007
Wikimedia traffic patterns
Daniel Tobias posted to the English Wikipedia mailing list recently about Alexa's traffic statistics for Wikipedia, suggesting, among other things, that page views seem to peak on a Sunday. However, it has often been noted that there are problems with using Alexa's data in certain ways (they only sample people with the Alexa toolbar, for starters) and so I much prefer to look at our own statistics, the request graphs and traffic graphs hosted on the toolserver by Leon Weber (although it seems the script was written by someone else).
Let's take a look at a weekly graph (you can see the current weekly graph here):
All of these graphs are based on UTC. The black vertical line represents the beginning of a new week, which starts on a Monday. As you can see, the lowest days across this sample period are Saturday and Sunday, with the highest being on Monday to Thursday. It seems safe to infer that more people use Wikimedia projects during the working week than they do on the weekend.
Now let's look at a daily graph (you can see the current daily graph here):
The black vertical lines on this graph represent midnight UTC. The daily peak across all clusters occurs around 14:00 to 21:00 UTC, with a fairly steep decline on either side of that time period.
But that's the overall data. Things start to get interesting when you break it down into clusters.
Looking at the pmtpa and images clusters (the blue part of the graph), there's a fairly sustained high level from around 14:00 UTC to around 04:00 UTC the next day. It's a little hard to tell with the stacking graph, but the knams and knams-img clusters (green) both seem to have a sustained high level from around 08:00 to 22:00 UTC. Finally, the yaseo and yaseo-img clusters (yellow) seem to get the most traffic between 02:00 and 16:00 UTC.
Why the different times? Well, the pmtpa and images clusters are in Tampa, Florida (the Power Medium datacenter) and so the peak there, from 14:00 to 04:00 UTC, corresponds to the period from 10 am to midnight on the US East coast, and 7 am to 9 pm on the US West coast. So the peak for the Tampa cluster essentially corresponds to waking hours in the US. The knams and knams-img clusters are in Amsterdam, the Netherlands (hosted by Kennisnet) and the local peak there corresponds to waking hours across Europe. Finally, yaseo and yaseo-img are in Seoul, South Korea (hosted by Yahoo!) and their local peak, not surprisingly, corresponds to waking hours in East Asia.
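The conversion from UTC to US local time can be sketched as follows. The fixed offsets are an assumption: they apply the daylight saving offsets in force in early April 2007 (UTC-4 for the East coast, UTC-7 for the West coast) rather than looking anything up in a timezone database, and the date used is simply the date of this post.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for April 2007 (US daylight saving time in effect).
EDT = timezone(timedelta(hours=-4), "EDT")  # US East coast
PDT = timezone(timedelta(hours=-7), "PDT")  # US West coast


def local_time(utc_hour: int, tz: timezone) -> str:
    """Convert an hour of the day in UTC to HH:MM in the given zone."""
    moment = datetime(2007, 4, 6, utc_hour, tzinfo=timezone.utc)
    return moment.astimezone(tz).strftime("%H:%M")


# Start and end of the Tampa peak, 14:00 and 04:00 UTC.
for label, tz in (("East coast", EDT), ("West coast", PDT)):
    print(f"{label}: {local_time(14, tz)} to {local_time(4, tz)}")
```

Under those offsets the peak runs from 10:00 to 00:00 on the East coast and from 07:00 to 21:00 on the West coast, matching the times given above.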
Another interesting observation is that for the Tampa clusters there is a clear drop in requests about two-thirds of the way through the high period... which more or less corresponds to tea time. The latter part of the peak (the local evening) is still high, but not as high as during daylight.
So to tie this all together, most people use Wikimedia projects between 8 am and 10 pm local time, no matter where they are in the world. People also use Wikimedia much more during the working week than they do on weekends. Finally, people use the projects less when they are eating dinner, and are less likely to return to browsing on a full stomach.
Now there's some food for thought.
Posted by Stephen at 1:40 pm
Labels: statistics

