Wikipedia events per day statistics

Today (2013-10-03) CNN ran an article on how October 3rd supposedly stands out in the year as a day for important events. My gut reaction was that CNN was full of it and just needed to give some intern something to do. Fortunately, with the Internet and a command line available on most computers these days, everyone can do their own research with a little command line magic. By simply counting the number of events listed for each day on Wikipedia, we can get a better idea of which days of the year stand out as significant.

In the interest of not overloading Wikipedia with requests, you can create a directory and save each date's page into it for later analysis like this:

mkdir wikipedia-dates && cd wikipedia-dates
for i in {0..365}; do date=$( date -d "2012-01-01 + $i day" +%B_%d ); elinks -no-references -no-numbering -dump https://enwp.org/$date > $date ; ls -l $date ; done

You can download this set of files in a tarball I made here.

This generates 366 files in a wikipedia-dates directory, each containing the parsed contents of one date page on Wikipedia. I used 2012 as the base year because it is a leap year. We only need the days of the year starting with January 1st, so the particular year doesn't really matter, as long as it's a leap year so we get all 366 of them. The -no-references and -no-numbering flags tell elinks not to add a lot of link text, which would make each file harder to parse.
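If you want to check the date arithmetic before fetching anything, here is a minimal sketch that only prints the page names. The gen_dates function name is my own invention for illustration, it assumes GNU date, and I've swapped the {0..365} brace expansion for seq so it also runs under plain sh. Setting LC_ALL=C and TZ=UTC is an extra precaution on my part to keep the month names in English and sidestep daylight saving quirks.

```shell
# gen_dates: print the page names January_01 ... December_31 for a leap year.
# Hypothetical helper; nothing is downloaded here.
# Assumes GNU date (-d with a relative "+ N day" offset).
gen_dates() {
  for i in $(seq 0 365); do
    LC_ALL=C TZ=UTC date -d "2012-01-01 + $i day" +%B_%d
  done
}

gen_dates | wc -l                      # how many names were generated
gen_dates | grep -c '^February_29$'    # 1 if the leap day is in the list
```

Swap 2012 for a non-leap year and the list runs one day into the following January instead of including February_29, which is why the base year has to be a leap year.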

Now we have a cached set of files for quick analysis. First we do a simple count of the number of events by day for the whole year.

for date in *_[0-9][0-9] ; do printf "$date " ; sed -n '/^Events/,/^Births/p' $date | egrep '^\s+\*' | wc -l ; done | sort -k2nr | nl | column -t | less

This quickly generates a ranked list of the most significant dates. Instead of generating all the dates again using date, I decided to use the file glob *_[0-9][0-9]. This is a new directory, so we know the files we care about follow that specific pattern, and there shouldn't be any non-date files matching it unless you have created one yourself. For each date, the loop prints the name without a trailing newline so that the count from the rest of the pipeline lands on the same line and supplies the newline. You could also save the count into a variable and print both afterwards, but this way is a bit more efficient. Each date page is passed to sed, which prints only the section between the Events and Births headers. egrep then keeps only the lines that start with whitespace followed by an asterisk, and wc -l counts them. The whole list produced by the for loop is piped into sort -k2nr, which sorts on the 2nd column (-k2) in numeric descending order (-nr). That list goes to nl, which simply numbers the lines so the ranks are easy to see, and then to column -t, which lines up the columns. Easy peasy.
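To see the counting step in isolation, here is a small sketch that runs it against a made-up page. The count_events function name and the sample text are assumptions of mine for illustration, not real Wikipedia data, and I've used grep -Ec with [[:space:]] as a portable stand-in for egrep '^\s+\*' piped to wc -l. It also shows the "save it in a variable, then print" variant mentioned above.

```shell
# count_events FILE: count bullet lines between the Events and Births headers.
# Hypothetical helper wrapping the same sed range used in the one-liner.
count_events() {
  sed -n '/^Events/,/^Births/p' "$1" | grep -Ec '^[[:space:]]+\*'
}

# Made-up miniature of an elinks dump; real pages are far longer.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Events

     * 1066 - Example event one
     * 1789 - Example event two

Births

     * 1900 - Someone is born
EOF

count=$(count_events "$sample")          # save the count first...
printf '%s %s\n' "Sample_01" "$count"    # ...then print name and count together
rm -f "$sample"
```

The bullet under Births is not counted because the sed range stops printing at the Births header.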

Here are the top 20 dates from that list as of 2013-10-03:

1    January_01    144
2    March_01      89
3    May_01        86
4    July_01       84
5    November_01   81
6    September_11  80
7    August_15     78
8    August_23     77
9    October_01    77
10   September_18  76
11   July_04       74
12   July_25       74
13   April_01      73
14   May_15        69
15   October_14    67
16   May_22        66
17   October_09    66
18   April_06      65
19   October_29    64
20   March_04      63

October 3rd is nowhere to be found. In fact, it ranks 277th out of 366, which puts it just inside the bottom quartile. So it's fairly insignificant in the grand scheme of history. But wait, there is more analysis we can do. Since CNN has a strong American bias, we can limit our analysis to events in the last 237 years (1776 onward) by placing awk into the pipeline:

for date in *_[0-9][0-9] ; do printf "$date " ; sed -n '/^Events/,/^Births/p' $date | egrep '^\s+\*' | egrep -v "[0-9]+ BC" | cut -c9- | awk '{if($1 >= 1776){ print $1}}' | wc -l ; done | sort -k2rn | nl | column -t | less

This actually makes October 3rd even less significant, placing it 308th on the list. You'll notice a few more commands I needed to add in order to generate this list. Because Wikipedia uses Unicode zero width spaces after each bullet item, I had to remove those. I first examined the output of all the days and years to make sure they all had the same prefix of 5 spaces followed by an asterisk and then a zero width space, and then cut that off with cut -c9-. In total, it's 8 characters because cut isn't Unicode aware and there are 2 bytes of data that make up the Unicode character. I also had to remove all the events dated prior to the year 1 (BC) using egrep -v "[0-9]+ BC". Once we've cleaned up the data, we pass it into awk and only print the first column if the year is greater than or equal to 1776.
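Counting prefix bytes by hand is fragile, so here is a hedged variation on the same idea. Instead of cut, it uses sed to strip everything from the bullet up to the first digit, which works no matter how many bytes the invisible character takes up. This is my own swapped-in technique, not the pipeline used for the numbers in this post, and the count_since function name and sample lines are made up for illustration.

```shell
# count_since FILE YEAR: count Events bullets whose leading year is >= YEAR.
# Hypothetical helper; a variation on the cut-based pipeline, not a copy of it.
count_since() {
  sed -n '/^Events/,/^Births/p' "$1" |
    sed -n -E 's/^[[:space:]]+\*[^0-9]*//p' |  # keep bullets, drop prefix up to the first digit
    grep -Ev '[0-9]+ BC' |                     # drop years before year 1
    awk -v min="$2" '$1 + 0 >= min { n++ } END { print n + 0 }'
}

# Made-up sample; \342\200\213 is a zero width space written as UTF-8 octal bytes.
sample=$(mktemp)
{
  printf 'Events\n\n'
  printf '     * \342\200\213%s\n' \
    '42 BC - Example ancient event' \
    '1066 - Example medieval event' \
    '1789 - Example modern event' \
    '1969 - Another modern event'
  printf '\nBirths\n\n     * 1980 - Someone is born\n'
} > "$sample"

count_since "$sample" 1776
rm -f "$sample"
```

Passing the cutoff year as an argument also makes it easy to try other ranges without editing the awk program.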

There is even more analysis you could do, which is an exercise I'll leave to the reader. Here are some ideas:

BTW: February 29th comes in last with 22 events, but if you multiply that by 4 you get 88, which could be interpreted as being the 3rd most significant date.


Created: 2013-10-03

Updated: 2018-10-22
