Adventures in Data

A blog about playing with data

Population Density Maps in Canadian Cities

I’m a big fan of maps, particularly those displaying data. One of the best sources of data out there is the Canadian census, which I’m using here to give a view of population density in different cities. What I’ve put onto these maps is the population count of each census polygon divided by that polygon’s area, which is a measurement of population density. The scales aren’t consistent between cities, so don’t compare across maps. I’ve only made maps for cities I’m somewhat familiar with, but I could make more for other cities if people are interested.
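For anyone who wants to reproduce the calculation, here is a minimal sketch of population divided by polygon area. It is not the code behind the maps: the function names are made up for illustration, and it assumes the polygon vertices have already been projected to planar coordinates in metres (raw latitude and longitude would give the wrong area).

```python
def polygon_area_km2(vertices_m):
    """Area of a simple polygon via the shoelace formula.

    vertices_m: list of (x, y) pairs in metres on a planar projection
    (an assumption of this sketch), listed in order around the boundary.
    """
    n = len(vertices_m)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices_m[i]
        x2, y2 = vertices_m[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0 / 1_000_000.0  # m^2 -> km^2


def population_density(population, vertices_m):
    """People per square kilometre for one census polygon."""
    area = polygon_area_km2(vertices_m)
    if area == 0:
        raise ValueError("polygon has zero area")
    return population / area


# Hypothetical example: a 1 km x 1 km square polygon holding 2500 people.
square = [(0, 0), (1000, 0), (1000, 1000), (0, 1000)]
density = population_density(2500, square)
```

Because each city gets its own color scale, the same density value can look very different from one map to the next.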

Link (Zoomable)

Link (Zoomable)

Link (Zoomable)

Link (Zoomable)

Most Dangerous Bike Intersections in Toronto

I am a cyclist in Toronto, and I find it quite frightening at times with all the traffic. I hear a lot about cycle accidents in and around the city, and as a data nerd I was happy to find that the Toronto Traffic Safety Unit has been collecting GPS-tagged data on where bicycle accidents have been occurring throughout the city. This is only the reported subset of all the accidents, but it represents about 31,000 collisions. Here I am using that data to try to determine where the most dangerous cycle intersections in Toronto are.

What I did was use the GPS coordinates recorded for the accidents to map them to their closest intersection. The measurement for each intersection contains accidents at that intersection as well as on the street nearby. Since the intersections with the highest number of accidents are just those with the highest amount of traffic, I normalized with traffic data from the Toronto Traffic Safety Unit. In doing so we run into a well-known statistical phenomenon: intersections with the smallest sample sizes dominate the ranking, because a single accident at a low-traffic intersection produces a very noisy rate.
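The two steps described here (snap each accident to its closest intersection, then divide by traffic) can be sketched roughly as follows. This is an illustration under assumptions, not the original analysis code: the function names, the dictionary layout, and the sample coordinates are all hypothetical, and distance is measured with the haversine formula.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_intersection(lat, lon, intersections):
    """Name of the intersection closest to the given point.

    intersections: {name: (lat, lon)} (hypothetical layout).
    """
    return min(intersections, key=lambda name: haversine_m(lat, lon, *intersections[name]))


def accidents_per_intersection(accidents, intersections):
    """Count accidents after snapping each one to its closest intersection."""
    return Counter(nearest_intersection(lat, lon, intersections) for lat, lon in accidents)


def raw_rates(counts, traffic):
    """Accidents divided by a traffic measure, with no smoothing."""
    return {name: counts.get(name, 0) / traffic[name] for name in traffic if traffic[name] > 0}


# Made-up coordinates and traffic volumes, for illustration only.
intersections = {"A": (43.6500, -79.3800), "B": (43.6600, -79.3900)}
accidents = [(43.6501, -79.3801), (43.6499, -79.3802), (43.6598, -79.3899)]
counts = accidents_per_intersection(accidents, intersections)
rates = raw_rates(counts, {"A": 1000, "B": 100})
```

In this toy data the quiet intersection B ends up with the higher raw rate from a single accident, which is exactly the small-sample noise problem. A brute-force nearest search like this is slow for tens of thousands of accidents; a spatial index would be the usual fix.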

In order to deal with this we can use Bayesian statistics: by putting a prior on what we think the observation should be, we can prevent the noise in small-sample-size intersections from dominating. I’m using a technique called Empirical Bayesian analysis, which calculates a prior from the average of all the intersections in the dataset.
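To make the idea concrete, here is one common form of empirical Bayes shrinkage, sketched under assumptions. The post does not say which model was used, so treat this as an illustration rather than the actual method: it pulls each intersection's rate towards the pooled city-wide rate, and `prior_strength` is a hypothetical tuning parameter expressed in the same units as the traffic measure.

```python
def empirical_bayes_rates(counts, traffic, prior_strength):
    """Shrink each intersection's accident rate towards the pooled rate.

    counts:         {name: number of accidents}
    traffic:        {name: traffic measure (exposure), > 0}
    prior_strength: how much pseudo-traffic the prior is worth
                    (assumed parameter; 0 gives back the raw rates).
    """
    total_accidents = sum(counts.get(name, 0) for name in traffic)
    total_traffic = sum(traffic.values())
    prior_rate = total_accidents / total_traffic  # prior estimated from the data itself

    return {
        name: (counts.get(name, 0) + prior_rate * prior_strength)
        / (traffic[name] + prior_strength)
        for name in traffic
    }


# Made-up numbers: one accident at a quiet intersection, fifty at a busy one.
counts = {"quiet": 1, "busy": 50}
traffic = {"quiet": 10, "busy": 5000}
shrunk = empirical_bayes_rates(counts, traffic, prior_strength=1000.0)
```

The shrunken rate is a weighted average of the raw rate and the pooled rate, so an intersection with little traffic is pulled strongly towards the city-wide average, while a busy intersection barely moves.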

What do we end up with after doing this? The following is a list of the most dangerous intersections we recover with this technique.

Many of the intersections we see are on Bloor or Queen, which are significant routes that do not have bike lanes.

None of these intersections are straight.

Avenue and Lonsdale

Bloor and Parliament

Broadview and Gerrard

Stay safe. Here’s a list of the top 50 most dangerous intersections that I determined using this method.
Top50 (Excel)
Top50 (csv)

As requested, a complete list.
Complete List (csv)

Visualizing One Day of Bixi Activity

Bixi is a bike sharing service in Toronto that allows short trips between stations. They provide a live update on the current number of bikes at each station, so I downloaded a day’s worth of data and created an animation showing how many bikes are at each Bixi station. This is my visualization of 24 hours’ worth of Bixi data from July 30th, 2013. Observe the movement towards the center at around 9am and the movement back at around 6pm.

There was a request for the same map but with the absolute number of bikes, which is here.
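If you would rather see the morning and evening movement as numbers, one simple option is to compare two snapshots of the station counts. This is only a sketch: the snapshot layout, the time labels, and the station names below are invented for illustration and are not Bixi's actual feed format.

```python
def net_change(snapshots, start, end):
    """Change in bike count per station between two snapshots.

    snapshots: {time_label: {station_name: bikes_available}}
    (hypothetical layout). Stations missing from either snapshot are skipped.
    """
    before, after = snapshots[start], snapshots[end]
    return {station: after[station] - before[station] for station in before if station in after}


# Invented data: bikes drain from a residential station towards a central one.
snapshots = {
    "07:00": {"residential": 12, "central": 3},
    "09:00": {"residential": 2, "central": 14},
}
morning = net_change(snapshots, "07:00", "09:00")
```

A positive value means a station gained bikes over the interval, so central stations should come out positive for the morning and negative for the evening.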

A Subreddit Interaction Map

I’m a redditor, so naturally I was thrilled to find a data set of voting patterns from reddit users who have made their votes public. In this post I am showing a visualization of that data. What I did is try to find groups of subreddits that are used by the same users. Specifically, for every pair of subreddits I asked if users use them both more often than would be expected by chance (a chi-squared test, with multiple test correction). Then I took the residual of the test (a measure of how many more users vote in both than expected by chance) and used that as a link between the two subreddits.
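Here is a rough sketch of what such a pairwise test can look like. Several details are assumptions on my part rather than anything stated above: the function names are hypothetical, the residual is taken to be the Pearson residual of the "votes in both" cell, and Bonferroni stands in for the unspecified multiple test correction.

```python
import math
from itertools import combinations


def pair_association(users_a, users_b, n_users):
    """Chi-squared test (1 degree of freedom) on the 2x2 table for two subreddits.

    Returns (chi2, p_value, residual), where residual is the Pearson residual
    of the cell counting users who vote in both subreddits.
    """
    both = len(users_a & users_b)
    only_a = len(users_a) - both
    only_b = len(users_b) - both
    neither = n_users - both - only_a - only_b
    observed = [[both, only_a], [only_b, neither]]
    row = [both + only_a, only_b + neither]
    col = [both + only_b, only_a + neither]
    if 0 in row or 0 in col:
        raise ValueError("test is undefined when a margin is zero")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n_users
            chi2 += (observed[i][j] - expected) ** 2 / expected

    expected_both = row[0] * col[0] / n_users
    residual = (both - expected_both) / math.sqrt(expected_both)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-squared survival function, 1 dof
    return chi2, p_value, residual


def build_edges(voters, n_users, alpha=0.05):
    """Keep pairs that co-occur more than chance after Bonferroni correction.

    voters: {subreddit: set of user ids}. Returns {(sub_a, sub_b): residual}.
    """
    pairs = list(combinations(sorted(voters), 2))
    threshold = alpha / len(pairs)  # Bonferroni (an assumed choice of correction)
    edges = {}
    for a, b in pairs:
        _, p_value, residual = pair_association(voters[a], voters[b], n_users)
        if p_value < threshold and residual > 0:
            edges[(a, b)] = residual
    return edges
```

Calling `build_edges` on a dictionary that maps each subreddit to its set of voters gives a weighted edge list, which is the kind of graph the next step clusters.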

Next I ran a clustering algorithm (Markov clustering) to break up this graph into manageable groups of subreddits. Below is the full graph, in which I’ve colored a subset of the clusters. It’s tough to see from this zoom, so I’m going to go through the groups in more depth below. Subreddits in gray ended up in smaller clusters.
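For the curious, the core of Markov clustering fits in a few lines. The following is a bare-bones, pure-Python sketch of the basic expansion and inflation loop, not the tool or the settings used for the graph here; the inflation value, the self-loop weight, and the thresholds are all assumed defaults, and a dense matrix like this only works for small graphs.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two steps of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[value ** power for value in row] for row in m])


def markov_cluster(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a graph given as a symmetric, non-negative weight matrix.

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(weights)
    # Self-loops keep every column sum positive and damp oscillation.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that still carry flow act as attractors; their non-zero columns form a cluster.
    clusters = {
        frozenset(j for j in range(n) if m[i][j] > 1e-6)
        for i in range(n)
        if max(m[i]) > 1e-6
    }
    return sorted(sorted(cluster) for cluster in clusters)
```

Feeding it a matrix whose entries are the pairwise residuals (zero where there is no link) returns groups of node indices. Raising the inflation value generally produces more, smaller clusters.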

High Resolution Version

This view contains a number of the major subreddits, in particular subreddits relating to politics (GREEN) as well as “depth” topics (RED, PINK, YELLOW).

This group has stuff like “pics”, “funny”, “aww” and I call it the lighter side of reddit (PURPLE)

This is a whole bunch of subreddits relating to computer programming (BLUE)

This is a group of pornographic subreddits (LIGHT BLUE)

I am going to call this group “safe for work porn” (ORANGE)

This is a group of computer game subreddits (YELLOW) as well as non-gaming role playing and fantasy (GREEN).

In this corner we have music related subreddits (SHADES OF ORANGE).

That’s all that I’m going to comment on. There are quite a few more groups that I don’t have enough space to go through, but I’ve attached the full list of clusters that come out of this: Full list of clusters


I’ve started this blog to share various data analysis projects that I undertake. I am a PhD student who plays a lot with large data sets about where genes are expressed and what they interact with. But I’ve realized that my interest in playing with data really isn’t limited to just this. As they say, don’t let school interfere with your education.