The Facebook Data Team recently published a ground-breaking piece of collaborative research, by Ugander, Karrer, Backstrom, and Marlow, which characterised the network structure of the entire active population of Facebook as of May 2011. That’s 721 million active users, who between them had formed 68.4 billion Facebook-friend relationships. You can read a summary of findings here, and the entire article can be downloaded from here.
Computing the network metrics characterising this population provided employment for 2,250 servers, deployed in a Hadoop cluster. Awesome™. Some of it was done on a 64 GB machine via a stream-based algorithm, and some on a 24-core 72 GB machine via a novel algorithm. (Still Awesome™.)
Why did they look at all their data, when doing so was pretty difficult? Personally, I think they fed their whole active user base into their crunchers partly because they could. Facebook is not an ordinary engineering company, and it is not just an engineering company, but it is definitely an engineering company.
But the reason given in the text is different:
Network completeness is especially important in the study of online social networks because unlike traditional social science research, the members of online social networks are not controlled random samples, and instead should be considered biased samples. (p. 2)
Now, that is really interesting. It is interesting not because it’s flawed, but because of the way it’s flawed. Making the Facebook ‘sample’ as big as humanly possible doesn’t help if what you want to talk about is the structural characteristics of human social networks. The authors argue that you need to look at the whole population because a mere sample of it would be biased, but that logic doesn’t hold: the whole population of Facebook is itself a biased sample if you view it as a sample of human social relationships. Taking all of it removes the sampling error within Facebook, but not the bias that comes from only looking at Facebook.
However, if what you want to talk about is the nature of Facebook-friendships, being able to look at the social graph patterns from the whole population of active Facebook users is just amazing. (And, of course, Awesome™.) But there’s simply no need to confuse Facebook with reality, and say the reason you are looking at your entire data set is to avoid sample bias. Looking at the whole shooting match doesn’t help.
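For anyone who likes to see the sampling argument in numbers, here is a toy sketch. Everything in it is made up for illustration: the population, the joining rule (people with more ties are assumed likelier to be members), and all the variable names. None of it comes from the paper or from Facebook data. It simply compares the average number of ties for everyone, for a full 'census' of members, and for a random sample of members.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the toy run is repeatable

# Hypothetical population: each entry is one person's number of
# real-world social ties (uniform between 1 and 200, purely illustrative).
population = [rng.randint(1, 200) for _ in range(100_000)]

# Assumed, made-up joining rule: the more ties a person has,
# the more likely they are to be a member of the site.
members = [ties for ties in population if rng.random() < ties / 200]

# A random sample of 1,000 members, to compare against the full member census.
member_sample = rng.sample(members, 1_000)

population_mean = mean(population)   # what we'd like to know
census_mean = mean(members)          # every single member
sample_mean = mean(member_sample)    # a mere sample of members

print(f"everyone:         {population_mean:.1f}")
print(f"all members:      {census_mean:.1f}")
print(f"sample of members: {sample_mean:.1f}")
```

Under these assumptions, the sample of members and the full census of members should land close to each other, and both should sit well away from the figure for everyone. Going from sample to census tells you more precisely about members; it does nothing about the gap between members and people in general.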
The authors go on to say:
…the most accurate representation of our social relationships will include as many people as possible. We are not there yet, but in this paper we characterise the entire social network of active members of Facebook in May 2011… (p. 2)
Again, a pivotal bit of confusion revealed by choice of language. The study reveals the structure of Facebook-friendships, and Facebook-social-networks. It does not study “the entire social network of active members of Facebook”.
The question of how people form and maintain relationships using Facebook is a truly fascinating one. So, too, is the question of how the design of Facebook facilitates and influences these fundamental human processes. The answers are important for a variety of reasons, theoretical and applied, personal and commercial. They are important to a huge range of stakeholders in the system in addition to Facebook itself. So I’m really glad Facebook thrashed thousands of servers to do this study, and I’m gladder still that they published it. I am looking forward to cherry-picking some of the tastier findings in future posts.
But I will always have social connections who aren’t connected to me on Facebook.