Information Flows during the 2011 Tunisian and Egyptian Revolutions (1-3)
This article, published in the International Journal of Communication, details the networked production and dissemination of news on Twitter during snapshots of the 2011 Tunisian and Egyptian Revolutions, as seen through information flows (sets of near-duplicate tweets) across activists, bloggers, journalists, mainstream media outlets, and other engaged participants. It differentiates between these user types and analyzes patterns of sourcing and routing information among them.
Information Flow Identification
We define an information flow as an ordered set of near-duplicate tweets. We identify these flows by finding very similar tweets in our datasets using the shingling method for string comparison, which converts a string of text (such as a tweet) into a fingerprint summary of the words it comprises. This fingerprint can then be efficiently compared against other strings (other tweets) to find near-duplicates. This methodology parallels the one used in Lotan’s visual analysis of tweets surrounding the 2009 Iranian election protests.
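As an illustration only, the near-duplicate test can be sketched in Python as below. This is not the authors' code: the shingle length (`k=3` words), the Jaccard similarity measure, the 0.6 threshold, and the example tweets are all assumptions made for the sketch, since the text does not specify them.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def is_near_duplicate(tweet_a, tweet_b, k=3, threshold=0.6):
    """Assumed rule: two tweets are near-duplicates if their shingle
    sets overlap by at least `threshold` (hypothetical value)."""
    return jaccard(shingles(tweet_a, k), shingles(tweet_b, k)) >= threshold


# Made-up example tweets, not taken from the datasets.
original = "Protesters gather in Tahrir Square as curfew begins"
retweet = "RT @user: Protesters gather in Tahrir Square as curfew begins"
unrelated = "Internet access restored in Cairo this morning"

print(is_near_duplicate(original, retweet))    # retweet shares most shingles
print(is_near_duplicate(original, unrelated))  # no shared shingles
```

Comparing shingle sets rather than whole strings means that a retweet prefix or a short appended comment changes only a few shingles, so the two tweets still land in the same flow.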
For each dataset, we sorted these sets of flows by total number of tweets, thus creating a rank-ordered set of retweeted posts. Since our goal was to characterize the most common information flows and assess users’ roles in dissemination, we wanted to make sure that our sample set included the longer flows, and not flows that consisted of small numbers of users. For that reason, we selected the top 10% of this rank order.
We recognize that this is not a representative sample of tweets but, rather, a selection of the most prominent information flows. Equivalently, we could have defined a flow as any group of retweets that included 19 or more posts in the Egypt dataset, and 16 or more in the Tunisia dataset.
We then randomly chose one-sixth of this top 10%, which resulted in a sample of 500 Egypt and 350 Tunisia flows. Out of these chosen flows, we extracted a list of users, whom we then classified into actor types (discussed in the next section).
To summarize how we arrived at the chosen flows, we did the following: 1) classified tweets that were very similar into bins, 2) sorted bins by size (number of tweets included), 3) chose the top 10%, and then 4) randomly chose one-sixth of them to identify a total of 850 flows which we would analyze in more detail.
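Steps 2 through 4 of this summary can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `select_flows`, the representation of a flow as a list of tweets, the rounding of the two fractions, and the fixed random seed are all hypothetical. Step 1 (binning near-duplicates) is assumed to have been done already.

```python
import random


def select_flows(flows, top_fraction=0.10, sample_fraction=1 / 6, seed=0):
    """Rank flows by size, keep the top fraction, then sample from it.

    flows: list of flows, each flow being a list of near-duplicate tweets.
    Returns a random subset of the largest flows.
    """
    if not flows:
        return []
    # Step 2: sort bins by size (number of tweets), largest first.
    ranked = sorted(flows, key=len, reverse=True)
    # Step 3: keep the top 10% of the rank order (at least one flow).
    top_n = max(1, round(len(ranked) * top_fraction))
    top = ranked[:top_n]
    # Step 4: randomly choose one-sixth of the top flows.
    sample_n = max(1, round(len(top) * sample_fraction))
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return rng.sample(top, sample_n)


# Made-up data: 120 flows containing 1, 2, ..., 120 tweets.
toy_flows = [["tweet"] * size for size in range(1, 121)]
chosen = select_flows(toy_flows)
print(len(chosen), sorted(len(flow) for flow in chosen))
```

With the toy data, the top 10% is the 12 largest flows, and one-sixth of those gives 2 sampled flows.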
Classifying Actor Types
There is a growing body of literature investigating how to quantify influence on Twitter, most prominently via in-degree (number of followers), retweets, mentions, and TunkRank (which takes into account followers of followers). In order to determine how information flows between actor types, we had to cut down the number of users that would be hand-coded. We selected 963 users in total, from both the Egypt and Tunisia datasets, who either were first to post in a flow, or were retweeted or mentioned at least 15 times. Of these 963, 774 were part of our Tunisia dataset, and 888 were present in our Egypt dataset; 699 (or 73%) of the actors we coded were involved to some extent in both datasets. We developed a classification schema based on the following types of actors, which was refined through several phases of coding:
- Mainstream media organizations (“MSM”): news and media organizations that have both digital and non-digital outlets.
- Mainstream new media organizations: blogs, news portals, or journalistic entities that exist solely online.
- Non-media organizations (“non-media orgs”): groups, companies, or organizations that are not primarily news-oriented.
- Mainstream media employees (“journalists”): individuals employed by MSM organizations, or who regularly work as freelancers for MSM organizations.
- Bloggers: individuals who post regularly to an established blog, and who appear to identify as a blogger on Twitter.
- Activists: individuals who self-identify as an activist, who work at an activist organization, or who appear to be tweeting purely about activist topics to capture the attention of others.
- Digerati: individuals who have worldwide influence in social media circles and are, thus, widely followed on Twitter.
- Political actors: individuals who are known primarily for their relationship to government.
- Celebrities (“celebs”): individuals who are famous for reasons unrelated to technology, politics, or activism.
- Researchers: individuals who are affiliated with a university or think-tank and whose expertise seems to be focused on Middle East issues.
- Bots: accounts that appear to be automated services tweeting consistent content, usually in extraordinary volumes.
- Other: accounts that do not clearly fit into any category.
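The rule used to pick which users to hand-code into these types (first to post in a flow, or retweeted or mentioned at least 15 times) can be sketched as below. This is an illustrative sketch only: the tweet representation (a dict with `user` and `text` keys), the function name `users_to_code`, and the choice to count retweets and mentions by scanning tweet text for `@name` patterns within the sampled flows are assumptions, not details given in the text.

```python
import re
from collections import Counter

# Assumption: both "RT @name" retweets and plain mentions show up as @name.
MENTION = re.compile(r"@(\w+)")


def users_to_code(flows, min_references=15):
    """Return users who posted first in a flow, or who were retweeted
    or mentioned at least `min_references` times across the flows."""
    # Each flow is an ordered list, so its first tweet is the first post.
    first_posters = {flow[0]["user"].lower() for flow in flows if flow}
    references = Counter()
    for flow in flows:
        for tweet in flow:
            names = MENTION.findall(tweet["text"])
            references.update(name.lower() for name in names)
    often_referenced = {
        user for user, count in references.items() if count >= min_references
    }
    return first_posters | often_referenced


# Made-up flows: "alice" and "dave" post first; "carol" is mentioned 16 times.
flow_one = [{"user": "alice", "text": "Curfew announced, via @carol"}] + [
    {"user": f"user{i}", "text": "RT @alice: Curfew announced, via @carol"}
    for i in range(15)
]
flow_two = [
    {"user": "dave", "text": "Crowds growing downtown"},
    {"user": "erin", "text": "RT @dave: Crowds growing downtown"},
]
print(sorted(users_to_code([flow_one, flow_two])))
```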
To allow multiple coders to hand-sort the 963 users into one of the above actor types, we built a browser-based coding tool that displayed the stored user profile data. Coders determined each user’s actor type by looking at their stored profile data, current Twitter profile and latest tweets, and any websites they linked to in their profile. Coders could also search the Internet for a user’s given name or handle to find personal websites, LinkedIn profiles, or bylines on news websites to help determine actor type. The first round of coding involved two different coders classifying each Twitter user. When coders disagreed on a user’s categorization, that user went through a second round of classification that required a third coder to choose. Finally, we were left with 42 users (4%) that still had three different actor types. These were coded through in-person consensus building. Four of the authors contributed to all rounds of the coding process.
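The three-stage resolution described above (two coders, a third coder on disagreement, then in-person consensus) can be sketched as a small decision function. The function name `resolve_actor_type`, the status strings, and the example labels are hypothetical, introduced only for this sketch.

```python
def resolve_actor_type(first, second, third=None):
    """Return (label, status) for one user given the coders' choices.

    first, second: actor types chosen in the first round of coding.
    third: actor type chosen by the third coder, if one was needed.
    """
    if first == second:
        # Round one: both coders agree, so no further coding is needed.
        return first, "agreed"
    if third is None:
        # Disagreement: the user must go to a third coder.
        return None, "needs_third_coder"
    if third in (first, second):
        # Round two: the third coder sides with one of the first two.
        return third, "third_coder"
    # Three different actor types: left for in-person consensus building.
    return None, "needs_consensus"


print(resolve_actor_type("Activists", "Activists"))
print(resolve_actor_type("Activists", "Bloggers"))
print(resolve_actor_type("Activists", "Bloggers", "Bloggers"))
print(resolve_actor_type("Activists", "Bloggers", "Digerati"))
```

The last case corresponds to the 42 users who still had three different actor types after the second round.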
Methodological Limitations
Limitations of our methodology fall into three categories: data representativeness, actor type edge cases, and selection bias.
First, the raw datasets do not contain all relevant Twitter messages: they lack tweets outside of the dates we studied, tweets that did not contain popularly used keywords, and those unavailable through the public Twitter API. The API’s limitations include the following: 1) tweets contain no curated topic-based metadata, so it is difficult to know whether the search terms used to collect the tweets are, in fact, related to a tweet’s content; 2) the API only returns the 1,500 most recent tweets, so when we queried Twitter every 5 minutes, we missed any tweets beyond the 1,500 most recent tweets from the preceding 5 minutes; and 3) in situations where Twitter is being used heavily, the platform’s own internal latency results in some tweets simply being missed without any indication by the API.
By Alula Berhe Kidani, 23/06/2012








