Automated event extraction tools for the study of the far-right

English Defence League members march in 2016. Photo from PA/PA Archive.

Protest event analysis (PEA) is a method of quantitative content analysis with a long tradition in the study of collective mobilization. Despite its limitations,[i] it offers social scientists the opportunity to document several characteristics of protest over time and across geographical locations. PEA takes the protest event, not the individual case study, as the unit of analysis, which provides a more systematic way of measuring the broader repertoire of contention and the relationship between social movements and their environment. In the context of far-right politics, PEA can enrich our understanding of this heterogeneous phenomenon, which appears to have regained a foothold in street actions. Moreover, it has recently been suggested that research on the far-right needs to move beyond the boundaries of the electoral arena and take into account developments in the protest arena.[ii]

PEA has traditionally, albeit not exclusively, relied on newspapers as a data source, as they are more likely to provide information on the variables of interest. However, this process can be labor-intensive if hand-coding is the chosen way of turning words into data. Instead, researchers can benefit from advances in computer science by deploying automated coding methods that facilitate the extraction of protest events. Automation enhances transparency and allows the use of multiple sources, which in turn may help address long-standing limitations of protest event data, such as selection bias, more efficiently.

Therefore, this CARR blog presents three open source programs, MPEDS, PETRARCH2, and Giveme5W1H, that, despite their different structures, can help researchers study the extra-institutional actions of the far-right. It is worth noting that the analytic power and utility of these tools can only be realized if they are adjusted to the requirements of each project. However, since their components are written in the Python programming language, familiar to many social scientists, the effort needed to customize them is relatively small.

Machine-Learning Protest Event Data System (MPEDS)

MPEDS is an automated coding mechanism that has been designed specifically for the generation of protest event data. Oliver and Hanna,[iii] for instance, have used it to explore the activities of the Black movement in the USA during the period 1994-2010. The system combines natural language processing and machine learning, and requires limited human intervention to produce outcomes. Human coders are needed only in a pre-processing step, where they classify news text into protest event variables; MPEDS later uses these classifications as training material to teach itself how to probabilistically pick out similar patterns in news text. As a consequence, the quality of the final results depends on the accuracy of the training input. Once the training process is complete, however, a few lines of code suffice to run the program.
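The idea of learning from hand-coded material can be illustrated with a very small example. The sketch below is not MPEDS code and does not reproduce its models; it assumes a toy naive Bayes classifier written with the Python standard library, and the training sentences and labels are invented for illustration. It shows only the general principle: human coders label a few texts, and the program then estimates the probability that an unseen text describes a protest.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-coded training material: (sentence, label) pairs.
TRAINING = [
    ("activists marched through the city centre in protest", "protest"),
    ("hundreds rallied outside parliament against the bill", "protest"),
    ("demonstrators marched and chanted slogans", "protest"),
    ("the council approved the new budget on monday", "other"),
    ("shares fell after the company reported losses", "other"),
    ("the minister announced a review of the policy", "other"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict_proba(text, model):
    """Return P(label | text) under naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Convert log scores into probabilities that sum to one.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


model = train(TRAINING)
probs = predict_proba("supporters marched in protest outside the town hall", model)
print(probs)
```

With so little training text the probabilities are only indicative, which mirrors the point made above: the output can be no better than the hand-coded input.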

MPEDS is packaged and distributed in two Docker containers, one for the CLIFF geolocation program and a second for the actual coding of news articles. At its current development stage, it has been trained to obtain data on the form of action, issue, target, social movement organization, location, and size of the event. Although there remain areas in social movement literature that cannot be put to the test with this information, MPEDS can shed light on vital questions that may refer, for instance, to the determinants of the trajectory of protest cycles or the dominant themes of a protest campaign. Equally important, of course, is its ability to process large amounts of text within minutes.

Python Engine for Text Resolution And Related Coding Hierarchy (PETRARCH2)

PETRARCH2 belongs to a set of software tools, including geolocation programs, coders, and dictionaries, among others, that have been developed by members of the Open Event Data Alliance (OEDA) to collect event data (a superset of protest event data). PETRARCH2 is an automated coder based on a completely different logic from MPEDS: its behavior, i.e. the decision about what constitutes an event and who the main protagonists are, is determined by instructions written in predefined ontologies. It looks for events at the sentence level. According to the authors of TERRIER, a recently published dataset that uses PETRARCH2 and covers approximately 40 years (1979-2016) of events around the world, the coder “locates actors, events, and targets in text, compares them to dictionaries that map short phrases to actor and event codes, and returns a complete event.”

An additional feature of PETRARCH2 is its flexibility; it can be used either as a standalone program or as part of a pipeline[iv] together with other utilities, such as web scrapers and the Mongo database, necessary for the extraction of event data. However, one should keep in mind that PETRARCH2 may need to be modified and adapted to the distinctive features of the case under examination. If you’re interested, for example, in the activities of newly formed far-right groups in the UK, you should update the list of actors; otherwise, the coder will not include the events in the dataset. PETRARCH2 also works with different ontologies, giving researchers more options when they design their research project. By default, it uses the Conflict and Mediation Event Observations (CAMEO) ontology, which is a prevalent framework among coding systems of this type today, but the OEDA has created the Political Language Ontology for Verifiable Event Records (PLOVER) with the aim of replacing CAMEO.
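The dictionary-based logic, and the reason why the list of actors matters, can be sketched in a few lines of Python. This is a simplified illustration and not PETRARCH2 code: the real coder parses sentence structure and reads much larger dictionaries in its own file format. The actor codes, the verb codes (chosen to look like CAMEO codes), and the group name "New Patriot Front" are all assumptions made up for this example.

```python
# Toy dictionaries mapping short phrases to codes (all codes are illustrative).
ACTORS = {
    "english defence league": "GBRFAR",  # hypothetical actor code
    "government": "GBRGOV",              # hypothetical actor code
}
VERBS = {
    "marched against": "141",  # CAMEO-style code, used here for illustration
    "rallied against": "141",
}


def find_actor(text, actors):
    """Return the code of the earliest actor phrase found in text, or None."""
    hits = [(text.find(phrase), code) for phrase, code in actors.items() if phrase in text]
    return min(hits)[1] if hits else None


def code_sentence(sentence, actors, verbs):
    """Return a (source, event, target) triple, or None if any part is missing."""
    text = sentence.lower()
    for phrase, event_code in verbs.items():
        pos = text.find(phrase)
        if pos == -1:
            continue
        # Source actor is looked up before the verb phrase, target actor after it.
        source = find_actor(text[:pos], actors)
        target = find_actor(text[pos + len(phrase):], actors)
        if source is None or target is None:
            return None  # unknown actor: the event is dropped
        return (source, event_code, target)
    return None  # no known verb phrase in the sentence


known = code_sentence(
    "Members of the English Defence League marched against the government.",
    ACTORS, VERBS,
)
missed = code_sentence(
    "Supporters of the New Patriot Front marched against the government.",
    ACTORS, VERBS,
)

# Adding the new (invented) group to the actor dictionary recovers the event.
updated_actors = dict(ACTORS, **{"new patriot front": "GBRFAR"})
recovered = code_sentence(
    "Supporters of the New Patriot Front marched against the government.",
    updated_actors, VERBS,
)
print(known, missed, recovered)
```

The second sentence is not coded until the group is added to the dictionary, which is the practical consequence of working with an outdated actor list.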


Giveme5W1H

Giveme5W1H is not a product of the social movement community; it is a tool that extracts the main event of a news article by answering the five W’s, namely ‘Who did What, When, Where, and Why,’ as well as ‘How they did it.’ Below is an example from the development page of Giveme5W1H that shows how it works (the six variables are labeled in parentheses):

Taliban (who) attacks German consulate (what) in northern Afghan city of Mazar-i-Sharif (where) with truck bomb (how)

The death toll from a powerful Taliban truck bombing at the German consulate in Afghanistan’s Mazar-i-Sharif city rose to at least six Friday, with more than 100 others wounded in a major militant assault.

The Taliban said the bombing late Thursday (when), which tore a massive crater in the road and overturned cars, was a “revenge attack” (why) for US air strikes this month in the volatile province of Kunduz that left 32 civilians dead. […]

Despite the fact that the underpinning logic of Giveme5W1H and the way the results are structured differ from the previous two approaches, given that there are no ontologies to represent the events with clearly defined codes, it can be a useful addition to the current list of methods for the study of various political phenomena. It could be used, for example, together with other coding techniques as a reference point for comparisons or, according to its creators, in frame analysis if one looks at how media actors report on the protest activities of far-right groups.
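To make the difference in output concrete, the result for the example article can be thought of as a record of six free-text answers rather than a coded event. The sketch below does not use the Giveme5W1H API; the `FiveW1H` class and its `summary` method are hypothetical names introduced here, and the values are copied from the example above.

```python
from dataclasses import dataclass, asdict


@dataclass
class FiveW1H:
    """Hypothetical container for the six answers describing a main event."""
    who: str
    what: str
    when: str
    where: str
    why: str
    how: str

    def summary(self):
        """Join the answers into a single readable line."""
        return " | ".join(f"{key}: {value}" for key, value in asdict(self).items())


# Values taken from the annotated example article.
example = FiveW1H(
    who="Taliban",
    what="attacks German consulate",
    when="late Thursday",
    where="in northern Afghan city of Mazar-i-Sharif",
    why="revenge attack",
    how="with truck bomb",
)
print(example.summary())
```

Because the answers are phrases and not ontology codes, comparing them with the output of a coder such as PETRARCH2 would require an additional mapping step.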

To conclude, coding news text is demanding and constitutes only one of the many steps that researchers have to go through before they collect protest event data. Other steps that require careful thought include the conceptualization of the unit of analysis, the identification of relevant sources and news articles, and data cleaning. Machine-assisted coding, if conducted properly, can make a significant contribution to this endeavor. It saves time and, most importantly, ensures that the data can be replicated. It is not flawless, however: ontologies such as CAMEO were written for the study of international interactions and not specifically for social movements; the coders may fail to catch event properties that the human eye would probably not miss; and false positives may come up. This is why social scientists should be aware of the weaknesses of the above approaches and take measures to mitigate them; otherwise, these biases are likely to undermine the quality of the final datasets.
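One possible way of gauging missed events and false positives, offered here as an assumption and not as a procedure prescribed by any of the three tools, is to compare machine-coded events against a hand-coded sample of the same articles and to compute precision and recall. The event tuples below are invented for illustration.

```python
def precision_recall(machine_events, hand_events):
    """Compare machine-coded events with a hand-coded sample.

    Precision: share of machine-coded events that the human coders also found.
    Recall: share of hand-coded events that the machine also found.
    """
    machine, hand = set(machine_events), set(hand_events)
    true_positives = len(machine & hand)
    precision = true_positives / len(machine) if machine else 0.0
    recall = true_positives / len(hand) if hand else 0.0
    return precision, recall


# Hypothetical events identified as (date, city, actor).
hand_coded = [
    ("2016-02-06", "Dover", "group A"),
    ("2016-04-02", "Birmingham", "group B"),
    ("2016-05-21", "London", "group A"),
    ("2016-07-16", "Nottingham", "group C"),
    ("2016-09-10", "Leeds", "group B"),
]
machine_coded = [
    ("2016-02-06", "Dover", "group A"),
    ("2016-04-02", "Birmingham", "group B"),
    ("2016-05-21", "London", "group A"),
    ("2016-06-01", "Bristol", "group D"),  # false positive
]

precision, recall = precision_recall(machine_coded, hand_coded)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision points to false positives, while low recall points to events that the coder missed; both can inform adjustments to dictionaries or training material.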

Mr Andreas Dafnos is an Early Career Fellow at CARR and a doctoral candidate at the University of Sheffield. His profile can be found here:

© Andreas Dafnos. Views expressed on this website are individual contributors’ and do not necessarily reflect that of the Centre for Analysis of the Radical Right (CARR). We are pleased to share previously unpublished materials with the community under creative commons license 4.0 (Attribution-NoDerivatives).

[i] Hutter, S. (2014). Protest Event Analysis and Its Offspring. In D. della Porta (Ed.), Methodological Practices in Social Movement Research (pp. 335–367). Oxford University Press.

[ii] Castelli Gattinara, P., & Pirro, A. L. P. (2018). The far right as social movement. European Societies, 1–16.

[iii] Oliver, P., & Hanna, A. (2017). Black Protest in US News Wire Stories 1994-2010: Voices From the Doldrums.

[iv] For a tutorial on how to use the pipeline, see here
