Trust <in> formation
Workshop during IMPAKT Festival
Sunday 28th of October
13:30h - 15:30h
impakt.nl/nl/festival/programme/workshops/trust-formation
Please use this document for your further reading only. Due to the sensitive nature of the interview, we have yet to confirm details with Amir before making it public.
In the wake of the current media climate, we find ourselves turning to Wikipedia as a reliable source of information. This might be surprising for someone who has known Wikipedia since its early beginnings, when reputable academic publications cast doubt on it. Even in high schools, Wikipedia was considered an unreliable source, and it’s fresh in our memories how we were not allowed to quote from or refer to the platform. Meanwhile, the question of article quality, ‘objectivity’ and vandalism on Wikipedia is still under discussion, and becomes especially challenging when the daily edit count reaches half a million.
However, with the tireless help of volunteers from the Wikimedia community, ‘all bugs are shallow’ [1], especially when you add machine learning algorithms to the mix.
Machine learning can be regarded as the ability of a computer programme to spot patterns in large sets of data using advanced statistical models. Wikipedia uses machine learning to build forms of automation that help editors with the assessment of new edits, unfinished articles or acts of vandalism. One of the reasons why we decided to focus on the machine learning services that Wikimedia is working on is the way the project presents itself: ORES, the project under consideration today, is declared by Aaron Halfaker to be a feminist-inspired project. We became curious to dive deeper into the project, to explore what this feminist approach contains, how it deals with consensus making and what it could teach us when thinking about ethical ways to do machine learning.
Last year, in November 2017, we met Amir Sarabadani, who is one of the developers of ORES working at Wikimedia Deutschland. We invited him in the context of an algorithmic literature event in Brussels, where we asked him to speak about ORES. It was our first introduction to the project.
On the sidelines of that event, Cristina Cochior and Femke Snelting interviewed Amir. We propose to read a part of the interview together, where Amir speaks about objectivity and vandalism within Wikipedia.
Cristina: I thought it was interesting that yesterday in your talk you referred to the concepts of ‘subjective’ and ‘objective’. You said that the assessment of vandalism is subjective, because it comes down to the personal interpretation of what vandalism is, but then you referred also to the objectivity principle on which Wikipedia is based. You seemed to view these two concepts as coexisting on the same platform. Did I read that correctly?
Amir: Well, the thing about Wikipedia, especially the policies, is that it’s not very objective. It’s very open to interpretations and it’s very complicated. I don’t know if I told you but there is a law of Wikipedia that says ignore all rules. It means do everything you think is correct and if there’s a problem and you’re violating anything, it could be that you come to a conclusion that maybe we should change that law. It happens all the time. If there’s a consensus, it can be changed. There is a page on Wikipedia that’s called Five Pillars and the five pillars say that except these five pillars, you can change everything. Although I don’t think that’s very objective, everything is subjective on Wikipedia. But when there is an interaction of lots of people, it becomes more natural and objective in a way because there is a lot of discussion and sometimes there are people who try to change others’ opinions about some issues. When this happens, it makes everything more neutral. In another way, [the sounds are hard to distinguish at this point] by making battles, they both fight and the result is something neutral, which to some degrees is not great, but…
Cristina: Do you think that the result is aiming to be neutral?
Amir: The result is aiming to be neutral. And it is, because of the integration of lots of people that are cooperating with each other and who are trying to get things done in a way that doesn’t violate policies. So they tolerate things that they don’t like in the article or sometimes they even add them [themselves] to make it more neutral.
Femke: Could you give any examples of that?
Amir: The biggest problem is usually writing about religion. I have seen people who are against a religion and try to make a criticism and when they are writing the article they try to add something in there, like a defence of Muslims, in order to make it more neutral. The people who contribute usually value these pillars, including the pillar of neutrality.
Femke: I was wondering – when you were speaking – you could say there is vandalism that is not targeted, that is about ‘can I write something’, but there is also vandalism that is about breaking something to signal disagreement or irritation with a certain topic. Do you ever look at the relation between where the vandalism goes and what topics are being attacked?
Amir: Well, I didn’t, but there are lots of topics about that and things that people have strong feelings about are always good targets for vandals. There is always vandalism around things that have strong politics. It can be sports, it can be religion, it can be any sensitive subject like homosexuality, abortion, in these matters it happens all the time. One thing that I think about is that sometimes when people are reading articles on Wikipedia, sometimes it’s outside of their comfort zone, so they try to change the article and bring it back in, instead of expanding it.
Cristina: I was wondering actually about this sense of ownership that some editors have over their articles. A lot of the reverts and debates that are happening behind the scenes of a specific article are due to the fact that one person started creating the page and put a lot of effort into it and someone else wants to implement some changes with which the first person does not agree. Do you think that the fact that there is only one face of Wikipedia is related to that? If you had the possibility to have multiple readings of one page, then there would also be more views on one subject.
Amir: There have been lots of debates about this. I don’t know if you know Vox, the media company from the United States? One thing that they tried to implement was to make a version of Wikipedia that is customisable. For example, if you are pro-Trump, you are given a different article than someone who is a democrat. But immediately you can see the problem, it diverges people. Like what Facebook is doing right now, making people live inside their bubbles. I think this is the reason why the people on Wikipedia are fighting against anything that has this divisive effect.
Femke: That’s one way of seeing it, I understand. If you would make multiple wikis, you would support, and that’s happening a lot, that separation of world views that is algorithmically induced. I understand why that raises concern. But on the other hand, there is somehow the self-defined need to always come to a consensus. I’m wondering if this is always helpful for keeping the debate alive.
Amir: For Wikipedia, they knew that consensus is not something you can always reach. They invented a process called Conflict Resolution. When people talk and they see that they cannot reach any consensus, they ask for a third-party opinion. If they couldn’t find any agreement with the third-party opinion, they call for the mediator. But mediators do not have any enforcement authority. If mediators can resolve the conflict, then it’s done, otherwise the next step is arbitration. For example the case of Chelsea Manning. What was her name before the transitioning? I think it’s Brandon Manning [Bradley Manning], right? So, there was a discussion over what the name on Wikipedia should be: Chelsea Manning or Brandon Manning. So there was lots of transphobia in the discussion and when nothing worked, it went to an ArbiCom (Arbitration Committee). An arbitration committee is like a very scary place, it has a court and they have clerks that read the discussion and the outcome was obviously that it should stay Chelsea Manning. It’s not like you need to reach consensus all the time, sometimes consensus will be forced on you. Wikipedia has a policy saying “Wikipedia is not”. One of the things Wikipedia is not is a place for democracy.
Femke: The Chelsea Manning case is interesting, I didn’t think about it. Is this - let’s say - verdict archived in the article somewhere?
Amir: The cases usually happen on the ArbiCom page and on the case as such’s page. But finding this in the discussion is hard.
In the past, Wikipedia has created a divide between researchers with regard to how trustworthy it is as a reference. The comparison between Wikipedia and Encyclopedia Britannica was often made and still is to this day. See for example this recently published article by the Harvard Business School, which looks at a selection of 4000 articles and concludes that there is a bias on the English Wikipedia towards the US Democratic party, but that this bias is not very strong. Although research like this still needs to be examined closely (while it is a fact that the majority of English Wikipedia editors come from the US, there are plenty of editors from elsewhere who do not relate to the Democrat/Republican divide), it shows that doubt is still cast on the encyclopedia. There’s even a page on Wikipedia that documents the many times the site has been called into question [2].
However, in the last few years, as the term ‘fake news’ has received increasing coverage, Wikipedia has acquired a new image: it is now also pictured as a trustworthy knowledge platform. See for example the titles of the following articles:
The discussion around this topic has also reached Wikipedia editors. There is already a page that collects all the fake news websites that the editors encounter: https://en.wikipedia.org/wiki/List_of_fake_news_websites. The volunteers also reached an agreement to ban the Daily Mail [3] and Breitbart [4] from being used as references in articles. However, as often happens in large-scale organisations, this decision was taken by a small group and it might take a while until it travels across the whole community.
Wikipedia is still among the ten most visited websites in the world, which says a lot about the visibility and influence of the project. If, for example, someone decides to play a practical joke and changes the national anthem of Bulgaria to Despacito, this will in turn prompt Siri to adopt this false idea [5].
Wikipedia’s governing structure is often compared to democratic principles. However, the heavy bureaucratic structures of Wikipedia are hierarchical and the overall goal of the project is to reach consensus, not to follow the majority’s opinion. The case of the Chelsea Manning page and the Arbitration Committee is a good example of this. The Wikipedia page What Wikipedia is not describes how Wikipedia is not a democracy:
‘Wikipedia is not an experiment in democracy or any other political system. Its primary (though not exclusive) means of decision making and conflict resolution is editing and discussion leading to consensus—not voting (voting is used for certain matters such as electing the Arbitration Committee). Straw polls are sometimes used to test for consensus, but polls or surveys can impede, rather than foster, discussion and should be used with caution’.
Wikipedia’s striving for objectivity and a Neutral Point of View, which is the general guideline for Wikipedia editors, means that not all content is accepted. This has sparked a lot of debate and backlash from critics, but also from fringe actors. As a result, groups of people with fringe political positions who did not feel represented by Wikipedia have decided to make their own wikis. Some examples:
Now we could do a little Chelsea Manning test, in which we look for a page about Chelsea Manning on each of these wikis to see which name they chose for the article.
But how does a website that is open to editing float above the fake news waters? Wikipedia has a lot of systems in place that attempt to identify damaging intentions towards the site. For example, there are ways to limit the ‘editability’ of a page to editors who have had an account for a certain number of years, or to editors who have a higher status (e.g. administrators). There are also multiple types of machine learning algorithms in place that are mobilised to detect vandalist tendencies. We can mention two: CluebotNG and ORES.
Profile image of CluebotNG. Source: https://en.wikipedia.org/wiki/User:ClueBot_NG
CluebotNG is a machine learning programme using neural networks to identify and revert vandalist edits on Wikipedia. It was made and maintained by Christopher Breneman (Crispy1989), Tim1357, and Jacobi Carter (Cobi). It was filed to become a bot account on Monday October 25, 2010. Since it has been approved, it has been active on the English Wikipedia.
CluebotNG raised a heated discussion when it was first proposed as an alternative way to fight vandalism. The community opposed the initial parameters that allowed the algorithm to catch more vandalist edits, while at the same time generating many side-effects, such as false positives.
Currently the bot is very popular with the community, receiving a lot of praise for being very efficient in fighting vandals. However, some might argue that despite its popularity, CluebotNG is driving newcomers away through its categorical decision making (something is either vandalism or not, and if it is, the edit will be reverted directly).
Diagram that is used on the ORES page to illustrate the project. Source: https://www.mediawiki.org/wiki/ORES#/media/File:ORES_edit_quality_flow.svg
what is it?
ORES (which stands for Objective Revision Evaluation Service) is a feminist machine learning service developed at the Wikimedia Foundation, the non-profit organisation that hosts Wikipedia and other free knowledge projects. The project was developed to maintain the quality of Wikipedia at its current scale: Wikipedia is edited half a million times per day. To empower its volunteers in the processing of all these edits, ORES is built both to make quality control more efficient and to make Wikipedia a more welcoming place for new editors. Wikipedia, especially the English one, is considered to be a hostile environment for newcomers: very often, when users do not directly comply with the guidelines of Wikipedia, their first edits will be reverted.
Speedy deletion anecdote: https://en.wikipedia.org/wiki/User_talk:Clco
By highlighting edits that need review, we can reduce the overall reviewing workload of our volunteers by a factor of 10. This turns a 270 hours per day job into a 27 hours per day job. This also means that Wikipedia could grow by 10 times and our volunteers could keep up with the workload. https://wikimediafoundation.org/2018/10/10/mitigating-biases-in-artificial-intelligences-the-wikipedian-way/
ORES is not a machine learning product, like for example Siri or Google Translate. Rather, it is a machine learning service. This means that ORES is not built to perform a specific task at a specific place. Instead, it provides results as data endpoints (a so-called API, or Application Programming Interface) that other projects and tools can use. It curates and highlights, but doesn’t revert any edits that are made. It only provides the machine learning calculations, in order to stimulate and support many other tools to be built on top of them.
An example API request to ORES requires an input (a so-called revid number, which represents a specific edit) and returns a list of numbers as output (structured in the JSON data format). You can choose what kind of prediction you would like to get back by choosing a specific model (for example Edit Quality or Article Quality). A full API request looks like the following: https://ores.wmflabs.org/v3/scores/enwiki/?models=draftquality%7Cwp10&revids=34854345%7C485104318.
One thing that makes ORES different from many other machine learning projects is that the team chose not to focus on profiling the user. Instead of rating the edits that an editor makes on the basis of all the previous work that the editor has done, the ORES team decided to only work with information that comes from the edits themselves, or the activities around the edits, such as deletion. ORES looks at four different aspects of the editing process:
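To make the shape of such a request more tangible, here is a minimal Python sketch that assembles the example URL above from its parts. The function name `build_ores_url` is our own and hypothetical; only the base address, the wiki name, the model names and the revid numbers come from the example request. The sketch does not contact the service, it only builds the address.

```python
from urllib.parse import urlencode

# Base address copied from the example request in the text.
ORES_BASE = "https://ores.wmflabs.org/v3/scores"


def build_ores_url(wiki, models, revids):
    """Assemble a scoring URL for one wiki, a list of models and a list of revids.

    Hypothetical helper for illustration; it is not part of ORES itself.
    """
    # Several models or revids are joined with "|", which urlencode
    # percent-encodes as "%7C", just like in the example request.
    query = urlencode({
        "models": "|".join(models),
        "revids": "|".join(str(revid) for revid in revids),
    })
    return f"{ORES_BASE}/{wiki}/?{query}"


example_url = build_ores_url("enwiki", ["draftquality", "wp10"], [34854345, 485104318])
print(example_url)
```

The resulting address can be pasted into a browser. Whether it still answers depends on the service being reachable at that address.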
Edit Quality
Article Quality
More detailed information about the different types of models can be found here: https://www.mediawiki.org/wiki/ORES.
who is behind it?
The service is developed by the Wikimedia Scoring Platform team, which currently consists of six people:
Aaron Halfaker (Principal Research Scientist, Team Lead)
Amir Sarabadani (Software Engineer (WMDE))
Adam Wight (Software Engineer (WMF))
James Hare (Associate Product Manager (WMF))
Max Klein (Software Engineer (WMF))
Marius Hoch (Software Engineer (WMDE))
More information about the team can be found here: https://www.mediawiki.org/wiki/Wikimedia_Scoring_Platform_team
how does it process edits?
To predict whether a new Wikipedia edit was made in good faith and whether it is damaging or not, ORES needs to work with features. Features are information points that can be extracted from the edits or their surroundings, and that function as the informative data that ORES uses to make its calculations.
For example, not only the content of the edit itself, but also the short summaries that editors attach to their edits are regarded as features. The same goes for information about the type of article in which the edit is made, or even the section or element in which the edit appeared.
The following code from the Wikimedia GitHub shows what types of features are used to calculate Edit Quality.
    damaging = wikipedia.page + \
        wikitext.parent + wikitext.diff + mediawiki.user_rights + \
        mediawiki.protected_user + mediawiki.comment + \
        badwords + informals + dict_words
On the following pages, you can further inspect how features are defined:
Disclaimer: this is something we are still trying to grasp. [7]
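As a way of thinking along, the idea of features can be sketched in a few lines of Python. This is a toy illustration under our own assumptions and not the code that ORES runs: the word lists, the feature names and the `extract` helper are all hypothetical. What it does borrow from the snippet above is the principle that groups of features are combined with `+` into one list.

```python
import re
from collections import Counter

# Hypothetical word lists; the lists ORES uses are language-specific and far longer.
BADWORDS = {"stupid", "idiot"}
INFORMALS = {"lol", "haha", "hey"}


def words(text):
    """Split a text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def added_words(edit):
    """Words that occur more often after the edit than before (a crude diff)."""
    before = Counter(words(edit["parent"]))
    after = Counter(words(edit["text"]))
    return list((after - before).elements())


# Each feature is a (name, function) pair; each group is a list of features.
diff_features = [("words_added", lambda e: len(added_words(e)))]
comment_features = [("has_comment", lambda e: int(bool(e["comment"].strip())))]
badwords = [("badwords_added", lambda e: sum(w in BADWORDS for w in added_words(e)))]
informals = [("informals_added", lambda e: sum(w in INFORMALS for w in added_words(e)))]

# Groups are concatenated with "+", as in the snippet from the Wikimedia GitHub.
damaging = diff_features + comment_features + badwords + informals


def extract(features, edit):
    """Compute every feature for one edit and return the values by name."""
    return {name: function(edit) for name, function in features}


edit = {
    "parent": "Sofia is the capital of Bulgaria.",
    "text": "Sofia is the capital of Bulgaria lol you stupid.",
    "comment": "",
}
values = extract(damaging, edit)
print(values)
```

A model would then be trained on many such lists of numbers, each paired with a human label, rather than on the raw text of the edits.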
why is it a feminist endeavor?
In the initial blogpost that Aaron Halfaker wrote to introduce ORES, he mentions how the design of ORES is based on feminist principles. [8] [9]
Some other reasons why ORES is considered feminist in its approach:
Infrastructural approach: One of the main ideas behind ORES was not to build a ‘full stack’ machine learning project. It is deliberately designed as a service that other developers can build on top of. By doing this, the threshold for making new tools or utilities is much lower, as the groundwork of gathering and labelling data, as well as training the machine learning models, has already been taken care of.
Transparency: as opposed to other machine learning algorithms, the decisions that have led to the making of the models are openly available on the many pages that document the project. Despite the labyrinthine expansion of the documentation, which can be quite confusing to navigate, the members of the team are very fast to respond to queries and try to include as many volunteers in the process as possible through participation in Wikimedia-organised hackathons.
Refusal to profile: Cluebot NG profiles editors by looking up whether the user is anonymous and, if so, where the user’s IP address comes from and at what time the edit was made. In contrast, ORES refers only to the quality and intention of the text itself.
Newcomer friendliness: ORES began in the first place as a response to the number of newcomers who felt discouraged from participating in the editing process by anti-vandalism bots, such as Cluebot NG, or by the hostile attitude of some long-time editors.
Participation in the decision making process: editors are allowed to play an active role in the algorithm’s mechanisms. The WikiLabels tool was specially made to invite editors to train the Edit Quality model. This also became a useful tool for better understanding the decision making process of ORES. Another tool that is currently being developed by the Scoring Platform team is JADE, a MediaWiki extension that editors can use to annotate their editing work. This information is then connected to the ORES workflow, to create a feedback loop that re-trains ORES on the basis of the latest editing work.
First we will start with a 15-minute exercise in which we will be introduced to the WikiLabels system. WikiLabels is a human computing service for Wikipedia, developed by the ORES developers to involve Wikipedia users in the process of validating ORES results. We will be validating a set of edits on two parameters: good faith and damaging.
We will start by reading a little bit more from the interview with Amir, in which he starts to speak about the ideas behind good faith and damaging edits.
Amir: The thing we are trying to tackle in terms of Wikipedia editing, we are trying to make a model not just in terms of binary separation. We have a good faith model, which predicts with the same system between one and zero that an edit has been made in good faith or not. For example, if you see if an edit is damaging, but it was made with a good intention. You see many people that want to help, but because they are new, they make mistakes. We try to tackle this by having another model. So if an edit has both a high vandalist score and a high bad intent score, we can remove it with bots and we can interact with people who make mistakes but have a good intention.
Cristina: And how do you see the good faith principle in relation to neutrality?
Amir: I think it’s completely related and I think it comes down to is this user trying to help Wikipedia or not: this is our brainstorm.
Femke: If you talk about the distinction between good faith and bad faith, it is still about faith-in-something. If you plot the faith according to the line of a neutral point of view, you’re dealing with a different type of good faith and goodness than if you plot the faith along the line of wanting more points of view.
Amir: I see. I think good faith means good intent. By defining what is ‘good’ in this way, we are following the principles of the whole Wikipedia, good is helping people. Although, it is a very subjective term, and what we are trying to do right now is to make some sort of survey. To take out things that are very computative and can’t be measured easily, like quality, and ask people whether they think an edit looks good or bad. To make things more objective, to make things come together from the integration of observations of lots of people. Obviously, there are a lot of gray areas.
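The way Amir describes combining the two models, one for damaging edits and one for good faith, can be sketched as a small decision rule. This is our own reading of the interview and not how any Wikipedia bot actually decides: the function name `triage`, the threshold of 0.8 and the three outcomes are assumptions made for illustration. Both scores are taken to be numbers between zero and one, as in the interview.

```python
def triage(damaging, goodfaith, threshold=0.8):
    """Suggest what to do with an edit, given two scores between 0 and 1.

    Hypothetical rule: the threshold and the outcome labels are assumptions.
    """
    if not (0.0 <= damaging <= 1.0 and 0.0 <= goodfaith <= 1.0):
        raise ValueError("scores must lie between 0 and 1")

    looks_damaging = damaging >= threshold
    # A low good faith score is read here as a high bad intent score.
    looks_badfaith = (1.0 - goodfaith) >= threshold

    if looks_damaging and looks_badfaith:
        # Damaging and ill-intended: a candidate for removal by a bot.
        return "revert"
    if looks_damaging:
        # Damaging but not clearly ill-intended: a mistake, so a human reaches out.
        return "guide the editor"
    return "keep"


print(triage(0.95, 0.05))  # damaging, with bad intent
print(triage(0.90, 0.90))  # damaging, but well-meant
print(triage(0.10, 0.90))  # not damaging
```

The point of keeping two separate scores is visible in the middle case: an edit can be a problem for the article without its author being treated as a vandal.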
Exercise:
For the purpose of the workshop, we’ve written a script that will use ORES to rate a Wikipedia article. We will be writing the article together. The article topic is chosen by us as a group. ORES will be rating it using the following two parameters:
Amir described three types of vandalists: newcomers, cyber-warriors (who push their agendas or messages) and those who vandalise for fun. We will be working with the first two types. One half of the group will be writing as newcomers, the other half as cyber-warriors.
To start:
! There will be a timekeeper who keeps an eye on the time and warns you when a minute has passed !
! If you get a badfaith you are out !
$ curl https://pad.vvvvvvaria.org/trust%3Cin%3Eformation.css/export/txt > stylesheet.css && curl https://pad.vvvvvvaria.org/trust%3Cin%3Eformation/export/txt | pandoc -f markdown -t html --toc -H stylesheet.css -s -o reader.trust-in-formation.html && rm stylesheet.css
[1] “Given enough eyeballs, all bugs are shallow.” Linus’ Law, as formulated by Eric S. Raymond in The Cathedral and the Bazaar (1999).
[2] https://en.wikipedia.org/wiki/Reliability_of_Wikipedia
[3] Article in The Guardian relating the story: https://www.theguardian.com/technology/2017/feb/08/wikipedia-bans-daily-mail-as-unreliable-source-for-website. How the Daily Mail saw this ban: https://www.dailymail.co.uk/news/article-4280502/Anonymous-Wikipedia-activists-promote-warped-agenda.html
[4] How Breitbart received the news: https://www.breitbart.com/tech/2018/10/03/breitbart-blacklisted-from-use-on-wikipedia-as-reliable-source/
[5] This has actually happened: https://www.reddit.com/r/softwaregore/comments/74epbw/siri_thinks_the_national_anthem_of_bulgaria_is/ Thanks go again to Amir for telling us about this great case.
[6] Alexis Sobel Fitts (2017) Welcome to the Wikipedia of the Alt-Right, in Wired https://www.wired.com/story/welcome-to-the-wikipedia-of-the-alt-right/
[7] Many thanks to Amir for his patience in guiding us through the documentation.
[8] Blogpost on the Wikimedia Foundation website in which ORES was initially released: Aaron Halfaker & Dario Taraborelli (2015) Artificial intelligence service “ORES” gives Wikipedians X-ray specs to see through bad edits https://wikimediafoundation.org/2015/11/30/artificial-intelligence-x-ray-specs/
[9] Designing The Numbers That Govern Wikipedia: Aaron Halfaker on Machine Learning in Large-Scale Open Production https://civic.mit.edu/2016/02/05/designing-the-numbers-that-govern-wikipedia-aaron-halfaker-on-machine-learning-in/