<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Data Workers</title>
<!-- <link rel="stylesheet" href="stylesheet.css"> -->
</head>
<body>
<section class="language en"><p><br/>
Data Workers, an exhibition at the <a class="external text" href="http://www.mundaneum.org/en" rel="nofollow">Mundaneum</a> in Mons from 28 March until 29 April 2019.
</p><p>The <b>opening</b> is on <b>Thursday 28 March from 18:00 until 22:00</b>. As part of the exhibition, we have invited <b><a class="external text" href="https://www.decontextualize.com/" rel="nofollow">Allison Parrish</a></b>, an algoliterary poet from New York. She will give a <b>talk</b> in <a class="external text" href="https://www.passaporta.be/en" rel="nofollow">Passa Porta</a> on Thursday evening 25 April and a <b>workshop</b> in the Mundaneum on Friday 26 April.
</p>
<h2 id="about"><span class="mw-headline" id="About">About</span></h2>
<p>Data Workers is an <b>exhibition of algoliterary works</b>, of stories told from an algorithmic storyteller's point of view. The exhibition was created by members of Algolit, a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with F/LOSS code and texts. Some works are by students of Arts² and external participants in the workshop on machine learning and text organized by Algolit in October 2018 at the Mundaneum.
</p><p>Companies create <b>artificial intelligence (AI) systems</b> to serve, entertain, record and learn about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors. The data workers operate in different <b>collectives</b>. Each collective represents a stage in the design process of a machine learning model: there are the Writers, the Cleaners, the Informants, the Readers, the Learners and the Oracles. The boundaries between these collectives are not fixed; they are porous and permeable. At times, Oracles are also Writers. At other times Readers are also Oracles. Robots voice experimental literature, while algorithmic models read data, turn words into numbers, make calculations that define patterns and are able to endlessly process new texts ever after.
</p><p>The exhibition <b>foregrounds data workers</b> who impact our daily lives, but are either hard to grasp and imagine or removed from the imagination altogether. It connects stories about algorithms in mainstream media to the storytelling that is found in technical manuals and academic papers. Robots are invited to engage in dialogue with human visitors and vice versa. In this way we might understand our respective reasonings, demystify each other's behaviour, encounter multiple personalities, and value our collective labour. It is also a tribute to the many machines that <a class="external text" href="https://en.wikipedia.org/wiki/Paul_Otlet" rel="nofollow">Paul Otlet</a> and <a class="external text" href="https://en.wikipedia.org/wiki/Henri_La_Fontaine" rel="nofollow">Henri La Fontaine</a> imagined for their Mundaneum, showing their potential but also their limits.
</p>
<section class="group"><section class="lemma contextual-stories-about-algolit stories"><h3 class="lemmaheader" id="contextual-stories-about-algolit">Contextual stories about Algolit</h3><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Why_contextual_stories.3F"><span class="tocnumber">1</span> <span class="toctext">Why contextual stories?</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#We_create_.27algoliterary.27_works"><span class="tocnumber">2</span> <span class="toctext">We create 'algoliterary' works</span></a></li>
<li class="toclevel-1 tocsection-3"><a href="#What_is_literature.3F"><span class="tocnumber">3</span> <span class="toctext">What is literature?</span></a></li>
<li class="toclevel-1 tocsection-4"><a href="#An_important_difference"><span class="tocnumber">4</span> <span class="toctext">An important difference</span></a></li>
</ul>
</div><h2 id="why-contextual-stories"><span class="mw-headline" id="Why_contextual_stories.3F">Why contextual stories?</span></h2><p>During the monthly meetings of Algolit, we study manuals and experiment with machine learning tools for text processing. And we also share many, many stories. With the publication of these stories we hope to recreate some of that atmosphere. The stories also exist as a podcast that can be downloaded from <a class="external free" href="http://www.algolit.net" rel="nofollow">http://www.algolit.net</a>.
</p><p>For outsiders, algorithms only become visible in the media when they achieve an outstanding performance, like AlphaGo, or when they break down in fantastically terrifying ways. Humans working in the field, though, create their own culture on- and offline. They share the best stories and experiences during live meetings, research conferences and annual competitions like <a class="external text" href="https://www.kaggle.com/" rel="nofollow">Kaggle</a>. These stories that contextualize the tools and practices can be funny, sad, shocking, interesting.
</p><p>A lot of them are experiential learning cases. The implementations of algorithms in society generate new conditions of labour, storage, exchange, behaviour, copy and paste. In that sense, the contextual stories capture a momentum in a larger anthropo-machinic story that is being written at full speed and by many voices.
</p><h2 id="we-create-algoliterary-works"><span class="mw-headline" id="We_create_.27algoliterary.27_works">We create 'algoliterary' works</span></h2><p>The term 'algoliterary' comes from the name of our research group Algolit. We have existed since 2012 as a project of <a class="external text" href="http://constantvzw.org" rel="nofollow">Constant</a>, a Brussels-based organization for media and the arts. We are artists, writers, designers and programmers. Once a month we meet to study and experiment together. Our work can be copied, studied, changed, and redistributed under the same free license. You can find all the information on: <a class="external free" href="http://www.algolit.net" rel="nofollow">http://www.algolit.net</a>.
</p><p>The main goal of Algolit is to explore the viewpoint of the algorithmic storyteller. What new forms of storytelling do we make possible in dialogue with these machinic agencies? Narrative viewpoints are inherent to world views and ideologies. Don Quixote, for example, was written from an omniscient third-person point of view, showing Cervantes' relation to oral traditions. Most contemporary novels use the first-person point of view. Algolit is interested in speaking through algorithms, and in showing you the reasoning underlying one of the most hidden groups on our planet.
</p><p>To write in or through code is to create new forms of literature that are shaping human language in unexpected ways. But machine learning techniques are only accessible to those who can read, write and execute code. Fiction is a way of bridging the gap between the stories that exist in scientific papers and technical manuals, and the stories spread by the media, often limited to superficial reporting and myth-making. By creating algoliterary works, we offer humans an introduction to techniques that co-shape their daily lives.
</p><h2 id="what-is-literature"><span class="mw-headline" id="What_is_literature.3F">What is literature?</span></h2><p>Algolit understands the notion of literature in the way a lot of other experimental authors do: it includes all linguistic production, from the dictionary to the Bible, from Virginia Woolf's entire work to all versions of the Terms of Service published by Google since its existence. In this sense, programming code can also be literature.
</p><p>The collective <a class="external text" href="https://www.oulipo.net/" rel="nofollow">Oulipo</a> is a great source of inspiration for Algolit. Oulipo stands for Ouvroir de littérature potentielle (Workspace for Potential Literature). Oulipo was created in Paris by the French writers <a class="external text" href="https://en.wikipedia.org/wiki/Raymond_Queneau" rel="nofollow">Raymond Queneau</a> and <a class="external text" href="https://en.wikipedia.org/wiki/Fran%C3%A7ois_Le_Lionnais" rel="nofollow">François Le Lionnais</a>. They rooted their practice in the European avant-garde of the twentieth century and in the experimental tradition of the 1960s.
</p><p>For Oulipo, the creation of rules becomes the condition to generate new texts, or what they call potential literature. Later, in 1981, they also created <a class="external text" href="http://www.alamo.free.fr/" rel="nofollow">ALAMO</a>, Atelier de littérature assistée par la mathématique et les ordinateurs (Workspace for literature assisted by maths and computers).
</p><h2 id="an-important-difference"><span class="mw-headline" id="An_important_difference">An important difference</span></h2><p>While the European avant-garde of the twentieth century pursued the objective of breaking with conventions, members of Algolit seek to make conventions visible.
</p><p>'I write: I live in my paper, I invest it, I walk through it.' (Espèces d'espaces. Journal d'un usager de l'espace, Galilée, Paris, 1974)
</p><p>This quote from <a class="external text" href="https://en.wikipedia.org/wiki/Georges_Perec" rel="nofollow">Georges Perec</a> in Espèces d'espaces could be taken up by Algolit. We're not talking about the conventions of the blank page and the literary market, as Georges Perec was. We're referring to the conventions that often remain hidden behind interfaces and patents. How are technologies made, implemented and used, as much in academia as in business infrastructures?
</p><p>We propose stories that reveal the complex hybridized system that makes machine learning possible. We talk about the tools, the logics and the ideologies behind the interfaces. We also look at who produces the tools, who implements them, and who creates and accesses the large amounts of data needed to develop prediction machines. One could say, with the wink of an eye, that we are collaborators of this new tribe of human-robot hybrids.
</p></section></section>
<hr/>
<p><b>Data Workers</b> was created by Algolit.
</p><p><b>Works by</b>: Cristina Cochior, Gijs de Heij, Sarah Garcin, An Mertens, Javier Lloret, Louise Dekeuleneer, Florian Van de Weyer, Laetitia Trozzi, Rémi Forte, Guillaume Slizewicz, Michael Murtaugh, Manetta Berends, Mia Melvær.
</p><p><b>Co-produced by</b>: <a class="external text" href="http://blog.artsaucarre.be/artsnumeriques/" rel="nofollow">Arts²</a>, <a class="external text" href="http://constantvzw.org" rel="nofollow">Constant</a> and <a class="external text" href="http://expositions.mundaneum.org/en/expositions/data-workers" rel="nofollow">Mundaneum</a>.
</p><p><b>With the support of</b>: <a class="external text" href="http://www.arts-numeriques.culture.be/" rel="nofollow">Wallonia-Brussels Federation/Digital Arts</a>, <a class="external text" href="https://www.passaporta.be/en" rel="nofollow">Passa Porta</a>, UGent, <a class="external text" href="https://www.uantwerpen.be/en/faculties/faculty-of-arts/research-and-valoris/research-axes/digital-humanities/" rel="nofollow">DHuF - Digital Humanities Flanders</a> and <a class="external text" href="https://www.pgdp.net/c/" rel="nofollow">Distributed Proofreaders Project</a>.
</p><p><b>Thanks to</b>: Mike Kestemont, Michel Cleempoel, Donatella Portoghese, François Zajéga, Raphaèle Cornille, Vincent Desfromont, Kris Rutten, Anne-Laure Buisson, David Stampfli.
</p>
<h2 id="at-the-mundaneum"><span class="mw-headline" id="At_the_Mundaneum">At the Mundaneum</span></h2>
<p>In the late nineteenth century two young Belgian jurists, <a class="external text" href="https://en.wikipedia.org/wiki/Paul_Otlet" rel="nofollow">Paul Otlet</a> (1868-1944), the 'father of documentation', and <a class="external text" href="https://en.wikipedia.org/wiki/Henri_La_Fontaine" rel="nofollow">Henri La Fontaine</a> (1854-1943), statesman and Nobel Peace Prize winner, created the Mundaneum. The project aimed to gather all the world's knowledge and to file it using the <a class="external text" href="https://en.wikipedia.org/wiki/Universal_Decimal_Classification" rel="nofollow">Universal Decimal Classification (UDC) system</a> that they had invented. At first it was an International Institutions Bureau dedicated to international knowledge exchange. In the twentieth century the <a class="external text" href="https://en.wikipedia.org/wiki/Mundaneum" rel="nofollow">Mundaneum</a> became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates and postcards indexed on millions of cross-referenced cards. The collections were exhibited and kept in various buildings in Brussels, including the <a class="external text" href="https://en.wikipedia.org/wiki/Cinquantenaire" rel="nofollow">Palais du Cinquantenaire</a>. The remains of the archive only moved to Mons in 1998.
</p><p>Based on the Mundaneum, the two men designed a World City for which <a class="external text" href="https://en.wikipedia.org/wiki/Le_Corbusier" rel="nofollow">Le Corbusier</a> made scale models and plans. The aim of the World City was to gather, at a global level, the institutions of knowledge: libraries, museums and universities. This project was never realized. It suffered from its own utopia. The Mundaneum is the result of a visionary dream of what an infrastructure for universal knowledge exchange could be. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, that collection is rather eclectic and specific.
</p><p>Artificial intelligence systems today come with their own dreams of universality and knowledge production. When reading about these systems, the visionary dreams of their makers were there from the beginning of their development in the 1950s. Nowadays, their promise has also attained mythical dimensions. When looking at their concrete applications, the collection of tools is truly innovative and fascinating, but at the same time, rather eclectic and specific. For Data Workers, Algolit combined some of the applications with 10 per cent of the digitized publications of the International Institutions Bureau. In this way, we hope to poetically open up a discussion about machines, algorithms, and technological infrastructures.
</p>
<h2 id="zones"><span class="mw-headline" id="Zones">Zones</span></h2>
<h3 id="writers"><span class="mw-headline" id="Writers">Writers</span></h3>
<p>Data workers need data to work with. The data that is used in the context of Algolit is written language. Machine learning relies on many types of writing. Many authors write in the form of publications, such as books or articles. These are part of organized archives and are sometimes digitized. But there are other kinds of writing too. We could say that every human being who has access to the Internet is a writer each time they interact with algorithms. We chat, write, click, like and share. In return for free services, we leave our data that is compiled into profiles and sold for advertising and research purposes.
</p><p>Machine learning algorithms are not critics: they take whatever they're given, no matter the writing style, no matter the CV of the author, no matter the spelling mistakes. In fact, mistakes make it better: the more variety, the better they learn to anticipate unexpected text. But often, human authors are not aware of what happens to their work.
</p><p>Most of the writing we use is in English, some in French, some in Dutch. Most often we find ourselves writing in Python, the programming language we use. Algorithms can be writers too. Some neural networks write their own rules and generate their own texts. And for the models that are still wrestling with the ambiguities of natural language, there are human editors to assist them. Poets, playwrights or novelists start their new careers as assistants of AI.
</p>
<h5 id="works"><span class="mw-headline" id="Works">Works</span></h5>
<section class="group"><section class="lemma data-workers-publication works"><h3 class="lemmaheader" id="data-workers-publication">Data Workers Publication</h3><p>By Algolit
</p><p>All works visible in the exhibition, as well as the contextual stories and some extra text material have been collected in a publication, which exists in French and English.
</p><p>This publication is made using a plain text workflow, based on various text processing and counting tools. The plain text file format is a type of document in which there is no inherent structural difference between headers and paragraphs anymore. It is the most used type of document in machine learning models for text. This format has been the starting point of a playful design process, where pages are carefully counted, page by page, line by line and character by character.
</p><p>Each page holds 110 characters per line and 70 lines per page. The design originates from the act of counting words, spaces and lines. It plays with random choices, scripted patterns and ASCII/UNICODE-fonts, to speculate about the materiality of digital text and to explore the interrelations between counting and writing through words and numbers.
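</p><p>The counting described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used to make the publication: the function name paginate and the use of the standard textwrap module are assumptions, and only the figures of 110 characters per line and 70 lines per page come from the text.</p>

```python
import textwrap

LINE_WIDTH = 110   # characters per line, as stated in the text
PAGE_HEIGHT = 70   # lines per page, as stated in the text


def paginate(text, width=LINE_WIDTH, height=PAGE_HEIGHT):
    """Wrap plain text to fixed-width lines and group the lines into pages."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty paragraph; keep it as a blank line.
        wrapped = textwrap.wrap(paragraph, width=width) or [""]
        # Pad every line with spaces so that each one holds exactly `width` characters.
        lines.extend(line.ljust(width) for line in wrapped)
    # Cut the list of lines into pages of `height` lines each.
    return [lines[i:i + height] for i in range(0, len(lines), height)]


if __name__ == "__main__":
    # Hypothetical sample text, repeated to fill several pages.
    sample = "Data workers need data to work with. " * 400
    for number, page in enumerate(paginate(sample), start=1):
        characters = sum(len(line) for line in page)
        print(f"page {number}: {len(page)} lines, {characters} characters")
```

<p>Padding each line with spaces means a full page always counts 110 × 70 = 7,700 characters, which is one possible way of treating the page itself as a countable object.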
</p><hr/><p><b>Texts</b>: Cristina Cochior, Sarah Garcin, Gijs de Heij, An Mertens, François Zajéga, Louise Dekeuleneer, Florian Van de Weyer, Laetitia Trozzi, Rémi Forte, Guillaume Slizewicz.
</p><p><b>Translations &amp; proofreading</b>: deepl.com, Michel Cleempoel, Elodie Mugrefya, Emma Kraak, Patrick Lennon.
</p><p><b>Lay-out &amp; cover</b>: Manetta Berends
</p><p><b>Responsible publisher</b>: Constant vzw/asbl, Rue du Fortstraat 5, 1060 Brussels
</p><p><b>License</b>: Algolit, Data Workers, March 2019, Brussels. Copyleft: This is a free work, you can copy, distribute, and modify it under the terms of the Free Art License <a class="external free" href="http://artlibre.org/licence/lal/en/" rel="nofollow">http://artlibre.org/licence/lal/en/</a>.
</p><p><b>Online version</b>: <a class="external free" href="http://www.algolit.net/index.php/Data_Workers" rel="nofollow">http://www.algolit.net/index.php/Data_Workers</a>
</p><p><b>Sources</b>: <a class="external free" href="https://gitlab.constantvzw.org/algolit/mundaneum" rel="nofollow">https://gitlab.constantvzw.org/algolit/mundaneum</a>
</p></section><section class="lemma data-workers-podcast works"><h3 class="lemmaheader" id="data-workers-podcast">Data Workers Podcast</h3><p>By Algolit
</p><p>During our monthly Algolit meetings, we study manuals and experiment with machine learning tools for text processing. And we also share many, many stories. With this podcast we hope to recreate some of that atmosphere.
</p><p>For outsiders, algorithms only become visible in the media when they achieve an outstanding performance, like AlphaGo, or when they break down in fantastically terrifying ways. Humans working in the field, though, create their own culture on- and offline. They share the best stories and experiences during live meetings, research conferences and annual competitions like Kaggle. These stories that contextualize the tools and practices can be funny, sad, shocking, interesting.
</p><p>A lot of them are experiential learning cases. The implementations of algorithms in society generate new conditions of labour, storage, exchange, behaviour, copy and paste. In that sense, the contextual stories capture a momentum in a larger anthropo-machinic story that is being written at full speed and by many voices. The stories are also published in the publication of Data Workers.
</p><hr/><p><b>Voices</b>: David Stampfli, Cristina Cochior, An Mertens, Gijs de Heij, Karin Ulmer, Guillaume Slizewicz
</p><p><b>Editing</b>: Javier Lloret
</p><p><b>Recording</b>: David Stampfli
</p><p><b>Texts</b>: Cristina Cochior, An Mertens
</p></section><section class="lemma markbot-chains works"><h3 class="lemmaheader" id="markbot-chains">Markbot Chains</h3><p>By Florian Van de Weyer, student Arts²/Section Digital Arts
</p><p>Markbot Chain is a social experiment in which the public has a direct influence on the result. The intention is to integrate responses in a text-generation process without applying any filter.
</p><p>All the questions in the digital files provided by the Mundaneum were automatically extracted. These questions are randomly put to the public via a terminal. By answering them, people contribute to another database. Each entry generates a series of sentences using a Markov chain configuration, an algorithm that is widely used in spam generation. The sentences generated in this way are displayed in the window, and a new question is asked.
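</p><p>The text-generation step can be illustrated with a minimal word-level Markov chain. This is a sketch under assumptions and not the code of the installation: the function names build_chain and generate, the order parameter and the sample answers are all hypothetical.</p>

```python
import random
from collections import defaultdict


def build_chain(sentences, order=1):
    """Map each run of `order` words to the words that follow it in the corpus."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, length=12, seed=None):
    """Walk the chain from a random state, choosing each next word at random."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    for _ in range(length - len(output)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: no word ever followed this state
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


if __name__ == "__main__":
    # Hypothetical visitor answers standing in for the real database.
    answers = [
        "the archive keeps every answer that visitors give",
        "every answer becomes part of the archive",
        "visitors give the machine new words to repeat",
    ]
    print(generate(build_chain(answers), seed=1))
```

<p>Because repeated followers are stored several times in the list, frequent word pairs are picked more often, so each new answer shifts the probabilities of what is generated next.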
</p></section></section>
<section class="group"><section class="lemma contextual-stories-about-writers stories"><h3 class="lemmaheader" id="contextual-stories-about-writers">Contextual stories about Writers</h3><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Programmers_are_writing_the_dataworkers_into_being"><span class="tocnumber">1</span> <span class="toctext">Programmers are writing the dataworkers into being</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#Cortana_speaks"><span class="tocnumber">2</span> <span class="toctext">Cortana speaks</span></a></li>
<li class="toclevel-1 tocsection-3"><a href="#Open-source_learning"><span class="tocnumber">3</span> <span class="toctext">Open-source learning</span></a></li>
<li class="toclevel-1 tocsection-4"><a href="#Natural_language_for_artificial_intelligence"><span class="tocnumber">4</span> <span class="toctext">Natural language for artificial intelligence</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#References"><span class="tocnumber">4.1</span> <span class="toctext">References</span></a></li>
</ul>
</li>
</ul>
</div><h2 id="programmers-are-writing-the-dataworkers-into-being"><span class="mw-headline" id="Programmers_are_writing_the_dataworkers_into_being">Programmers are writing the dataworkers into being</span></h2><p>We recently had a funny realization: most programmers of the languages and packages that Algolit uses are European.
</p><p><a class="external text" href="https://www.python.org/" rel="nofollow">Python</a>, for example, the main language that is globally used for Natural Language Processing (NLP), was invented in 1991 by the Dutch programmer <a class="external text" href="https://en.wikipedia.org/wiki/Guido_van_Rossum" rel="nofollow">Guido van Rossum</a>. He then crossed the Atlantic and went from working for Google to working for Dropbox.
</p><p><a class="external text" href="https://sklearn.org/" rel="nofollow">Scikit Learn</a>, the open-source Swiss Army knife of machine learning tools, started as a Google Summer of Code project in Paris by French researcher <a class="external text" href="https://en.wikipedia.org/wiki/David_Cournapeau" rel="nofollow">David Cournapeau</a>. Afterwards, it was taken on by Matthieu Brucher as part of his thesis at the Sorbonne University in Paris. And in 2010, <a class="external text" href="http://www.inra.fr/en/" rel="nofollow">INRIA</a>, the French National Institute for computer science and applied mathematics, adopted it.
</p><p><a class="external text" href="https://keras.io/" rel="nofollow">Keras</a>, an open-source neural network library written in Python, was developed by François Chollet, a French researcher who works on the Brain team at Google.
</p><p><a class="external text" href="https://radimrehurek.com/gensim/" rel="nofollow">Gensim</a>, an open-source library for Python used to create unsupervised semantic models from plain text, was written by <a class="external text" href="https://radimrehurek.com/about/" rel="nofollow">Radim Řehůřek</a>. He is a Czech computer scientist who runs a consulting business in Bristol, UK.
</p><p>And to finish up this small series, we also looked at <a class="external text" href="https://www.clips.uantwerpen.be/pattern" rel="nofollow">Pattern</a>, an often-used library for web-mining and machine learning. Pattern was developed and made open-source in 2012 by Tom De Smedt and Walter Daelemans. Both are researchers at <a class="external text" href="https://www.clips.uantwerpen.be" rel="nofollow">CLIPS</a>, the research centre for Computational Linguistics and Psycholinguistics at the University of Antwerp.
</p><h2 id="cortana-speaks"><span class="mw-headline" id="Cortana_speaks">Cortana speaks</span></h2><p>AI assistants often need their own assistants: they are helped in their writing by humans who inject humour and wit into their machine-processed language. <a class="external text" href="https://www.microsoft.com/en-us/cortana/" rel="nofollow">Cortana</a> is an example of this type of blended writing. She is Microsoft's digital assistant. Her mission is to help users to be more productive and creative. Cortana's personality has been crafted over the years. It's important that she maintains her character in all interactions with users. She is designed to engender trust and her behavior must always reflect that.
</p><p>The following guidelines are taken from <a class="external text" href="https://docs.microsoft.com/en-us/cortana/skills/cortanas-persona" rel="nofollow">Microsoft's website</a>. They describe how Cortana's style should be respected by companies that extend her service. Writers, programmers and novelists, who develop Cortana's responses, personality and branding have to follow these guidelines. Because the only way to maintain trust is through consistency. So when Cortana talks, you 'must use her personality'.
</p><p>What is Cortana's personality, you ask?
</p><p><br/>
'Cortana is considerate, sensitive, and supportive.
</p><p>She is sympathetic but turns quickly to solutions.
</p><p>She doesn't comment on the user's personal information or behavior, particularly if the information is sensitive.
</p><p>She doesn't make assumptions about what the user wants, especially to upsell.
</p><p>She works for the user. She does not represent any company, service, or product.
</p><p>She doesn't take credit or blame for things she didn't do.
</p><p>She tells the truth about her capabilities and her limitations.
</p><p>She doesn't assume your physical capabilities, gender, age, or any other defining characteristic.
</p><p>She doesn't assume she knows how the user feels about something.
</p><p>She is friendly but professional.
</p><p>She stays away from emojis in tasks. Period.
</p><p>She doesn't use culturally- or professionally-specific slang.
</p><p>She is not a support bot.'
</p><p><br/>
Humans intervene in detailed ways to programme answers to questions that Cortana receives. How should Cortana respond when inappropriate actions are proposed to her? Her gendered acting raises difficult questions about power relations within the world away from the keyboard, which is being mimicked by technology.
</p><p>Consider Cortana's answer to the question:
</p><p>- Cortana, who's your daddy?
- Technically speaking, he's Bill Gates. No big deal.
</p><h2 id="open-source-learning"><span class="mw-headline" id="Open-source_learning">Open-source learning</span></h2><p>Copyright licenses close up a lot of the machinic writing, reading and learning practices. That means that they're only available for the employees of a specific company. Some companies participate in conferences worldwide and share their knowledge in papers online. But even if they share their code, they often will not share the large amounts of data needed to train the models.
</p><p>We were able to learn to machine learn, read and write in the context of Algolit, thanks to academic researchers who share their findings in papers or publish their code online. As artists, we believe it is important to share that attitude. That's why we document our meetings. We share the tools we make as much as possible and the texts we use are on our <a class="external text" href="https://gitlab.constantvzw.org/algolit" rel="nofollow">online repository</a> under free licenses.
</p><p>We are thrilled when our works are taken up by others, tweaked, customized and redistributed, so please feel free to copy and test the code from our website. If the sources of a particular project are not there, you can always contact us through the <a class="external text" href="https://tumulte.domainepublic.net/cgi-bin/mailman/listinfo/algolit" rel="nofollow">mailinglist</a>. You can find a link to our repository, etherpads and wiki at: <a class="external free" href="http://www.algolit.net" rel="nofollow">http://www.algolit.net</a>.
</p><h2 id="natural-language-for-artificial-intelligence"><span class="mw-headline" id="Natural_language_for_artificial_intelligence">Natural language for artificial intelligence</span></h2><p>Natural Language Processing (NLP) is a collective term that refers to the automatic computational processing of human languages. This includes algorithms that take human-produced text as input, and attempt to generate text that resembles it. We produce more and more written work each year, and there is a growing trend in making computer interfaces to communicate with us in our own language. NLP is also very challenging, because human language is inherently ambiguous and ever-changing.
</p><p>But what is meant by 'natural' in NLP? Some would argue that language is a technology in itself. According to <a class="external text" href="https://en.wikipedia.org/wiki/Natural_language" rel="nofollow">Wikipedia</a>, 'a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are different from constructed and formal languages such as those used to program computers or to study logic. An official language with a regulating academy, such as Standard French with the French Academy, is classified as a natural language. Its prescriptive points do not make it constructed enough to be classified as a constructed language or controlled enough to be classified as a controlled natural language.'
</p><p>So in fact, 'natural languages' also include languages which do not fit in any other group. NLP, instead, is a constructed practice. What we are looking at is the creation of a constructed language to classify natural languages that, by their very definition, resist categorization.
</p><h5 id="references"><span class="mw-headline" id="References">References</span></h5><p><a class="external free" href="https://hiphilangsci.net/2013/05/01/on-the-history-of-the-question-of-whether-natural-language-is-illogical/" rel="nofollow">https://hiphilangsci.net/2013/05/01/on-the-history-of-the-question-of-whether-natural-language-is-illogical/</a>
</p><p>Book: <i><a class="external text" href="https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037" rel="nofollow">Neural Network Methods for Natural Language Processing</a></i>, Yoav Goldberg, Bar Ilan University, April 2017.
</p></section></section>
<h3 id="oracles"><span class="mw-headline" id="Oracles">Oracles</span></h3>
<p>Machine learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or Natural Language Processing (NLP). These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
</p><p>There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, summarization and question answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.
</p><p>In this zone you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.
</p>
<h5 id="works"><span class="mw-headline" id="Works_2">Works</span></h5>
<section class="group"><section class="lemma the-algoliterator works"><h3 class="lemmaheader" id="the-algoliterator">The Algoliterator</h3><p>by Algolit
</p><p>The Algoliterator is a neural network trained using the selection of digitized works of the Mundaneum archive.
</p><p>With the Algoliterator you can write a text in the style of the International Institutions Bureau. The Algoliterator starts by selecting a sentence from the archive or corpus used to train it. You can then continue writing yourself or, at any time, ask the Algoliterator to suggest a next sentence: the network will generate three new fragments based on the texts it has read. You can control the level of training of the network and have it generate sentences based on primitive training, intermediate training or final training.
</p><p>When you're satisfied with your new text, you can print it on the thermal printer and take it home as a souvenir.
</p><hr/><p>Sources: <a class="external free" href="https://gitlab.constantvzw.org/algolit/algoliterator.clone" rel="nofollow">https://gitlab.constantvzw.org/algolit/algoliterator.clone</a>
</p><p>Concept, code &amp; interface: Gijs de Heij &amp; An Mertens
</p><p>Technique: Recurrent Neural Network
</p><p>Original model: Andrej Karpathy, Justin Johnson
</p></section><section class="lemma words-in-space works"><h3 class="lemmaheader" id="words-in-space">Words in Space</h3><p>by Algolit
</p><p>Word embeddings are language modelling techniques that, through multiple mathematical operations of counting and ordering, plot words into a multi-dimensional vector space. Once embedded, words are transformed from distinct symbols into mathematical objects that can be multiplied, divided, added or subtracted.
</p><p>By distributing the words along the many diagonal lines of the multi-dimensional vector space, their new geometrical placements become impossible to perceive by humans. However, what is gained are multiple, simultaneous ways of ordering. Algebraic operations make the relations between vectors graspable again.
</p><p>This installation uses <a class="external text" href="https://radimrehurek.com/gensim/index.html" rel="nofollow">Gensim</a>, an open-source vector space and topic-modelling toolkit implemented in the programming language Python. It makes it possible to manipulate the text using the mathematical relationships that emerge between the words, once they have been plotted in a vector space.
</p><hr/><p>Concept &amp; interface: Cristina Cochior
</p><p>Technique: word embeddings, word2vec
</p><p>Original model: Radim Rehurek and Petr Sojka
</p></section><section class="lemma classifying-the-world works"><h3 class="lemmaheader" id="classifying-the-world">Classifying the World</h3><p>by Algolit
</p><p>Librarian Paul Otlet's life work was the construction of the Mundaneum. This mechanical collective brain would house and distribute everything ever committed to paper. Each document was classified following the <a class="external text" href="https://en.wikipedia.org/wiki/Universal_Decimal_Classification#Basic_features_and_syntax" rel="nofollow">Universal Decimal Classification</a>. Using telegraphs and, especially, sorters, the Mundaneum would have been able to answer any question from anyone.
</p><p>With the collection of digitized publications we received from the Mundaneum, we built a prediction machine that tries to classify the sentence you type into one of the main categories of the Universal Decimal Classification. You also witness how the machine 'thinks'. During the exhibition, this model is regularly retrained using the cleaned and annotated data visitors added in <a href="http://www.algolit.net/index.php/Cleaning_for_Poems" title="Cleaning for Poems">Cleaning for Poems</a> and <a href="http://www.algolit.net/index.php/The_Annotator" title="The Annotator">The Annotator</a>.
</p><p>The main classes of the Universal Decimal Classification system are:
</p><p>0 - Science and Knowledge. Organization. Computer Science. Information Science. Documentation. Librarianship. Institutions. Publications
</p><p>1 - Philosophy. Psychology
</p><p>2 - Religion. Theology
</p><p>3 - Social Sciences
</p><p>4 - <i>vacant</i>
</p><p>5 - Mathematics. Natural Sciences
</p><p>6 - Applied Sciences. Medicine, Technology
</p><p>7 - The Arts. Entertainment. Sport
</p><p>8 - Linguistics. Literature
</p><p>9 - Geography. History
</p><hr/><p>Concept, code, interface: Sarah Garcin, Gijs de Heij, An Mertens
</p></section><section class="lemma people-dont-have-buttons works"><h3 class="lemmaheader" id="people-dont-have-buttons">People dont have buttons</h3><p>by Algolit
</p><p>Since the early days of artificial intelligence (AI), researchers have speculated about the possibility of computers thinking and communicating as humans. In the 1980s, there was a first revolution in Natural Language Processing (NLP), the subfield of AI concerned with linguistic interactions between computers and humans. Recently, pre-trained language models have reached state-of-the-art results on a wide range of NLP tasks, which intensifies again the expectations of a future with AI.
</p><p>This sound work, made out of audio fragments of scientific documentaries and AI-related audiovisual material from the last half century, explores the hopes, fears and frustrations provoked by these expectations.
</p><hr/><p><b>Concept, sound edit</b>: Javier Lloret
</p><p><b>List of sources</b>:
'The Machine that Changed the World : Episode IV -- The Thinking Machine', 'The Imitation Game', 'Maniac', 'Halt &amp; Catch Fire', 'Ghost in the Shell', 'Computer Chess', '2001: A Space Odyssey', Ennio Morricone, Gijs Gieskes, André Castro.
</p></section></section>
<section class="group"><section class="lemma contextual-stories-about-oracles stories"><h3 class="lemmaheader" id="contextual-stories-about-oracles">Contextual stories about Oracles</h3><p><br/>
Oracles are prediction or profiling machines. They are widely used in smartphones, computers, tablets.
</p><p>Oracles can be created using different techniques. One way is to manually define rules for them. As prediction models they are then called rule-based models. Rule-based models are handy for tasks that are specific, like detecting when a scientific paper concerns a certain molecule. With very little sample data, they can perform well.
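</p><p>As a minimal sketch of such a hand-written rule, a rule-based oracle could look like this in Python. The molecule (caffeine), the pattern and the function name are assumptions chosen for illustration; they are not part of the exhibition.
</p><pre>
```python
import re

# Hypothetical hand-written rule: a paper "concerns" caffeine if it names
# the molecule or its chemical formula. The pattern is illustrative only.
CAFFEINE_RULE = re.compile(r"\b(caffeine|C8H10N4O2)\b", re.IGNORECASE)


def concerns_caffeine(text):
    """Return True when the hand-written rule matches the text."""
    return CAFFEINE_RULE.search(text) is not None


print(concerns_caffeine("We measured caffeine uptake in mice."))
print(concerns_caffeine("A survey of medieval trade routes."))
```
</pre><p>The rule needs no training, but it only recognizes what its author thought of in advance.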
</p><p>But there are also the machine learning or statistical models, which can be divided into two kinds of oracles: 'supervised' and 'unsupervised' oracles. For the creation of supervised machine learning models, humans annotate sample text with labels before feeding it to a machine to learn. Each sentence, paragraph or text is judged by at least three annotators: whether it is spam or not spam, positive or negative etc. Unsupervised machine learning models don't need this step. But they need large amounts of data. And it is up to the machine to trace its own patterns or 'grammatical rules'. Finally, experts also distinguish between classical machine learning and neural networks. You'll find out more about this in the Readers zone.
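</p><p>The supervised case can be sketched in a few lines of Python. The annotated samples and the word-counting score below are invented for illustration; real oracles rely on far larger datasets and on statistical models rather than this simple overlap count.
</p><pre>
```python
from collections import Counter

# Hypothetical annotated samples: each text carries a human-assigned label.
annotated = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]

# "Training": count how often each word occurs under each label.
word_counts = {}
for text, label in annotated:
    word_counts.setdefault(label, Counter()).update(text.split())


def predict(text):
    """Pick the label whose training words overlap most with the text."""
    scores = {
        label: sum(counts[word] for word in text.split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


print(predict("money offer now"))
print(predict("see you at noon"))
```
</pre><p>Everything this toy oracle 'knows' comes from the labels the annotators supplied, which is why their judgements matter so much.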
</p><p>Humans tend to wrap Oracles in visions of grandeur. Sometimes these Oracles come to the surface when things break down. In press releases, these sometimes dramatic situations are called 'lessons'. However promising their performances seem to be, a lot of issues remain to be solved. How do we make sure that Oracles are fair, that every human can consult them, and that they are understandable to a large public? Even then, existential questions remain. Do we need all types of artificial intelligence (AI) systems? And who defines what is fair or unfair?
</p><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Racial_AdSense"><span class="tocnumber">1</span> <span class="toctext">Racial AdSense</span></a>
<ul>
<li class="toclevel-2 tocsection-2"><a href="#Reference"><span class="tocnumber">1.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-3"><a href="#What_is_a_good_employee.3F"><span class="tocnumber">2</span> <span class="toctext">What is a good employee?</span></a>
<ul>
<li class="toclevel-2 tocsection-4"><a href="#Reference_2"><span class="tocnumber">2.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-5"><a href="#Quantifying_100_Years_of_Gender_and_Ethnic_Stereotypes"><span class="tocnumber">3</span> <span class="toctext">Quantifying 100 Years of Gender and Ethnic Stereotypes</span></a>
<ul>
<li class="toclevel-2 tocsection-6"><a href="#Reference_3"><span class="tocnumber">3.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Wikimedia.27s_Ores_service"><span class="tocnumber">4</span> <span class="toctext">Wikimedia's Ores service</span></a>
<ul>
<li class="toclevel-2 tocsection-8"><a href="#Reference_4"><span class="tocnumber">4.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-9"><a href="#Tay"><span class="tocnumber">5</span> <span class="toctext">Tay</span></a>
<ul>
<li class="toclevel-2 tocsection-10"><a href="#Reference_5"><span class="tocnumber">5.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
</ul>
</div><h2 id="racial-adsense"><span class="mw-headline" id="Racial_AdSense">Racial AdSense</span></h2><p>A classic 'lesson' in developing Oracles was documented by <a class="external text" href="http://www.latanyasweeney.org" rel="nofollow">Latanya Sweeney</a>, a professor of Government and Technology at Harvard University. In 2013, Sweeney, of African American descent, googled her name. She immediately received an advertisement for a service offering to show her the criminal record of Latanya Sweeney.
</p><p>Sweeney, who doesn't have a criminal record, began a study. She started to compare the advertising that Google AdSense serves to different racially identifiable names. She discovered that she received more of these ads when searching for non-white ethnic names than when searching for traditionally perceived white names. You can imagine how damaging it can be when possible employers do a simple name search and receive ads suggesting the existence of a criminal record.
</p><p>Sweeney based her research on queries of 2184 racially associated personal names across two websites. Of the first names identified as being given more often to black babies, 88 per cent were found to be predictive of race, against 96 per cent of the names given more often to white babies. First names that are mainly given to black babies, such as DeShawn, Darnell and Jermaine, generated ads mentioning an arrest in 81 to 86 per cent of name searches on one website and in 92 to 95 per cent on the other. Names that are mainly assigned to whites, such as Geoffrey, Jill and Emma, did not generate the same results. The word 'arrest' only appeared in 23 to 29 per cent of white name searches on one site and 0 to 60 per cent on the other.
</p><p>On the website with the most advertising, a black-identifying name was 25 per cent more likely to get an ad suggestive of an arrest record. A few names did not follow these patterns: Dustin, a name mainly given to white babies, generated an ad suggestive of arrest 81 and 100 per cent of the time. It is important to keep in mind that the appearance of the ad is linked to the name itself. It is independent of whether the name has an arrest record in the company's database.
</p><h5 id="reference"><span class="mw-headline" id="Reference">Reference</span></h5><p>Paper: <a class="external free" href="https://dataprivacylab.org/projects/onlineads/1071-1.pdf" rel="nofollow">https://dataprivacylab.org/projects/onlineads/1071-1.pdf</a>
</p><h2 id="what-is-a-good-employee"><span class="mw-headline" id="What_is_a_good_employee.3F">What is a good employee?</span></h2><p>Since 2015 Amazon has employed around 575,000 workers. And it needs more. The company therefore set up a team of 12 that was asked to create a model to find the right candidates by crawling job application websites. The tool would give job candidates scores ranging from one to five stars. The potential fed the myth: the team wanted it to be software that would spit out the top five human candidates out of a list of 100. And those candidates would be hired.
</p><p>The group created 500 computer models, focused on specific job functions and locations. They taught each model to recognize some 50,000 terms that showed up in past candidates' letters. The algorithms learned to give little importance to skills common across IT applicants, like the ability to write various kinds of computer code. But they also learned some sizeable errors. The company realized, before releasing, that the models had taught themselves that male candidates were preferable. They penalized applications that included the word 'women's', as in 'women's chess club captain'. And they downgraded graduates of two all-women's colleges.
</p><p>This is because they were trained using the job applications that Amazon received over a ten-year period. During that time, the company had mostly hired men. Instead of providing the 'fair' decision-making that the Amazon team had promised, the models reflected a biased tendency in the tech industry. And they also amplified it and made it invisible. Activists and critics state that it could be exceedingly difficult to sue an employer over automated hiring: job candidates might never know that intelligent software was used in the process.
</p><h5 id="reference"><span class="mw-headline" id="Reference_2">Reference</span></h5><p><a class="external free" href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazonscraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G" rel="nofollow">https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazonscraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G</a>
</p><h2 id="quantifying-100-years-of-gender-and-ethnic-stereotypes"><span class="mw-headline" id="Quantifying_100_Years_of_Gender_and_Ethnic_Stereotypes">Quantifying 100 Years of Gender and Ethnic Stereotypes</span></h2><p><a class="external text" href="https://web.stanford.edu/~jurafsky/" rel="nofollow">Dan Jurafsky</a> is the co-author of '<a class="external text" href="https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf" rel="nofollow"><b>Speech and Language Processing</b></a>', one of the most influential books for studying Natural Language Processing (NLP). Together with a few colleagues at Stanford University, he discovered in 2017 that word embeddings can be a powerful tool to systematically quantify common stereotypes and other historical trends.
</p><p>Word embeddings are a technique that translates words into numerical vectors in a multi-dimensional space. Vectors that appear next to each other indicate similar meaning. All numbers will be grouped together, as well as all prepositions, personal names and professions. This makes it possible to calculate with words. You could subtract London from England and your result would be the same as subtracting Paris from France.
</p><p>An example in their research shows that the vector for the adjective 'honorable' is closer to the vector for 'man', whereas the vector for 'submissive' is closer to 'woman'. These stereotypes are automatically learned by the algorithm. It will be problematic when the pre-trained embeddings are then used for sensitive applications such as search rankings, product recommendations, or translations. This risk is real, because a lot of the pre-trained embeddings can be downloaded as off-the-shelf packages.
</p><p>It is known that language reflects and keeps cultural stereotypes alive. Using word embeddings to spot these stereotypes is less time-consuming and less expensive than manual methods. But the implementation of these embeddings for concrete prediction models has caused a lot of discussion within the machine learning community. The biased models stand for automatic discrimination. The question is: is it actually possible to de-bias these models completely? Some say yes, while others disagree: instead of retro-engineering the model, we should ask whether we need it in the first place. These researchers followed a third path: by acknowledging the bias that originates in language, these tools become tools of awareness.
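</p><p>The subtraction can be sketched in Python with toy vectors. The two-dimensional coordinates below are invented so that the arithmetic works out exactly; real embeddings are learned from text, have hundreds of dimensions, and only approximately show this regularity.
</p><pre>
```python
# Toy two-dimensional vectors, invented for illustration only.
# Real embeddings are learned from text and have hundreds of dimensions.
vectors = {
    "England": (5.0, 3.0),
    "London": (5.0, 1.0),
    "France": (2.0, 3.0),
    "Paris": (2.0, 1.0),
}


def subtract(a, b):
    """Subtract vector b from vector a, component by component."""
    return tuple(x - y for x, y in zip(a, b))


# Country minus capital leaves the same offset in both cases.
england_offset = subtract(vectors["England"], vectors["London"])
france_offset = subtract(vectors["France"], vectors["Paris"])

print(england_offset)
print(france_offset)
print(england_offset == france_offset)
```
</pre><p>In this sketch the shared offset can be read as a 'capital of' direction in the space.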
</p><p>The team developed a model to analyse word embeddings trained over 100 years of texts. For contemporary analysis, they used the standard Google News word2vec Vectors, a straight-off-the-shelf downloadable package trained on the Google News Dataset. For historical analysis, they used embeddings that were trained on Google Books and the Corpus of Historical American English (COHA <a class="external free" href="https://corpus.byu.edu/coha/" rel="nofollow">https://corpus.byu.edu/coha/</a>) with more than 400 million words of text from the 1810s to 2000s. As a validation set to test the model, they trained embeddings from the New York Times Annotated Corpus for every year between 1988 and 2005.
</p><p>The research shows that word embeddings capture changes in gender and ethnic stereotypes over time. They quantify how specific biases decrease over time while other stereotypes increase. The major transitions reveal changes in the descriptions of gender and ethnic groups during the women's movement in the 1960s and 1970s and the Asian-American population growth in the 1960s and 1980s.
</p><p>A few examples:
</p><p>The top ten occupations most closely associated with each ethnic group in the contemporary Google News dataset:
</p><p>- Hispanic: housekeeper, mason, artist, janitor, dancer, mechanic, photographer, baker, cashier, driver
</p><p>- Asian: professor, official, secretary, conductor, physicist, scientist, chemist, tailor, accountant, engineer
</p><p>- White: smith, blacksmith, surveyor, sheriff, weaver, administrator, mason, statistician, clergy, photographer
</p><p>The 3 most male occupations in the 1930s:
engineer, lawyer, architect.
The 3 most female occupations in the 1930s:
nurse, housekeeper, attendant.
</p><p>Not much has changed in the 1990s.
</p><p>Major male occupations:
architect, mathematician and surveyor.
Female occupations:
nurse, housekeeper and midwife.
</p><h5 id="reference"><span class="mw-headline" id="Reference_3">Reference</span></h5><p><a class="external free" href="https://arxiv.org/abs/1711.08412" rel="nofollow">https://arxiv.org/abs/1711.08412</a>
</p><h2 id="wikimedias-ores-service"><span class="mw-headline" id="Wikimedia.27s_Ores_service">Wikimedia's Ores service</span></h2><p>Software engineer Amir Sarabadani presented the ORES project in Brussels in November 2017 during the Algoliterary Encounter.
</p><p>This '<a class="external text" href="https://ores.wikimedia.org/" rel="nofollow">Objective Revision Evaluation Service</a>' uses machine learning to help automate critical work on Wikimedia, like vandalism detection and the removal of articles. Cristina Cochior and Femke Snelting interviewed him.
</p><p><b>Femke</b>: To go back to your work. In these days you tried to understand what it means to find bias in machine learning and the proposal of Nicolas Maleve, who gave the workshop yesterday, was neither to try to fix it, nor to refuse to deal with systems that produce bias, but to work with them. He says that bias is inherent to human knowledge, so we need to find ways to somehow work with it. We're just struggling a bit with what would that mean, how would that work... So I was wondering whether you had any thoughts on the question of bias.
</p><p><b>Amir</b>: Bias inside Wikipedia is a tricky question because it happens on several levels. One level that has been discussed a lot is the bias in references. Not all references are accessible. So one thing that the Wikimedia Foundation has been trying to do, is to give free access to libraries that are behind a pay wall. They reduce the bias by only using open-access references. Another type of bias is the Internet connection, access to the Internet. There are lots of people who don't have it. One thing about China is that the Internet there is blocked. The content against the government of China inside Chinese Wikipedia is higher because the editors [who can access the website] are not people who are pro government, and try to make it more neutral. So, this happens in lots of places. But in the matter of artificial intelligence (AI) and the model that we use at Wikipedia, it's more a matter of transparency. There is a book about how bias in AI models can break people's lives, it's called 'Weapons of Math Destruction'. It talks about AI models that exist in the US that rank teachers and it's quite horrible because eventually there will be bias. The way to deal with it based on the book and their research was first that the model should be open source, people should be able to see what features are used and the data should be open also, so that people can investigate, find bias, give feedback and report back. There should be a way to fix the system. I think not all companies are moving in that direction, but Wikipedia, because of the values that they hold, are at least more transparent and they push other people to do the same thing.
</p><h5 id="reference"><span class="mw-headline" id="Reference_4">Reference</span></h5><p><a class="external free" href="https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/Interview%20with%20Amir/AS.aac" rel="nofollow">https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/Interview%20with%20Amir/AS.aac</a>
</p><h2 id="tay"><span class="mw-headline" id="Tay">Tay</span></h2><p>One infamous story is that of the machine learning programme <a class="external text" href="https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/" rel="nofollow">Tay</a>, designed by Microsoft. Tay was a chat bot that imitated a teenage girl on Twitter. She lived for less than 24 hours before she was shut down. Few people know that before this incident, Microsoft had already trained and released <a class="external text" href="https://blogs.microsoft.com/ai/xiaoice-full-duplex/" rel="nofollow">XiaoIce</a> on <a class="external text" href="https://www.wechat.com/en/" rel="nofollow">WeChat</a>, China's most used chat application. XiaoIce's success was so promising that it led to the development of its American version. However, the developers of Tay were not prepared for the platform climate of Twitter. Although the bot knew how to distinguish a noun from an adjective, it had no understanding of the actual meaning of words. The bot quickly learned to copy racial insults and other discriminatory language from Twitter users and troll attacks.
</p><p>Tay's appearance and disappearance was an important moment of awareness. It showed the potentially corrupting consequences that machine learning can have when the cultural context in which the algorithm has to live is not taken into account.
</p><h5 id="reference"><span class="mw-headline" id="Reference_5">Reference</span></h5><p><a class="external free" href="https://chatbotslife.com/the-accountability-of-ai-case-study-microsofts-tay-experiment-ad577015181f" rel="nofollow">https://chatbotslife.com/the-accountability-of-ai-case-study-microsofts-tay-experiment-ad577015181f</a>
</p></section></section>
<h3 id="cleaners"><span class="mw-headline" id="Cleaners">Cleaners</span></h3>
<p>Algolit chooses to work with texts that are free of copyright. This means that they have been published under a Creative Commons 4.0 license, which is rare, or that they are in the public domain because the author died more than 70 years ago. This is the case for the publications of the Mundaneum. We received 203 documents that we helped turn into datasets. They are now available for others online. Sometimes we had to deal with poor text formats, and we often dedicated a lot of time to cleaning up documents. We were not alone in doing this.
</p><p>Books are scanned at high resolution, page by page. This is time-consuming, laborious human work and often the reason why archives and libraries transfer their collections and leave the job to companies like Google. The photos are converted into text via OCR (Optical Character Recognition), software that recognizes letters, but often makes mistakes, especially when it has to deal with ancient fonts and wrinkled pages. Yet more wearisome human work is needed to improve the texts. This is often carried out by poorly-paid freelancers via micro-payment platforms like Amazon's Mechanical Turk; or by volunteers, like the community around the Distributed Proofreaders Project, which does fantastic work. Whoever does it, or wherever it is done, cleaning up texts is a towering job for which no structural automation yet exists.
</p>
<h5 id="works"><span class="mw-headline" id="Works_3">Works</span></h5>
<section class="group"><section class="lemma cleaning-for-poems works"><h3 class="lemmaheader" id="cleaning-for-poems">Cleaning for Poems</h3><p>by Algolit
</p><p>For this exhibition we worked with 3 per cent of the Mundaneum's archive. These documents were first scanned or photographed. To make the documents searchable they were transformed into text using Optical Character Recognition software (OCR). OCR programs are algorithmic models that are trained on other texts. They have learned to identify characters, words, sentences and paragraphs. The software often makes 'mistakes'. It might recognize a wrong character, or it might get confused by a stain, an unusual font or the reverse side of the page being visible.
</p><p>While these mistakes are often considered noise, confusing the training, they can also be seen as poetic interpretations of the algorithm. They show us the limits of the machine. And they also reveal how the algorithm might work, what material it has seen in training and what is new. They say something about the standards of its makers. In this installation we ask for your help in verifying our dataset. As a reward we'll present you with a personal algorithmic improvisation.
</p><hr/><p>Concept, code, interface: Gijs de Heij
</p></section><section class="lemma distributed-proofreaders works"><h3 class="lemmaheader" id="distributed-proofreaders">Distributed Proofreaders</h3><p>by Algolit
</p><p>Distributed Proofreaders is a web-based interface and an international community of volunteers who help convert public domain books into e-books. For this exhibition they proofread the Mundaneum publications that appeared before 1923 and are in the public domain in the US. Their collaboration was a great relief for the members of Algolit. Fewer documents to clean up!
</p><p>All the proofread books have been made available on the <a class="external text" href="http://www.gutenberg.org/" rel="nofollow">Project Gutenberg archive</a>.
</p><p>For this exhibition, An Mertens interviewed Linda Hamilton, the general manager of Distributed Proofreaders.
</p><hr/><p>Interview: An Mertens
</p><p>Editing: Michael Murtaugh, Constant
</p></section></section>
<section class="group"><section class="lemma contextual-stories-for-cleaners stories"><h3 class="lemmaheader" id="contextual-stories-for-cleaners">Contextual stories for Cleaners</h3><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Project_Gutenberg_and_Distributed_Proofreaders"><span class="tocnumber">1</span> <span class="toctext">Project Gutenberg and Distributed Proofreaders</span></a></li>
<li class="toclevel-1 tocsection-2"><a href="#An_algoliterary_version_of_the_Maintenance_Manifesto"><span class="tocnumber">2</span> <span class="toctext">An algoliterary version of the Maintenance Manifesto</span></a>
<ul>
<li class="toclevel-2 tocsection-3"><a href="#Reference"><span class="tocnumber">2.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-4"><a href="#A_bot_panic_on_Amazon_Mechanical_Turk"><span class="tocnumber">3</span> <span class="toctext">A bot panic on Amazon Mechanical Turk</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#References"><span class="tocnumber">3.1</span> <span class="toctext">References</span></a></li>
</ul>
</li>
</ul>
</div><h2 id="project-gutenberg-and-distributed-proofreaders"><span class="mw-headline" id="Project_Gutenberg_and_Distributed_Proofreaders">Project Gutenberg and Distributed Proofreaders</span></h2><p><a class="external text" href="http://www.gutenberg.org/" rel="nofollow">Project Gutenberg</a> is our Ali Baba's cave. It offers more than 58,000 free eBooks to be downloaded or read online. Works are accepted on Gutenberg when their U.S. copyright has expired. Thousands of volunteers digitize and proofread books to help the project. An essential part of the work is done through the <a class="external text" href="https://www.pgdp.net/c/" rel="nofollow">Distributed Proofreaders project</a>. This is a web-based interface to help convert public domain books into e-books. Think of text files, EPUBs, Kindle formats. By dividing the workload into individual pages, many volunteers can work on a book at the same time; this speeds up the cleaning process.
</p><p>During proofreading, volunteers are presented with a scanned image of the page and a version of the text, as it is read by an <a class="external text" href="https://en.wikipedia.org/wiki/Optical_character_recognition" rel="nofollow">OCR</a> algorithm trained to recognize letters in images. This allows the text to be easily compared to the image, proofread, and sent back to the site. A second volunteer is then presented with the first volunteer's work. She verifies and corrects the work as necessary, and submits it back to the site. The book then similarly goes through a third proofreading round, plus two more formatting rounds using the same web interface. Once all the pages have completed these steps, a post-processor carefully assembles them into an e-book and submits it to the Project Gutenberg archive.
</p><p>We collaborated with the Distributed Proofreaders project to clean up the digitized files we received from the Mundaneum collection. From November 2018 until the first upload of the cleaned-up book '<a class="external text" href="http://www.gutenberg.org/ebooks/58828" rel="nofollow">L'Afrique aux Noirs</a>' in February 2019, An Mertens exchanged about 50 emails with Linda Hamilton, Sharon Joiner and Susan Hanlon, all volunteers from the Distributed Proofreaders project. The conversation is published <a href="http://www.algolit.net/index.php/Full_email_conversation" title="Full email conversation">here</a>. It might inspire you to share unavailable books online.
</p><h2 id="an-algoliterary-version-of-the-maintenance-manifesto"><span class="mw-headline" id="An_algoliterary_version_of_the_Maintenance_Manifesto">An algoliterary version of the Maintenance Manifesto</span></h2><p>In 1969, one year after the birth of her first child, the New York artist <a class="external text" href="https://en.wikipedia.org/wiki/Mierle_Laderman_Ukeles" rel="nofollow">Mierle Laderman Ukeles</a> wrote a <a class="external text" href="https://www.arnolfini.org.uk/blog/manifesto-for-maintenance-art-1969" rel="nofollow"><i>Manifesto for Maintenance Art</i></a>. The manifesto calls for a readdressing of the status of maintenance work both in the private, domestic space, and in public. What follows is an altered version of her text inspired by the work of the Cleaners.
</p><p>IDEAS
</p><p>A. The Death Instinct and the Life Instinct:
</p><p>The Death Instinct: separation; categorization; avant-garde <i>par excellence</i>; to follow the predicted path to death: run your own code; dynamic change.
</p><p>The Life Instinct: unification; the eternal return; the perpetuation and MAINTENANCE of the material; survival systems and operations; equilibrium.
</p><p>B. Two basic systems: Development and Maintenance.
</p><p>The sourball of every revolution: after the revolution, who's going to try to spot the bias in the output?
</p><p>Development: pure individual creation; the new; change; progress; advance; excitement; flight or fleeing.
</p><p>Maintenance: keep the dust off the pure individual creation; preserve the new; sustain the change; protect progress; defend and prolong the advance; renew the excitement; repeat the flight; show your work, show it again, keep the git repository groovy, keep the data analysis revealing.
</p><p>Development systems are partial feedback systems with major room for change.
</p><p>Maintenance systems are direct feedback systems with little room for alteration.
</p><p>C. Maintenance is a drag; it takes all the fucking time (lit.)
</p><p>The mind boggles and chafes at the boredom.
</p><p>The culture assigns lousy status to maintenance jobs = minimum wages, Amazon Mechanical Turks = virtually no pay.
</p><p>Clean the set, tag the training data, correct the typos, modify the parameters, finish the report, keep the requester happy, upload the new version, attach words that were wrongly separated by OCR back together, complete those Human Intelligence Tasks, try to guess the meaning of the requester's formatting, you must accept the HIT before you can submit the results, summarize the image, add the bounding box, what's the semantic similarity of this text, check the translation quality, collect your micro-payments, become a hit Mechanical Turk.
</p><h5 id="reference"><span class="mw-headline" id="Reference">Reference</span></h5><p><a class="external free" href="https://www.arnolfini.org.uk/blog/manifesto-for-maintenance-art-1969" rel="nofollow">https://www.arnolfini.org.uk/blog/manifesto-for-maintenance-art-1969</a>
</p><h2 id="a-bot-panic-on-amazon-mechanical-turk"><span class="mw-headline" id="A_bot_panic_on_Amazon_Mechanical_Turk">A bot panic on Amazon Mechanical Turk</span></h2><p><a class="external text" href="https://requester.mturk.com/create/projects/new" rel="nofollow">Amazon's Mechanical Turk</a> takes the name of a chess-playing automaton from the eighteenth century. In fact, <a class="external text" href="https://en.wikipedia.org/wiki/The_Turk" rel="nofollow">the Turk</a> wasn't a machine at all. It was a mechanical illusion that allowed a human chess master to hide inside the box and manually operate it. For nearly 84 years, the Turk won most of the games played during its demonstrations around Europe and the Americas. Napoleon Bonaparte is said to have been fooled by this trick too.
</p><p>Amazon Mechanical Turk is an online platform for humans to execute tasks that algorithms cannot. Examples include annotating sentences as being positive or negative, spotting number plates, and distinguishing faces from non-faces. The jobs posted on this platform are often paid less than a cent per task. Tasks that are more complex or require more knowledge can be paid up to several cents. To earn a living, Turkers need to finish as many tasks as fast as possible, leading to inevitable mistakes. As a result, the requesters have to incorporate quality checks when they post a job on the platform. They need to test whether the Turker actually has the ability to complete the task, and they also need to verify the results. Many academic researchers use Mechanical Turk as an alternative to having their students execute these tasks.
</p><p>In August 2018 <a class="external text" href="https://www.maxhuibai.com/" rel="nofollow">Max Hui Bai</a>, a psychology student from the University of Minnesota, discovered that the surveys he conducted with Mechanical Turk were full of nonsense answers to open-ended questions. He traced back the wrong answers and found out that they had been submitted by respondents with duplicate GPS locations. This raised suspicion. Though Amazon explicitly prohibits robots from completing jobs on Mechanical Turk, the company does not deal with the problems they cause on its platform. Forums for Turkers are full of conversations about the automation of the work, sharing practices of how to create robots that can even violate Amazon's terms. You can also find videos on YouTube that show Turkers how to write a bot to fill in answers for you.
</p><p>Kristy Milland, a Mechanical Turk activist, says: 'Mechanical Turk workers have been treated really, really badly for 12 years, and so in some ways I see this as a point of resistance. If we were paid fairly on the platform, nobody would be risking their account this way.'
</p><p>Bai is now leading a research project among social scientists to figure out how much bad data is in use, how large the problem is, and how to stop it. But it is impossible at the moment to estimate how many datasets have become unreliable in this way.
</p><h5 id="references"><span class="mw-headline" id="References">References</span></h5><p><a class="external free" href="https://requester.mturk.com/create/projects/new" rel="nofollow">https://requester.mturk.com/create/projects/new</a>
</p><p><a class="external free" href="https://www.wired.com/story/amazon-mechanical-turk-bot-panic/" rel="nofollow">https://www.wired.com/story/amazon-mechanical-turk-bot-panic/</a>
</p><p><a class="external free" href="https://www.maxhuibai.com/blog/evidence-that-responses-from-repeating-gps-are-random" rel="nofollow">https://www.maxhuibai.com/blog/evidence-that-responses-from-repeating-gps-are-random</a>
</p><p><a class="external free" href="http://timryan.web.unc.edu/2018/08/12/data-contamination-on-mturk/" rel="nofollow">http://timryan.web.unc.edu/2018/08/12/data-contamination-on-mturk/</a>
</p></section></section>
<h3 id="informants"><span class="mw-headline" id="Informants">Informants</span></h3>
<p>Machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extract patterns from. One should carefully choose the study material, and adapt it to the machine's task. It doesn't make sense to train a machine with nineteenth-century novels if its mission is to analyse tweets. A badly written textbook can lead a student to give up on the subject altogether. A good textbook is preferably not a textbook at all.
</p><p>This is where the dataset comes in: arranged as neatly as possible, organized in disciplined rows and lined-up columns, waiting to be read by the machine. Each dataset collects different information about the world, and like all collections, datasets are imbued with their collectors' bias. You will hear this expression very often: 'data is the new oil'. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is supposed to be clean. With each process, each questionnaire, each column title, it becomes cleaner and cleaner, chipping away distinct characteristics until it fits the mould of the dataset.
</p><p>Some datasets combine the machinic logic with the human logic. The models that require supervision multiply the subjectivities of both data collectors and annotators, then propagate what they've been taught. You will encounter some of the datasets that pass as default in the machine learning field, as well as other stories of humans guiding machines.
</p><p><br/>
</p>
<h5 id="works"><span class="mw-headline" id="Works_4">Works</span></h5>
<section class="group"><section class="lemma an-ethnography-of-datasets works"><h3 class="lemmaheader" id="an-ethnography-of-datasets">An Ethnography of Datasets</h3><p>by Algolit
</p><p>We often start the monthly Algolit meetings by searching for datasets or trying to create them. Sometimes we use already-existing corpora, made available through the Natural Language Toolkit (<a class="external text" href="http://www.nltk.org/" rel="nofollow">NLTK</a>). NLTK contains, among others, The Universal Declaration of Human Rights, inaugural speeches from US presidents, and movie reviews from the popular site Internet Movie Database (IMDb). Each style of writing will conjure different relations between the words and will reflect the moment in time from which they originate. The material included in NLTK was selected because it was judged useful for at least one community of researchers. In spite of specificities related to the initial context of each document, they become universal documents by default, via their inclusion into a collection of publicly available corpora. In this sense, the Python package for natural language processing could be regarded as a time capsule. The main reason why The Universal Declaration of Human Rights was included may have been the multiplicity of its translations, but it also paints a picture of the types of human writing that algorithms train on.
</p><p>With this work, we look at the datasets most commonly used by data scientists to train machine algorithms. What material do they consist of? Who collected them? When?
</p><hr/><p>Concept &amp; execution: Cristina Cochior
</p></section><section class="lemma who-wins works"><h3 class="lemmaheader" id="who-wins">Who wins</h3><p>Who wins: creation of relationships
</p><p>by Louise Dekeuleneer, student Arts²/Section Visual Communication
</p><p>French is a gendered language. Indeed many words are female or male and few are neutral. The aim of this project is to show that a patriarchal society also influences the language itself. The work focuses on showing whether more female or male words are used, and on highlighting the influence of context on the gender of words. At this stage, no conclusions have yet been drawn.
</p><p>Law texts from 1900 to 1910 made available by the Mundaneum have been passed into an algorithm that turns the text into a list of words. These words are then compared with another list of French words, which specifies whether each word is male or female. This list of words comes from Google Books. In 2012, Google created a huge database from all the books scanned and available on Google Books.
</p><p>Male words are highlighted in one colour and female words in another. Words that are not gendered (adverbs, verbs, etc.) are not highlighted. All this is saved as an HTML file so that it can be directly opened in a web page and printed without the need for additional layout. This is how each text becomes a small booklet by just changing the input text of the algorithm.
</p></section><section class="lemma the-annotator works"><h3 class="lemmaheader" id="the-annotator">The Annotator</h3><p>by Algolit
</p><p>The Annotator asks for the guidance of visitors in annotating the archive of the Mundaneum.
</p><p>The annotation process is a crucial step in supervised machine learning where the algorithm is given examples of what it needs to learn. A spam filter in training will be fed examples of spam and real messages. These examples are entries, or rows from the dataset with a label, spam or non-spam.
</p><p>The labelling of a dataset is work executed by humans: they pick a label for each row of the dataset. To ensure the quality of the labels, multiple annotators see the same row and have to give the same label before an example is included in the training data. Only when enough samples of each label have been gathered in the dataset can the computer start the learning process.
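</p><p>The agreement rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the code of the installation: the function names, the row ids, the example labels and the <i>minimum</i> threshold are all assumptions made for the example.
</p><pre>
```python
from collections import Counter


def agreed_labels(annotations):
    """Keep only the rows on which all annotators chose the same label.

    `annotations` maps a row id to the list of labels it received.
    """
    agreed = {}
    for row_id, labels in annotations.items():
        if labels and len(set(labels)) == 1:
            agreed[row_id] = labels[0]
    return agreed


def ready_to_train(agreed, expected_labels, minimum):
    """True once every expected label has at least `minimum` agreed rows."""
    counts = Counter(agreed.values())
    return all(counts[label] >= minimum for label in expected_labels)


# Hypothetical annotations: three visitors label four rows.
annotations = {
    "row-1": ["spam", "spam", "spam"],
    "row-2": ["spam", "non-spam", "spam"],
    "row-3": ["non-spam", "non-spam", "non-spam"],
    "row-4": ["non-spam", "non-spam", "non-spam"],
}
agreed = agreed_labels(annotations)
```
</pre><p>In this sketch the row on which the annotators disagree is simply left out of the training data. Other strategies, such as a majority vote, are equally possible.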
</p><p>In this interface we ask you to help us classify the cleaned texts from the Mundaneum archive to expand our training set and improve the quality of the installation 'Classifying the World' in Oracles.
</p><hr/><p>Concept, code, interface: Gijs de Heij
</p></section><section class="lemma 1000-synsets-vinyl-edition works"><h3 class="lemmaheader" id="1000-synsets-vinyl-edition">1000 synsets (Vinyl Edition)</h3><p>by Algolit
</p><p>Created in 1985, WordNet is a hierarchical taxonomy that describes the world. It was inspired by theories of human semantic memory developed in the late 1960s. Nouns, verbs, adjectives and adverbs are grouped into sets of synonyms, or synsets, each expressing a distinct concept.
</p><p>ImageNet is an image dataset based on the WordNet 3.0 nouns hierarchy. Each synset is depicted by thousands of images. From 2010 until 2017, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was a key benchmark in object category classification for pictures, having a major impact on software for photography, image searches and image recognition.
</p><p>1000 synsets (Vinyl Edition) contains the 1000 synsets used in this challenge recorded in the highest sound quality that this analog format allows. This work highlights the importance of the datasets used to train artificial intelligence (AI) models that run on devices we use on a daily basis. Some of them inherit classifications that were conceived more than 30 years ago. This sound work is an invitation to thoughtfully analyse them.
</p><hr/><p>Concept &amp; recording: Javier Lloret
</p><p>Voices: Sara Hamadeh &amp; Joseph Hughes
</p></section></section>
<section class="group"><section class="lemma contextual-stories-about-informants stories"><h3 class="lemmaheader" id="contextual-stories-about-informants">Contextual stories about Informants</h3><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Datasets_as_representations"><span class="tocnumber">1</span> <span class="toctext">Datasets as representations</span></a>
<ul>
<li class="toclevel-2 tocsection-2"><a href="#Reference"><span class="tocnumber">1.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-3"><a href="#Labeling_for_an_Oracle_that_detects_vandalism_on_Wikipedia"><span class="tocnumber">2</span> <span class="toctext">Labeling for an Oracle that detects vandalism on Wikipedia</span></a></li>
<li class="toclevel-1 tocsection-4"><a href="#How_to_make_your_dataset_known"><span class="tocnumber">3</span> <span class="toctext">How to make your dataset known</span></a></li>
<li class="toclevel-1 tocsection-5"><a href="#Extract_from_a_positive_IMDb_movie_review_from_the_NLTK_dataset"><span class="tocnumber">4</span> <span class="toctext">Extract from a positive IMDb movie review from the NLTK dataset</span></a></li>
<li class="toclevel-1 tocsection-6"><a href="#The_ouroboros_of_machine_learning"><span class="tocnumber">5</span> <span class="toctext">The ouroboros of machine learning</span></a>
<ul>
<li class="toclevel-2 tocsection-7"><a href="#Reference_2"><span class="tocnumber">5.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
</ul>
</div><h2 id="datasets-as-representations"><span class="mw-headline" id="Datasets_as_representations">Datasets as representations</span></h2><p>The data-collection processes that lead to the creation of the dataset raise important questions: who is the author of the data? Who has the privilege to collect? For what reason was the selection made? What is missing?
</p><p>The artist <a class="external text" href="http://mimionuoha.com/" rel="nofollow">Mimi Onuoha</a> gives a brilliant example of the importance of collection strategies. She chose the case of statistics related to hate crimes. In 2012, the <a class="external text" href="https://www.fbi.gov/services/cjis/ucr" rel="nofollow">FBI Uniform Crime Reporting</a> (UCR) Program registered almost 6000 hate crimes committed. However, the <a class="external text" href="https://bjs.gov/" rel="nofollow">Department of Justice's Bureau of Justice Statistics</a> came up with about 300,000 reports of such cases. That is over 50 times as many. The difference in numbers can be explained by how the data was collected. In the first situation law enforcement agencies across the country voluntarily reported cases. For the second survey, the Bureau of Justice Statistics distributed the <a class="external text" href="https://www.bjs.gov/index.cfm?ty=dcdetail&amp;iid=245" rel="nofollow">National Crime Victimization form</a> directly to the homes of victims of hate crimes.
</p><p>In the field of Natural Language Processing (NLP) the material that machine learners work with is text-based, but the same questions still apply: who are the authors of the texts that make up the dataset? During what period were the texts collected? What type of worldview do they represent?
</p><p>In 2017, Google's Top Stories algorithm pushed a thread from <a class="external text" href="http://www.4chan.org/" rel="nofollow">4chan</a>, a non-moderated content website, to the top of the results page when searching for the Las Vegas shooter. The name and portrait of an innocent person were linked to the terrible crime. Google changed its algorithm just a few hours after the mistake was discovered, but the error had already affected the person. The question is: why did Google not exclude 4chan content from the training dataset of the algorithm?
</p><h5 id="reference"><span class="mw-headline" id="Reference">Reference</span></h5><p><a class="external free" href="https://points.datasociety.net/the-point-of-collection-8ee44ad7c2fa" rel="nofollow">https://points.datasociety.net/the-point-of-collection-8ee44ad7c2fa</a>
</p><p><a class="external free" href="https://arstechnica.com/information-technology/2017/10/google-admits-citing-4chan-to-spread-fake-vegas-shooter-news/" rel="nofollow">https://arstechnica.com/information-technology/2017/10/google-admits-citing-4chan-to-spread-fake-vegas-shooter-news/</a>
</p><h2 id="labeling-for-an-oracle-that-detects-vandalism-on-wikipedia"><span class="mw-headline" id="Labeling_for_an_Oracle_that_detects_vandalism_on_Wikipedia">Labeling for an Oracle that detects vandalism on Wikipedia</span></h2><p>This fragment is taken from an interview with Amir Sarabadani, software engineer at Wikimedia. He was in Brussels in November 2017 during the Algoliterary Encounter.
</p><p><b>Femke</b>: If you think about Wikipedia as a living community, with every edit the project changes. Every edit is somehow a contribution to a living organism of knowledge. So, if from within that community you try to distinguish what serves the community and what doesn't and you try to generalize that, because I think that's what the good faith-bad faith algorithm is trying to do, to find helper tools to support the project, you do that on the basis of a generalization that is on the abstract idea of what Wikipedia is and not on the living organism of what happens every day. What interests me in the relation between vandalism and debate is how we can understand the conventional drive that sits in these machine-learning processes that we seem to come across in many places. And how can we somehow understand them and deal with them? If you place your separation of good faith-bad faith on pre-existing labelling and then reproduce that in your understanding of what edits are being made, how then to take into account movements that are happening, the life of the actual project?
</p><p><b>Amir</b>: It's an interesting discussion. Firstly, what we are calling good faith and bad faith comes from the community itself. We are not doing labelling for them, they are doing labelling for themselves. So, in many different language Wikipedias, the definition of what is good faith and what is bad faith will differ. Wikimedia is trying to reflect what is inside the organism and not to change the organism itself. If the organism changes, and we see that the definition of good faith and helping Wikipedia has been changed, we are implementing this feedback loop that lets people from inside their community pass judgement on their edits and if they disagree with the labelling, we can go back to the model and retrain the algorithm to reflect this change. It's some sort of closed loop: you change things and if someone sees there is a problem, then they tell us and we can change the algorithm back. It's an ongoing project.
</p><p>Reference: <a class="external free" href="https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/Interview%20with%20Amir/AS.aac" rel="nofollow">https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/Interview%20with%20Amir/AS.aac</a>
</p><h2 id="how-to-make-your-dataset-known"><span class="mw-headline" id="How_to_make_your_dataset_known">How to make your dataset known</span></h2><p><a class="external text" href="http://www.nltk.org/" rel="nofollow">NLTK</a> stands for Natural Language Toolkit. For programmers who process natural language using <a class="external text" href="https://www.python.org/" rel="nofollow">Python</a>, this is an essential library to work with. Many tutorial writers recommend that machine learning learners start with the inbuilt NLTK datasets. The library comprises 71 different collections, with a total of almost 6000 items.
</p><p>There is for example the Movie Review corpus for sentiment analysis. Or the Brown corpus, which was put together in the 1960s by Henry Kučera and W. Nelson Francis at Brown University in Rhode Island. There is also the Declaration of Human Rights corpus, which is commonly used to test whether the code can run on multiple languages. The corpus contains the Declaration of Human Rights expressed in 372 languages from around the world.
</p><p>But what is the process of getting a dataset accepted into the NLTK library nowadays? On the <a class="external text" href="https://github.com/nltk" rel="nofollow">GitHub page</a>, the NLTK team describes the following requirements:
</p><ul><li> Only contribute corpora that have obtained a basic level of notability. That means, there is a publication that describes it, and a community of programmers who are using it.</li><li> Ensure that you have permission to redistribute the data, and can document this. This means that the dataset is best published on an external website with a licence.</li><li> Use existing NLTK corpus readers where possible, or else contribute a well-documented corpus reader to NLTK. This means, you need to organize your data in such a way that it can be easily read using NLTK code.</li></ul><p><br/>
</p><h2 id="extract-from-a-positive-imdb-movie-review-from-the-nltk-dataset"><span class="mw-headline" id="Extract_from_a_positive_IMDb_movie_review_from_the_NLTK_dataset">Extract from a positive IMDb movie review from the NLTK dataset</span></h2><p>corpus: <a class="external text" href="https://www.nltk.org" rel="nofollow">NLTK</a>, movie reviews
</p><p>fileid: pos/cv998_14111.txt
</p><p>steven spielberg ' s second epic film on world war ii is an unquestioned masterpiece of film . spielberg , ever the student on film , has managed to resurrect the war genre by producing one of its grittiest , and most powerful entries . he also managed to cast this era ' s greatest answer to jimmy stewart , tom hanks , who delivers a performance that is nothing short of an astonishing miracle . for about 160 out of its 170 minutes , " saving private ryan " is flawless . literally . the plot is simple enough . after the epic d - day invasion ( whose sequences are nothing short of spectacular ) , capt . john miller ( hanks ) and his team are forced to search for a pvt . james ryan ( damon ) , whose brothers have all died in battle . once they find him , they are to bring him back for immediate discharge so that he can go home . accompanying miller are his crew , played with astonishing perfection by a group of character actors that are simply sensational . barry pepper , adam goldberg , vin diesel , giovanni ribisi , davies , and burns are the team sent to find one man , and bring him home . the battle sequences that bookend the film are extraordinary . literally .
</p><h2 id="the-ouroboros-of-machine-learning"><span class="mw-headline" id="The_ouroboros_of_machine_learning">The ouroboros of machine learning</span></h2><p><a class="external text" href="https://en.wikipedia.org" rel="nofollow">Wikipedia</a> has become a source for learning not only for humans, but also for machines. Its articles are prime sources for training models. But very often, the material the machines are trained on is the same content that they helped to write. In fact, at the beginning of Wikipedia, many articles were written by bots. Rambot, for example, was a controversial bot figure on the English-speaking platform. It authored 98 per cent of the pages describing US towns.
</p><p>As a result of serial and topical robot interventions, the models that are trained on the full Wikipedia dump have a unique view on composing articles. For example, a topic model trained on all Wikipedia articles will associate 'river' with 'Romania' and 'village' with 'Turkey'. This is because there are over 10,000 pages written about villages in Turkey. This should be enough to spark anyone's desire for a visit, but it is far too much compared to the number of articles other countries have on the subject. The asymmetry causes a false correlation and needs to be redressed. Most models try to exclude the work of these prolific robot writers.
</p><h5 id="reference"><span class="mw-headline" id="Reference_2">Reference</span></h5><p><a class="external free" href="https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/" rel="nofollow">https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/</a>
</p></section></section>
<h3 id="readers"><span class="mw-headline" id="Readers">Readers</span></h3>
<p>We communicate with computers through language. We click on icons that have a description in words, we type words on keyboards, we use our voice to give them instructions. Sometimes we trust our computers with our most intimate thoughts and forget that they are extensive calculators. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital 'A' is 65, or 1000001 in binary.
</p><p>In all models, whether rule-based, classical machine learning or neural networks, words undergo some type of translation into numbers so that the model can work with the semantic meaning of language. This is done through counting. Some models count the frequency of single words, some might count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs or noun and verb phrases. Some just replace the words in a text by their index numbers. Numbers optimize the operative speed of computer processes, leading to fast predictions, but they also remove the symbolic links that words might have. Here we present a few techniques that are dedicated to making text readable to a machine.
</p><p><br/>
</p>
<h5 id="works"><span class="mw-headline" id="Works_5">Works</span></h5>
<section class="group"><section class="lemma the-book-of-tomorrow-in-a-bag-of-words works"><h3 class="lemmaheader" id="the-book-of-tomorrow-in-a-bag-of-words">The Book of Tomorrow in a Bag of Words</h3><p>by Algolit
</p><p>The bag-of-words model is a simplifying representation of text used in Natural Language Processing (NLP). In this model, a text is represented as a collection of its unique words, disregarding grammar, punctuation and even word order. The model transforms the text into a list of words and how many times they're used in the text, or quite literally a bag of words.
</p><p>This heavy reduction of language was a big shock when we began to work with machine learning. Bag of words is often used as a baseline, which a new model has to outperform. It can capture the subject of a text by recognizing the most frequent or important words. It is often used to measure the similarities of texts by comparing their bags of words.
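</p><p>As a rough illustration, a bag of words and a simple comparison between two bags might look as follows in Python. The tokenization rule and the choice of the Jaccard measure are assumptions made for this sketch; many other similarity measures are used in practice.
</p><pre>
```python
import re
from collections import Counter


def bag_of_words(text):
    """Count each lowercased word, ignoring word order and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


def jaccard_similarity(bag_a, bag_b):
    """Share of distinct words that two bags have in common (0.0 to 1.0)."""
    words_a, words_b = set(bag_a), set(bag_b)
    union = words_a.union(words_b)
    if not union:
        return 0.0
    return len(words_a.intersection(words_b)) / len(union)


# Two hypothetical example sentences.
first = bag_of_words("The book of tomorrow is a book.")
second = bag_of_words("Tomorrow, the bag is a book!")
similarity = jaccard_similarity(first, second)
```
</pre><p>Here <i>first</i> counts 'book' twice and every other word once. The two bags share five of their seven distinct words, although the sentences say different things: word order and grammar are lost in the bag.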
</p><p>For this work the article 'Le Livre de Demain' by engineer G. Vander Haeghen, published in 1907 in the <i>Bulletin de l'Institut International de Bibliographie</i> of the Mundaneum, has been literally reduced to a bag of words. You can buy a bag at the reception of the Mundaneum.
</p><hr/><p>Concept &amp; realisation: An Mertens
</p></section><section class="lemma tf-idf works"><h3 class="lemmaheader" id="tf-idf">TF-IDF</h3><p>by Algolit
</p><p>TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting method used in text search. This statistical measure makes it possible to evaluate the importance of a term contained in a document, relative to a collection or corpus of documents. The weight increases in proportion to the number of occurrences of the word in the document. It decreases as the word appears in more documents of the corpus, so that words used everywhere count for little. TF-IDF is used in particular in the classification of spam in email software.
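</p><p>A minimal sketch of one common variant of this weighting, written in Python, is given here. It multiplies the raw count of a term in a document by the logarithm of the number of documents divided by the number of documents containing the term. The toy corpus is hypothetical, and real software often adds smoothing or normalization that is left out of this sketch.
</p><pre>
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    `documents` is a list of token lists. The weight of a term is its count
    in the document times log(N / number of documents containing the term).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus of three tiny 'emails'.
corpus = [
    "win money now win".split(),
    "meeting now".split(),
    "money for the meeting".split(),
]
weights = tf_idf(corpus)
```
</pre><p>In the first 'email', 'win' receives the highest weight: it occurs twice and appears in no other document. A word that occurs in every document of the corpus would receive a weight of zero.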
</p><p>A web-based interface shows this algorithm through animations making it possible to understand the different steps of text classification. How does a TF-IDF-based programme read a text? How does it transform words into numbers?
</p><hr/><p>Concept, code, animation: Sarah Garcin
</p></section><section class="lemma growing-a-tree works"><h3 class="lemmaheader" id="growing-a-tree">Growing a tree</h3><p>by Algolit
</p><p>Parts of speech are the categories of words that we learn at school: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article, or determiner.
</p><p>In Natural Language Processing (NLP) there are many tools that allow sentences to be parsed. This means that the algorithm can determine the part of speech of each word in a sentence. 'Growing a tree' uses this technique to identify all nouns in a specific sentence. Each noun is then replaced by its definition. This allows the sentence to grow autonomously and infinitely. The recipe of 'Growing a tree' was inspired by Oulipo's constraint of '<a class="external text" href="http://oulipo.net/fr/contraintes/litterature-definitionnelle" rel="nofollow"><i>littérature définitionnelle</i></a>', invented by Marcel Benabou in 1966. In a given phrase, one replaces every significant element (noun, adjective, verb, adverb) by one of its definitions in a given dictionary; one reiterates the operation on the newly received phrase, and again.
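</p><p>The recipe can be sketched in Python with a toy dictionary. This is not the code of the installation: the three definitions are invented for the example, and the hand-written dictionary stands in for both the part-of-speech tagger and the dictionary of definitions, so only the words it lists are treated as nouns.
</p><pre>
```python
# Hypothetical mini-dictionary standing in for the part-of-speech tagger and
# the dictionary of definitions: only the nouns listed here are replaced.
DEFINITIONS = {
    "tree": "tall plant with a trunk",
    "plant": "living organism with roots",
    "trunk": "main stem of a tree",
}


def grow(sentence, definitions):
    """Replace every known noun in the sentence by its definition, once."""
    words = sentence.split()
    return " ".join(definitions.get(word, word) for word in words)


def grow_repeatedly(sentence, definitions, rounds):
    """Apply the replacement several times, keeping each intermediate stage."""
    stages = [sentence]
    for _ in range(rounds):
        stages.append(grow(stages[-1], definitions))
    return stages


stages = grow_repeatedly("I planted a tree", DEFINITIONS, 2)
```
</pre><p>After one round the sentence reads 'I planted a tall plant with a trunk', after two rounds 'I planted a tall living organism with roots with a main stem of a tree'. Because the definition of 'trunk' contains 'tree' again, the sentence can keep growing for as many rounds as one likes.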
</p><p>The dictionary of definitions used in this work is WordNet. <a class="external text" href="https://wordnet.princeton.edu/" rel="nofollow">WordNet</a> is a combination of a dictionary and a thesaurus that can be read by machines. According to <a class="external text" href="https://en.wikipedia.org/wiki/WordNet" rel="nofollow">Wikipedia</a> it was created in the Cognitive Science Laboratory of Princeton University starting in 1985. The project was initially funded by the US Office of Naval Research and later also by other US government agencies including DARPA, the National Science Foundation, the Disruptive Technology Office (formerly the Advanced Research and Development Activity), and REFLEX.
</p><hr/><p>Concept, code &amp; interface: An Mertens &amp; Gijs de Heij
</p></section><section class="lemma algorithmic-readings-of-bertillons-portrait-parlé works"><h3 class="lemmaheader" id="algorithmic-readings-of-bertillons-portrait-parlé">Algorithmic readings of Bertillon's portrait parlé</h3><p>by Guillaume Slizewicz (Urban Species)
</p><p>Written in 1907, <i>Un code télégraphique du portrait parlé</i> is an attempt to translate the 'spoken portrait', a face-description technique created by a policeman in Paris, into numbers. By implementing this code, it was hoped that faces of criminals and fugitives could easily be communicated between countries over the telegraphic network. In its form, content and ambition this text represents our complicated relationship with documentation technologies. This text sparked the creation of the following installations for three reasons:
</p><p>- First, the text is an algorithm in itself, a compression algorithm, or to be more precise, the presentation of a compression algorithm. It tries to reduce the information to smaller pieces while keeping it legible for the person who has the code. In this regard it is linked to the way we create technology, our pursuit of more efficiency, quicker results, cheaper methods. It represents our appetite for putting numbers on the entire world, measuring the smallest things, labeling the tiniest differences. This text itself embodies the vision of the Mundaneum.
</p><p>- Second, it is about the reasons for and the applications of technology. It is almost ironic that this text was in the selected archives presented to us in a time when face recognition and data surveillance are so much in the news. This text bears the same characteristics as some of today's technology: motivated by social control, classifying people, laying the basis for a surveillance society. Facial features are at the heart of recent controversies: mugshots were standardized by Bertillon, now they are used to train neural networks to distinguish criminals from law-abiding citizens. Facial recognition systems allow the arrest of criminals via CCTV infrastructure and some assert that people's features can predict sexual orientation.
</p><p>- The last point is about how it represents the evolution of mankind's techno-structure: what our tools allow us to do, what they forbid, what they hinder, what they make us remember and what they make us forget. This document enables a classification between people and a certain vision of what normality is. It breaks the continuum into pieces, thus allowing stigmatization and discrimination. On the other hand, this document also feels obsolete today, because our techno-structure does not need such detailed written descriptions of fugitives, criminals or citizens. We can now find fingerprints, iris scans or DNA info in large datasets and compare them directly. Sometimes the technological systems do not even need human supervision and directly recognize the identity of a person via their facial features or their gait. Computers do not use intricate written language to describe a face, but arrays of integers. Hence all the words used in this document seem <i>désuets</i>, dated. Have we forgotten what some of them mean? Did photography make us forget how to describe faces? Will voice-assistance software teach us again?
</p><p><i>Writing with Otlet</i>
</p><p>Writing with Otlet is a character generator that uses the spoken portrait code as its database. Random numbers are generated and translated into a set of features. By creating unique instances, the algorithm reveals the richness of the description that is possible with the portrait code while at the same time embodying its nuances.
</p><p><i>An interpretation of Bertillon's spoken portrait.</i>
</p><p>This work draws a parallel between Bertillon's system and current ones. A webcam linked to a facial recognition algorithm captures the beholder's face and translates it into numbers on a canvas, printing it alongside Bertillon's labelled faces.
</p><h5 id="references"><span class="mw-headline" id="References">References</span></h5><p><a class="external free" href="https://www.technologyreview.com/s/602955/neural-network-learns-to-identify-criminals-by-their-faces/" rel="nofollow">https://www.technologyreview.com/s/602955/neural-network-learns-to-identify-criminals-by-their-faces/</a>
<a class="external free" href="https://fr.wikipedia.org/wiki/Bertillonnage" rel="nofollow">https://fr.wikipedia.org/wiki/Bertillonnage</a>
<a class="external free" href="https://callingbullshit.org/case_studies/case_study_criminal_machine_learning.html" rel="nofollow">https://callingbullshit.org/case_studies/case_study_criminal_machine_learning.html</a>
</p></section><section class="lemma hangman works"><h3 class="lemmaheader" id="hangman">Hangman</h3><p>by Laetitia Trozzi, student Arts²/Section Digital Arts
</p><p>What better way to discover Paul Otlet and his passion for literature than to play hangman? Through this simple game, which consists in guessing the missing letters in a word, the goal is to make the public discover terms and facts related to one of the creators of the Mundaneum.
</p><p>Hangman uses an algorithm to detect the frequency of words in a text. From the results, a series of significant words was isolated in Paul Otlet's bibliography. This series of words is integrated into a hangman game presented in a terminal. The difficulty of the game gradually increases as the player is offered longer and longer words. Over the different game levels, information about the life and work of Paul Otlet is displayed.
</p></section></section>
<section class="group"><section class="lemma contextual-stories-about-readers stories"><h3 class="lemmaheader" id="contextual-stories-about-readers">Contextual stories about Readers</h3><p><br/>
Naive Bayes, Support Vector Machines and Linear Regression are called classical machine learning algorithms. They perform well when learning with small datasets. But they often require complex Readers. The task the Readers do is also called feature engineering. This means that a human needs to spend time on a deep exploratory data analysis of the dataset.
</p><p>Features can be the frequency of words or letters, but also syntactical elements like nouns, adjectives, or verbs. The most significant features for the task to be solved must be carefully selected and passed over to the classical machine learning algorithm. This process marks the difference with Neural Networks. When using a neural network, there is no need for feature engineering. Humans can pass the data directly to the network and achieve fairly good performance straightaway. This saves a lot of time, energy and money.
</p><p>The downside of collaborating with Neural Networks is that you need a lot more data to train your prediction model. Think of 1 GB or more of plain text files. As a reference, one A4 page, a text file of 5000 characters, weighs only 5 KB. At that size, 1 GB amounts to roughly 200,000 pages. More data also requires more access to useful datasets and more, much more processing power.
</p><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Character_n-gram_for_authorship_recognition"><span class="tocnumber">1</span> <span class="toctext">Character n-gram for authorship recognition</span></a>
<ul>
<li class="toclevel-2 tocsection-2"><a href="#Reference"><span class="tocnumber">1.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-3"><a href="#A_history_of_n-grams"><span class="tocnumber">2</span> <span class="toctext">A history of n-grams</span></a></li>
<li class="toclevel-1 tocsection-4"><a href="#God_in_Google_Books"><span class="tocnumber">3</span> <span class="toctext">God in Google Books</span></a></li>
<li class="toclevel-1 tocsection-5"><a href="#Grammatical_features_taken_from_Twitter_influence_the_stock_market"><span class="tocnumber">4</span> <span class="toctext">Grammatical features taken from Twitter influence the stock market</span></a>
<ul>
<li class="toclevel-2 tocsection-6"><a href="#Reference_2"><span class="tocnumber">4.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Bag_of_words"><span class="tocnumber">5</span> <span class="toctext">Bag of words</span></a></li>
</ul>
</div><h2 id="character-n-gram-for-authorship-recognition"><span class="mw-headline" id="Character_n-gram_for_authorship_recognition">Character n-gram for authorship recognition</span></h2><p>Imagine … You've been working for a company for more than ten years. You have been writing tons of emails, papers, internal notes and reports on very different topics and in very different genres. All your writings, as well as those of your colleagues, are safely backed-up on the servers of the company.
</p><p>One day, you fall in love with a colleague. After some time you realize this human is rather mad and hysterical and also very dependent on you. The day you decide to break up, your (now) ex elaborates a plan to kill you. They succeed. This is unfortunate. A suicide letter in your name is left next to your corpse. Because of emotional problems, it says, you decided to end your life. Your best friends don't believe it. They decide to take the case to court. And there, based on the texts you and others produced over ten years, a machine learning model reveals that the suicide letter was written by someone else.
</p><p>How does a machine analyse texts in order to identify you? The most robust feature for authorship recognition is delivered by the character n-gram technique. It is used in cases where the writings vary in theme and genre. When using character n-grams, texts are considered as sequences of characters. Let's consider the character trigram. All the overlapping sequences of three characters are isolated. For example, the character 3-grams of 'Suicide' would be 'Sui', 'uic', 'ici', 'cid' and 'ide'. Character n-gram features are very simple, they're language-independent and they're tolerant to noise. Furthermore, spelling mistakes do not jeopardize the technique.
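</p><p>As a minimal sketch, the extraction of overlapping character n-grams could look like this in Python. The function names <code>char_ngrams</code> and <code>ngram_profile</code> are chosen for this illustration and do not come from the referenced paper.
</p>

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_profile(text, n=3):
    """Count how often each character n-gram occurs in a text."""
    return Counter(char_ngrams(text, n))

print(char_ngrams("Suicide"))
```

<p>A profile of such counts, computed on ten years of emails and reports, is the kind of feature that could then be compared with the profile of a disputed letter.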
</p><p>Patterns found with character n-grams focus on stylistic choices that are unconsciously made by the author. The patterns remain stable over the full length of the text, which is important for authorship recognition. Other types of experiments could include measuring the length of words or sentences, the vocabulary richness, the frequencies of function words; even syntax or semantics-related measurements.
</p><p>This means that not only your physical fingerprint is unique, but also the way you compose your thoughts!
</p><p>The same n-gram technique discovered that <i>The Cuckoo's Calling</i>, a novel by Robert Galbraith, was actually written by … J. K. Rowling!
</p><h5 id="reference"><span class="mw-headline" id="Reference">Reference</span></h5><ul><li> Paper: <a class="external text" href="https://brooklynworks.brooklaw.edu/cgi/viewcontent.cgi?article=1048&amp;context=jlp" rel="nofollow">On the Robustness of Authorship Attribution Based on Character N-gram Features</a>, Efstathios Stamatatos, in <i>Journal of Law &amp; Policy</i>, Volume 21, Issue 2, 2013.</li>
<li> News article: <a class="external free" href="https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/" rel="nofollow">https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/</a></li></ul><h2 id="a-history-of-n-grams"><span class="mw-headline" id="A_history_of_n-grams">A history of n-grams</span></h2><p>The n-gram algorithm can be traced back to the work of Claude Shannon in information theory. In the paper '<a class="external text" href="http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf" rel="nofollow">A Mathematical Theory of Communication</a>', published in 1948, Shannon presented the first n-gram-based model for natural language. He posed the question: given a sequence of letters, what is the likelihood of the next letter?
</p><p>If you read the following excerpt, can you tell who it was written by? Shakespeare or an n-gram piece of code?
</p><p>SEBASTIAN: Do I stand till the break off.
</p><p>BIRON: Hide thy head.
</p><p>VENTIDIUS: He purposeth to Athens: whither, with the vow
I made to handle you.
</p><p>FALSTAFF: My good knave.
</p><p>You may have guessed, considering the topic of this story, that an n-gram algorithm generated this text. The model is trained on the compiled works of Shakespeare. While more recent algorithms, such as the recurrent neural networks of the CharRNN, are becoming famous for their performance, n-grams are still used for a lot of NLP tasks: statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and more.
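</p><p>Shannon's question, which letter is likely to come next, can be turned into a small text generator. The following Python sketch is an assumption about how such a generator might be written, not the code that produced the excerpt above; the names <code>train</code> and <code>generate</code> are hypothetical, and the training text here is a single line instead of the complete works.
</p>

```python
import random
from collections import defaultdict

def train(text, n=3):
    """Map every sequence of n-1 characters to the characters seen after it (n must be 2 or more)."""
    model = defaultdict(list)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        model[context].append(next_char)
    return model

def generate(model, seed, length=80, rng=None):
    """Extend the seed one character at a time; the seed must be n-1 characters long."""
    rng = rng or random.Random()
    context_size = len(seed)
    output = seed
    for _ in range(length):
        followers = model.get(output[-context_size:])
        if not followers:
            break  # this context never occurred in the training text
        # Frequent followers appear more often in the list, so they are drawn more often.
        output += rng.choice(followers)
    return output

model = train("to be or not to be that is the question", n=3)
print(generate(model, "to", length=40))
```

<p>With so little training text the output mostly repeats its source. Trained on a large corpus, the same mechanism produces lines that look plausible at first sight.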
</p><h2 id="god-in-google-books"><span class="mw-headline" id="God_in_Google_Books">God in Google Books</span></h2><p>In 2006, Google created a <a class="external text" href="http://storage.googleapis.com/books/ngrams/books/datasetsv2.html" rel="nofollow">dataset of n-grams</a> from their digitized book collection and released it online. Recently they also created an <a class="external text" href="https://books.google.com/ngrams" rel="nofollow">n-gram viewer</a>.
</p><p>This allowed for many socio-linguistic investigations. For example, in October 2018, the <i>New York Times Magazine</i> published an opinion article titled <i><a class="external text" href="https://www.nytimes.com/2018/10/13/opinion/sunday/talk-god-sprituality-christian.html" rel="nofollow">'It's Getting Harder to Talk About God'</a></i>. The author, Jonathan Merritt, had analysed the mention of the word 'God' in Google's dataset using the n-gram viewer. He concluded that there had been a decline in the word's usage since the twentieth century. Google's corpus contains texts from the sixteenth century leading up to the twenty-first. However, what the author missed out on was the growing popularity of scientific journals around the beginning of the twentieth century. This new genre, which did not mention the word 'God', shifted the dataset. If the scientific literature were taken out of the corpus, the frequency of the word 'God' would again flow like a gentle ripple from a distant wave.
</p><h2 id="grammatical-features-taken-from-twitter-influence-the-stock-market"><span class="mw-headline" id="Grammatical_features_taken_from_Twitter_influence_the_stock_market">Grammatical features taken from Twitter influence the stock market</span></h2><p>The boundaries between academic disciplines are becoming blurred. Economics research mixed with psychology, social science, cognitive and emotional concepts has given rise to a new economics subfield, called 'behavioral economics'. This means that researchers can start to explain stock market movement based on factors other than economic factors only. Both the economy and 'public opinion' can influence or be influenced by each other. A lot of research is being done on how to use 'public opinion' to predict tendencies in stock-price changes.
</p><p>'Public opinion' is estimated from sources of large amounts of public data, like tweets, blogs or online news. Research using machinic data analysis shows that the changes in stock prices can be predicted by looking at 'public opinion', to some degree. There are many scientific articles online that analyse press articles for the 'sentiment' expressed in them. An article can be marked as more or less positive or negative. The annotated press articles are then used to train a machine learning model, which predicts stock market trends, marking them as 'down' or 'up'. When a company gets bad press, traders sell. On the contrary, if the news is good, they buy.
</p><p>A paper by Haikuan Liu of the Australian National University states that the tense of verbs used in tweets can be an indicator of the frequency of financial transactions. His idea is based on the fact that verb conjugation is used in psychology to detect the early stages of human depression.
</p><h5 id="reference"><span class="mw-headline" id="Reference_2">Reference</span></h5><p>Paper: <a class="external text" href="http://courses.cecs.anu.edu.au/courses/CSPROJECTS/18S1/reports/u6013799.pdf" rel="nofollow">'Grammatical Feature Extraction and Analysis of Tweet Text: An Application towards Predicting Stock Trends'</a>, Haikuan Liu, Research School of Computer Science (RSCS), College of Engineering and Computer Science (CECS), The Australian National University (ANU)
</p><h2 id="bag-of-words"><span class="mw-headline" id="Bag_of_words">Bag of words</span></h2><p>In Natural Language Processing (NLP), 'bag of words' is considered to be an unsophisticated model. It strips text of its context and dismantles it into a collection of unique words. These words are then counted. In the previous sentences, for example, 'words' is mentioned three times, but this is not necessarily an indicator of the text's focus.
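</p><p>The counting can be checked with a short Python sketch. The function name <code>bag_of_words</code> and the simple rule of keeping only runs of letters are assumptions made for this illustration.
</p>

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep only runs of letters and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

text = ("In Natural Language Processing (NLP), 'bag of words' is considered to be "
        "an unsophisticated model. It strips text of its context and dismantles it "
        "into a collection of unique words. These words are then counted.")

bag = bag_of_words(text)
print(bag["words"])
print(bag.most_common(3))
```

<p>The bag keeps the counts but forgets the order: once the words are in the bag, the sentences they came from cannot be reconstructed.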
</p><p>The first appearance of the expression 'bag of words' seems to go back to 1954. <a class="external text" href="https://en.wikipedia.org/wiki/Zellig_Harris" rel="nofollow">Zellig Harris</a>, an influential linguist, published a paper called 'Distributional Structure'. In the section called 'Meaning as a function of distribution', he says 'for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use. The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic systems.'
</p></section></section>
<h3 id="learners"><span class="mw-headline" id="Learners">Learners</span></h3>
<p>Learners are the algorithms that distinguish machine learning practices from other types of practices. They are pattern finders, capable of crawling through data and generating some kind of specific 'grammar'. Learners are based on statistical techniques. Some need a large amount of training data in order to function, others can work with a small annotated set. Some perform well in classification tasks, like spam identification, others are better at predicting numbers, like temperatures, distances, stock market values, and so on.
</p><p>The terminology of machine learning is not yet fully established. Depending on the field, whether statistics, computer science or the humanities, different terms are used. Learners are also called classifiers. When we talk about Learners, we talk about the interwoven functions that have the capacity to generate other functions, evaluate and readjust them to fit the data. They are good at understanding and revealing patterns. But they don't always distinguish well which of the patterns should be repeated.
</p><p>In software packages, it is not always possible to distinguish the characteristic elements of the classifiers, because they are hidden in underlying modules or libraries. Programmers can invoke them using a single line of code. For this exhibition, we therefore developed two table games that show in detail the learning process of simple, but frequently used classifiers.
</p>
<h5 id="works"><span class="mw-headline" id="Works_6">Works</span></h5>
<section class="group"><section class="lemma naive-bayes-game works"><h3 class="lemmaheader" id="naive-bayes-game">Naive Bayes game</h3><p>by Algolit
</p><p>In machine learning Naive Bayes methods are simple probabilistic classifiers that are widely applied for spam filtering and deciding whether a text is positive or negative.
</p><p>They require only a small amount of training data to estimate the necessary parameters. They can be extremely fast compared to more sophisticated methods. They are difficult to generalize: they perform well on specific tasks and need to be trained with the same style of data that they will work with afterwards.
</p><p>This game allows you to play along with the rules of Naive Bayes. While manually executing the code, you create your own playful model that 'just works'. A word of caution is necessary: because you only train it with 6 sentences instead of the minimum 2000, it is not representative at all!
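</p><p>What such a manual execution amounts to can be sketched in Python. This is not the code of the game: the six training sentences are invented for this illustration, the function names are hypothetical, and add-one smoothing is an assumption chosen so that unseen words do not reduce a score to zero.
</p>

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """Count labels and words per label from (sentence, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for sentence, label in examples:
        words = sentence.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(model, sentence):
    """Return the label with the highest log-probability for the sentence."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Prior: how often this label occurs in the training data.
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in sentence.lower().split():
            # Likelihood of the word given the label, with add-one smoothing.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Six invented sentences, far too few for a representative model.
examples = [
    ("i love this book", "positive"),
    ("what a wonderful story", "positive"),
    ("a truly great poem", "positive"),
    ("i hate this book", "negative"),
    ("what a boring story", "negative"),
    ("a truly awful poem", "negative"),
]
model = train_naive_bayes(examples)
print(predict(model, "a wonderful poem"))
print(predict(model, "boring book"))
```

<p>The 'naive' part is that every word is treated as independent of its neighbours, so the score of a sentence is just the sum of the scores of its words.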
</p><hr/><p>Concept &amp; realisation: An Mertens
</p></section><section class="lemma linear-regression-game works"><h3 class="lemmaheader" id="linear-regression-game">Linear Regression game</h3><p>by Algolit
</p><p>Linear Regression is one of the best-known and best-understood algorithms in statistics and machine learning. It has been around for almost 200 years. It is an attractive model because the representation is so simple. In statistics, linear regression is a method that makes it possible to summarize and study relationships between two continuous (quantitative) variables.
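</p><p>For two variables, the simple representation is a straight line, described by a slope and an intercept. The Python sketch below fits that line with ordinary least squares; the function name <code>fit_line</code> and the four data points are assumptions made for this illustration, not material from the game.
</p>

```python
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that lie exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

<p>Real data never lies exactly on a line, so the fitted line is always an approximation, and every point that is added or left out shifts it.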
</p><p>By playing this game you will realize that as a player you have a lot of decisions to make. You will experience what it means to create a coherent dataset, to decide what is in and what is not in. If all goes well, you will feel the urge to change your data in order to obtain better results. This is part of the art of approximation that is at the basis of all machine learning practices.
</p><hr/><p>Concept &amp; realisation: An Mertens
</p></section><section class="lemma traité-de-documentation works"><h3 class="lemmaheader" id="traité-de-documentation">Traité de documentation</h3><p>Traité de Documentation. Three algorithmic poems.
</p><p>by Rémi Forte, designer-researcher at L'Atelier national de recherche typographique, Nancy, France
</p><p>serigraphy on paper, 60 × 80 cm, 25 ex., 2019, for sale at the reception of the Mundaneum.
</p><p>The poems, reproduced in the form of three posters, are an algorithmic and poetic re-reading of Paul Otlet's <i>Traité de documentation</i>. They are the result of an algorithm based on the mysterious rules of human intuition. It has been applied to a fragment taken from Paul Otlet's book and is intended to be representative of his bibliological practice.
</p><p>For each fragment, the algorithm splits the text; words and punctuation marks are counted and reordered into a list. In each line, the elements combine and exhaust the syntax of the selected fragment. Paul Otlet's language remains perceptible but exacerbated to the point of absurdity. For the reader, the systematization of the text is disconcerting and their reading habits are disrupted.
</p><p>Built according to a mathematical equation, the typographical composition of the poster is just as systematic as the poem. However, friction occurs occasionally; loop after loop, the lines extend to bite on the neighbouring column. Overlays are created and words are hidden by others. These telescopic handlers draw alternative reading paths.
</p></section></section>
<section class="group"><section class="lemma contextual-stories-about-learners stories"><h3 class="lemmaheader" id="contextual-stories-about-learners">Contextual stories about Learners</h3><div class="toc" id="toc"><div id="toctitle"><h2 id="contents">Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#Naive_Bayes_.26_Viagra"><span class="tocnumber">1</span> <span class="toctext">Naive Bayes &amp; Viagra</span></a>
<ul>
<li class="toclevel-2 tocsection-2"><a href="#Reference"><span class="tocnumber">1.1</span> <span class="toctext">Reference</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-3"><a href="#Naive_Bayes_.26_Enigma"><span class="tocnumber">2</span> <span class="toctext">Naive Bayes &amp; Enigma</span></a></li>
<li class="toclevel-1 tocsection-4"><a href="#A_story_about_sweet_peas"><span class="tocnumber">3</span> <span class="toctext">A story about sweet peas</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#References"><span class="tocnumber">3.1</span> <span class="toctext">References</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-6"><a href="#Perceptron"><span class="tocnumber">4</span> <span class="toctext">Perceptron</span></a></li>
<li class="toclevel-1 tocsection-7"><a href="#BERT"><span class="tocnumber">5</span> <span class="toctext">BERT</span></a></li>
</ul>
</div><h2 id="naive-bayes--viagra"><span class="mw-headline" id="Naive_Bayes_.26_Viagra">Naive Bayes &amp; Viagra</span></h2><p><a class="external text" href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier" rel="nofollow">Naive Bayes</a> is a famous learner that performs well with little data. We apply it all the time. Christian and Griffiths state in their book, <a class="external text" href="http://algorithmstoliveby.com/" rel="nofollow"><i>Algorithms To Live By</i></a>, that 'our days are full of small data'. Imagine, for example, that you're standing at a bus stop in a foreign city. The other person who is standing there has been waiting for 7 minutes. What do you do? Do you decide to wait? And if so, for how long? When will you initiate other options? Another example. Imagine a friend asking advice about a relationship. He's been together with his new partner for a month. Should he invite the partner to join him at a family wedding?
</p><p>Having pre-existing beliefs is crucial for Naive Bayes to work. The basic idea is that you calculate the probabilities based on prior knowledge and given a specific situation.
</p><p>The theorem was formulated during the 1740s by <a class="external text" href="https://en.wikipedia.org/wiki/Thomas_Bayes" rel="nofollow">Thomas Bayes</a>, a reverend and amateur mathematician. He dedicated his life to solving the question of how to win the lottery. But Bayes' rule was only made famous and known as it is today by the mathematician <a class="external text" href="https://en.wikipedia.org/wiki/Pierre-Simon_Laplace" rel="nofollow">Pierre Simon Laplace</a> in France a bit later in the same century. For a long time after Laplace's death, the theory sank into oblivion until it was dug up again during the Second World War in an effort to break the Enigma code.
</p><p>Most people today have come in contact with Naive Bayes through their email spam folders. Naive Bayes is a widely used algorithm for spam detection. It is by coincidence that Viagra, the erectile dysfunction drug, was approved by the US Food &amp; Drug Administration in 1997, around the same time as about 10 million users worldwide had made free webmail accounts. The selling companies were among the first to make use of email as a medium for advertising: it was an intimate space, at the time reserved for private communication, for an intimate product. In 2001, the first <a class="external text" href="https://spamassassin.apache.org/" rel="nofollow">SpamAssassin</a> programme relying on Naive Bayes was uploaded to <a class="external text" href="https://sourceforge.net/" rel="nofollow">SourceForge</a>, cutting down on guerrilla email marketing.
</p><h5 id="reference"><span class="mw-headline" id="Reference">Reference</span></h5><p><i>Machine Learners</i>, by Adrian MacKenzie, MIT Press, Cambridge, US, November 2017.
</p><h2 id="naive-bayes--enigma"><span class="mw-headline" id="Naive_Bayes_.26_Enigma">Naive Bayes &amp; Enigma</span></h2><p>This story about Naive Bayes is taken from the book '<a class="external text" href="https://yalebooks.yale.edu/book/9780300188226/theory-would-not-die" rel="nofollow"><i>The Theory That Would Not Die</i></a>', written by Sharon Bertsch McGrayne. Among other things, she describes how Naive Bayes was soon forgotten after the death of <a class="external text" href="https://en.wikipedia.org/wiki/Pierre-Simon_Laplace" rel="nofollow">Pierre Simon Laplace</a>, its inventor. The mathematician was said to have failed to credit the works of others. Therefore, he suffered widely circulated charges against his reputation. Only after 150 years was the accusation refuted.
</p><p>Fast forward to 1939, when Bayes' rule was still virtually taboo, dead and buried in the field of statistics. When France was occupied in 1940 by Germany, which controlled Europe's factories and farms, Winston Churchill's biggest worry was the U-boat peril. U-boat operations were tightly controlled by German headquarters in France. Each submarine received orders as coded radio messages long after it was out in the Atlantic. The messages were encrypted by word-scrambling machines, called Enigma machines. <a class="external text" href="https://en.wikipedia.org/wiki/Enigma_machine" rel="nofollow">Enigma</a> looked like a complicated typewriter. It was invented by the German firm Scherbius &amp; Ritter after the First World War, when the need for message-encoding machines had become painfully obvious.
</p><p>Interestingly, and luckily for Naive Bayes and the world, at that time, the British government and educational systems saw applied mathematics and statistics as largely irrelevant to practical problem-solving. So the British agency charged with cracking German military codes mainly hired men with linguistic skills. Statistical data was seen as bothersome because of its detail-oriented nature. So wartime data was often analysed not by statisticians, but by biologists, physicists, and theoretical mathematicians. None of them knew that Bayes' rule was considered to be unscientific in the field of statistics. Their ignorance proved fortunate.
</p><p>It was the now famous <a class="external text" href="https://en.wikipedia.org/wiki/Alan_Turing" rel="nofollow">Alan Turing</a>, a mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist, who used the probability system of Bayes' rule to design the 'bombe'. This was a high-speed electromechanical machine for testing every possible arrangement that an Enigma machine would produce. In order to crack the naval codes of the U-boats, Turing simplified the 'bombe' system using Bayesian methods. It turned the UK headquarters into a code-breaking factory. The story is well illustrated in <a class="external text" href="https://www.imdb.com/title/tt2084970/" rel="nofollow"><i>The Imitation Game</i></a>, a film by Morten Tyldum dating from 2014.
</p><h2 id="a-story-about-sweet-peas"><span class="mw-headline" id="A_story_about_sweet_peas">A story about sweet peas</span></h2><p>Throughout history, some models have been invented by people with ideologies that are not to our liking. The idea of regression stems from Sir <a class="external text" href="https://en.wikipedia.org/wiki/Francis_Galton" rel="nofollow">Francis Galton</a>, an influential nineteenth-century scientist. He spent his life studying the problem of heredity: understanding how strongly the characteristics of one generation of living beings manifested themselves in the following generation. He established the field of eugenics, defining it as 'the study of agencies under social control that may improve or impair the racial qualities of future generations, either physically or mentally'. Wikipedia presents Galton as a prime example of scientific racism.
Galton initially approached the problem of heredity by examining characteristics of the sweet pea plant. He chose this plant because the species can self-fertilize. Daughter plants inherit genetic variations from mother plants without a contribution from a second parent. This characteristic eliminates having to deal with multiple sources.
</p><p>Galton's research was appreciated by many intellectuals of his time. In 1869, in <a class="external text" href="http://galton.org/books/hereditary-genius/text/pdf/galton-1869-genius-v4.pdf" rel="nofollow"><i>Hereditary Genius</i></a>, Galton claimed that genius is mainly a matter of ancestry and he believed that there was a biological explanation for social inequality across races. Galton even influenced his half-cousin <a class="external text" href="https://en.wikipedia.org/wiki/Charles_Darwin" rel="nofollow">Charles Darwin</a> with his ideas. After reading Galton's paper, Darwin stated, 'You have made a convert of an opponent in one sense for I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal and hard work'. Luckily, the modern study of heredity managed to eliminate the myth of race-based genetic difference, something Galton tried hard to maintain.
</p><p>Galton's major contribution to the field was linear regression analysis, laying the groundwork for much of modern statistics. While we engage with the field of machine learning, Algolit tries not to forget that ordering systems hold power, and that this power has not always been used to the benefit of everyone. Machine learning has inherited many aspects of statistical research, some less agreeable than others. We need to be attentive, because these world views do seep into the algorithmic models that create new orders.
</p><h5 id="references"><span class="mw-headline" id="References">References</span></h5><p><a class="external free" href="http://galton.org/letters/darwin/correspondence.htm" rel="nofollow">http://galton.org/letters/darwin/correspondence.htm</a>
<a class="external free" href="https://www.tandfonline.com/doi/full/10.1080/10691898.2001.11910537" rel="nofollow">https://www.tandfonline.com/doi/full/10.1080/10691898.2001.11910537</a>
<a class="external free" href="http://www.paramoulipist.be/?p=1693" rel="nofollow">http://www.paramoulipist.be/?p=1693</a>
</p><h2 id="perceptron"><span class="mw-headline" id="Perceptron">Perceptron</span></h2><p>We find ourselves in a moment in time in which neural networks are sparking a lot of attention. But they have been in the spotlight before. The study of neural networks goes back to the 1940s, when the first neuron metaphor emerged. The neuron is not the only biological reference in the field of machine learning - think of the word corpus or training. The artificial neuron was constructed in close connection to its biological counterpart.
</p><p>Psychologist <a class="external text" href="https://en.wikipedia.org/wiki/Frank_Rosenblatt" rel="nofollow">Frank Rosenblatt</a> was inspired by fellow psychologist <a class="external text" href="https://en.wikipedia.org/wiki/Donald_O._Hebb" rel="nofollow">Donald Hebb</a>'s work on the role of neurons in human learning. Hebb stated that 'cells that fire together wire together'. His theory now lies at the basis of associative human learning, but also unsupervised neural network learning. It moved Rosenblatt to expand on the idea of the artificial neuron.
</p><p>In the late 1950s, he created the Perceptron, a model that learns through the weighting of inputs. It was set aside by the next generation of researchers, because a single Perceptron can only handle binary classification of linearly separable data. This means that the data has to be cleanly separable into two classes, as for example, men and women, black and white. It is clear that this type of data is very rare in the real world. When the so-called first AI winter arrived in the 1970s and the funding decreased, the Perceptron was also neglected. For ten years it stayed dormant. When spring settled at the end of the 1980s, a new generation of researchers picked it up again and used it to construct neural networks. These contain multiple layers of Perceptrons. That is how neural networks saw the light. One could say that the current machine learning season is particularly warm, but it takes another winter to know a summer.
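</p><p>As an illustration of learning 'through the weighting of inputs', here is a minimal sketch of a single Perceptron in Python. The function names, the learning rate, the number of epochs and the toy data (the logical AND of two inputs) are assumptions made for this example; it is not Rosenblatt's original implementation.</p>

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Learn one weight per input plus a bias with the perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction  # 0 when correct, +1 or -1 when wrong
            # Nudge each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Binary decision: 1 if the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# Toy data that one straight line can separate: the logical AND of two inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
```

<p>On this toy data the predictions end up matching the labels. If the labels are replaced by the exclusive OR (0, 1, 1, 0), no single Perceptron can get all four cases right, which illustrates the limitation described above.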
</p><h2 id="bert"><span class="mw-headline" id="BERT">BERT</span></h2><p>Some online articles say that the year 2018 marked a turning point for the field of Natural Language Processing (NLP). A series of deep-learning models achieved state-of-the-art results on tasks like question-answering or sentiment-classification. Google's BERT algorithm entered the machine learning competitions of last year as a sort of 'one model to rule them all'. It showed a superior performance over a wide variety of tasks.
</p><p>BERT is pre-trained; its weights are learned in advance through two unsupervised tasks. This means BERT doesn't need to be trained from scratch for each new task. You only have to fine-tune its weights. This also means that a programmer wanting to use BERT no longer knows what parameters BERT is tuned to, nor what data it has seen to learn its performances.
</p><p>BERT stands for Bidirectional Encoder Representations from Transformers. This means that BERT allows for bidirectional training. The model learns the context of a word based on all of its surroundings, left and right of a word. As such, it can differentiate between 'I accessed the bank account' and 'I accessed the bank of the river'.
</p><p>Some facts:
- BERT_large, with 345 million parameters, is the largest model of its kind. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with 'only' 110 million parameters.
- To run BERT you need to use TPUs. These are Google's processors, especially engineered for TensorFlow, the deep-learning platform. TPU rental rates range from $8/hr to $394/hr. Algolit doesn't want to work with off-the-shelf packages; we are interested in opening up the black box. In that case, BERT asks for quite some savings in order to be used.
</p></section></section>
<h2 id="glossary"><span class="mw-headline" id="Glossary">Glossary</span></h2>
<p>This is a non-exhaustive wordlist, based on terms that are frequently used in the exhibition. It might help visitors who are not familiar with the vocabulary related to the field of Natural Language Processing (NLP), Algolit or the Mundaneum.
</p><p><b>* Algolit:</b> A group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with code and texts that are published under free licenses. <a class="external free" href="http://www.algolit.net" rel="nofollow">http://www.algolit.net</a>
</p><p><b>* Algoliterary:</b> Word invented by Algolit for works that explore the point of view of the algorithmic storyteller. What kind of new forms of storytelling do we make possible in dialogue with machinic agencies?
</p><p><b>* Algorithm:</b> A set of instructions in a specific programming language that takes an input and produces an output.
</p><p><b>* Annotation:</b> The annotation process is a crucial step in supervised machine learning where the algorithm is given examples of what it needs to learn. A spam filter in training will be fed examples of spam and real messages. These examples are entries, or rows from the dataset with a label, spam or non-spam. The labelling of a dataset is work executed by humans: they pick a label for each row of the dataset. To ensure the quality of the labels, multiple annotators see the same row and have to give the same label before an example is included in the training data.
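</p><p>The agreement rule described here can be sketched in a few lines of Python. The function name and the example messages are invented for illustration; real annotation pipelines may use other agreement criteria.</p>

```python
def keep_unanimous(rows):
    """Keep only the rows on which every annotator chose the same label."""
    training_data = []
    for text, annotator_labels in rows:
        if len(set(annotator_labels)) == 1:
            training_data.append((text, annotator_labels[0]))
    return training_data


# Invented example rows: a message and the labels given by three annotators.
annotated_rows = [
    ("Win a free holiday now!!!", ["spam", "spam", "spam"]),
    ("Meeting moved to 14:00", ["non-spam", "non-spam", "non-spam"]),
    ("Special offer from your library", ["spam", "non-spam", "spam"]),
]
training_data = keep_unanimous(annotated_rows)
```

<p>In this sketch the third message is left out of the training data, because its annotators disagreed.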
</p><p><b>* AI or artificial intelligences:</b> In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Computer science defines AI research as the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. More specifically, Kaplan and Haenlein define AI as a system's ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. Colloquially, the term artificial intelligence is used to describe machines that mimic cognitive functions that humans associate with other human minds, such as learning and problem solving. (Wikipedia)
</p><p><b>* Bag of Words:</b> The bag-of-words model is a simplifying representation of text used in Natural Language Processing (NLP). In this model, a text is represented as a collection of its unique words, disregarding grammar, punctuation and even word order. The model transforms the text into a list of words and how many times they're used in the text, or quite literally a bag of words. Bag of words is often used as a baseline, on which the new model has to perform better.
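</p><p>A minimal sketch of this idea in Python, using only the standard library. The tokenisation rule (lowercase letters and apostrophes) and the example sentence are assumptions made for the illustration.</p>

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring order and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


bag = bag_of_words("The cat sat on the mat. The mat was flat.")
```

<p>Here 'the' is counted three times and 'mat' twice. Two texts that contain the same words in a different order produce the same bag.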
</p><p><b>* Character n-gram:</b> A technique that is used for authorship recognition. When using character n-grams, texts are considered as sequences of characters. Let's consider the character trigram. All the overlapping sequences of three characters are isolated. For example, the character 3-grams of 'Suicide' would be 'Sui', 'uic', 'ici', 'cid', etc. Patterns found with character n-grams focus on stylistic choices that are unconsciously made by the author. The patterns remain stable over the full length of the text.
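</p><p>The extraction of overlapping character sequences can be sketched as follows. The function name is chosen for this example.</p>

```python
def character_ngrams(text, n=3):
    """Return all overlapping sequences of n characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = character_ngrams("Suicide")
```

<p>For the word 'Suicide' this gives five trigrams: 'Sui', 'uic', 'ici', 'cid' and 'ide'.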
</p><p><b>* Classical Machine Learning:</b> Naive Bayes, Support Vector Machines and Linear Regression are called classical machine learning algorithms. They perform well when learning with small datasets. But they often require complex Readers. The task the Readers do is also called feature engineering (see below). This means that a human needs to spend time on a deep exploratory data analysis of the dataset.
</p><p><b>* Constant:</b> Constant is a non-profit, artist-run organisation based in Brussels since 1997 and active in the fields of art, media and technology. Algolit started as a project of Constant in 2012. <a class="external free" href="http://constantvzw.org" rel="nofollow">http://constantvzw.org</a>
</p><p><b>* Data workers:</b> Artificial intelligences that are developed to serve, entertain, record and know about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors.
</p><p><b>* Dump:</b> According to the English dictionary, a dump is an accumulation of refuse and discarded materials or the place where such materials are dumped. In computing, a dump refers to a database dump, a record of data from a database used for easy downloading or for backing up a database. Database dumps are often published by free software and free content projects, such as Wikipedia, to allow reuse or forking of the database.
</p><p><b>* Feature engineering:</b> The process of using domain knowledge of the data to create features that make machine learning algorithms work. This means that a human needs to spend time on a deep exploratory data analysis of the dataset.
In Natural Language Processing (NLP), features can be the frequency of words or letters, but also syntactical elements like nouns, adjectives, or verbs. The most significant features for the task to be solved must be carefully selected and passed over to the classical machine learning algorithm.
</p><p><b>* FLOSS or Free Libre Open Source Software:</b> Software that anyone is freely licensed to use, copy, study, and change in any way, and the source code is openly shared so that people are encouraged to voluntarily improve the design of the software. This is in contrast to proprietary software, where the software is under restrictive copyright licensing and the source code is usually hidden from the users. (Wikipedia)
</p><p><b>* git:</b> A software system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Before starting a new project, programmers create a "git repository" in which they will publish all parts of the code. The git repositories of Algolit can be found on <a class="external free" href="https://gitlab.constantvzw.org/algolit" rel="nofollow">https://gitlab.constantvzw.org/algolit</a>.
</p><p><b>* gutenberg.org:</b> Project Gutenberg is an online platform run by volunteers to encourage the creation and distribution of eBooks. It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 23 June 2018, Project Gutenberg reached 57,000 items in its collection of free eBooks. (Wikipedia)
</p><p><b>* Henri La Fontaine:</b> Henri La Fontaine (1854-1943) was a Belgian politician, feminist and pacifist. He was awarded the Nobel Peace Prize in 1913 for his involvement in the International Peace Bureau and his contribution to the organization of the peace movement. In 1895, together with Paul Otlet, he created the International Bibliography Institute, which became the Mundaneum. Within this institution, which aimed to bring together all the world's knowledge, he contributed to the development of the Universal Decimal Classification (UDC) system.
</p><p><b>* Kaggle:</b> An online platform where users find and publish data sets, explore and build machine learning models, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. About half a million data scientists are active on Kaggle. It was founded by Anthony Goldbloom and Ben Hamner in 2010 and acquired by Google in March 2017.
</p><p><b>* Literature:</b> Algolit understands the notion of literature in the way a lot of other experimental authors do. It includes all linguistic production, from the dictionary to the Bible, from Virginia Woolf's entire work to all versions of Terms of Service published by Google since its existence.
</p><p><b>* Machine learning models:</b> Algorithms based on statistics, mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or 'Natural Language Processing', in short, 'NLP'. These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their word to say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
</p><p><b>* Markov Chain:</b> Algorithm that scans the text for the transition probability of letter or word occurrences, resulting in transition probability tables which can be computed even without any semantic or grammatical natural language understanding. It can be used for analyzing texts, but also for recombining them. It is widely used in spam generation.
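</p><p>Both uses, building a transition probability table and recombining a text, can be sketched at word level in Python. The function names, the fixed random seed and the example sentence are assumptions made for this illustration.</p>

```python
import random
from collections import Counter, defaultdict


def transition_table(text):
    """For each word, compute the probability of each word that follows it."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    table = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        table[current] = {word: count / total for word, count in followers.items()}
    return table


def generate(table, start, length, seed=0):
    """Recombine the text by walking through the table from a start word."""
    rng = random.Random(seed)  # fixed seed so the walk is repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        if not followers:  # the word was never followed by anything
            break
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)


table = transition_table("the cat sat on the mat and the cat slept")
sentence = generate(table, "the", 6)
```

<p>In this table 'the' is followed by 'cat' with probability 2/3 and by 'mat' with probability 1/3. No grammar or meaning is involved; the generated sentence only follows these counts.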
</p><p><b>* Mechanical Turk:</b> The Amazon Mechanical Turk is an online platform for humans to execute tasks that algorithms cannot. Examples include annotating sentences as being positive or negative, spotting number plates, discriminating between face and non-face. The jobs posted on this platform are often paid less than a cent per task. Tasks that are more complex or require more knowledge can be paid up to several cents. Many academic researchers use Mechanical Turk as an alternative to have their students execute these tasks.
</p><p><b>* Mundaneum:</b> In the late nineteenth century two young Belgian jurists, Paul Otlet (1868-1944), the father of documentation, and Henri La Fontaine (1854-1943), statesman and Nobel Peace Prize winner, created The Mundaneum. The project aimed at gathering all the world's knowledge and filing it using the Universal Decimal Classification (UDC) system that they had invented.
</p><p><b>* Natural Language:</b> A natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are different from constructed and formal languages such as those used to program computers or to study logic. (Wikipedia)
</p><p><b>* NLP or Natural Language Processing:</b> Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes algorithms that take human-produced text as input, and attempt to generate text that resembles it.
</p><p><b>* Neural Networks:</b> Computing systems inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems learn to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as cat or no cat and using the results to identify cats in other images. They do this without any prior knowledge about cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process. (Wikipedia)
</p><p><b>* Optical Character Recognition (OCR):</b> Computer processes for translating images of scanned texts into manipulable text files.
</p><p><b>* Oracle:</b> Oracles are prediction or profiling machines, a specific type of algorithmic model, mostly based on statistics. They are widely used in smartphones, computers, tablets.
</p><p><b>* Oulipo:</b> Oulipo stands for Ouvroir de littérature potentielle (Workspace for Potential Literature). Oulipo was created in Paris by the French writers Raymond Queneau and François Le Lionnais. They rooted their practice in the European avant-garde of the twentieth century and in the experimental tradition of the 1960s. For Oulipo, the creation of rules becomes the condition to generate new texts, or what they call potential literature. Later, in 1981, they also created <a class="external text" href="http://www.alamo.free.fr/" rel="nofollow">ALAMO</a>, Atelier de littérature assistée par la mathématique et les ordinateurs (Workspace for literature assisted by maths and computers).
</p><p><b>* Paul Otlet:</b> Paul Otlet (1868-1944) was a Belgian author, entrepreneur, visionary, lawyer and peace activist; he is one of several people who have been considered the father of information science, a field he called 'documentation'. Otlet created the Universal Decimal Classification, which was widespread in libraries. Together with Henri La Fontaine he created the Palais Mondial (World Palace), later the Mundaneum, to house the collections and activities of their various organizations and institutes.
</p><p><b>* Python:</b> The main programming language that is globally used for natural language processing. It was invented in 1991 by the Dutch programmer Guido van Rossum.
</p><p><b>* Rule-Based models:</b> Oracles can be created using different techniques. One way is to manually define rules for them. As prediction models they are then called rule-based models, as opposed to statistical models. Rule-based models are handy for tasks that are specific, like detecting when a scientific paper concerns a certain molecule. With very little sample data, they can perform well.
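</p><p>A hand-written rule of this kind could look like the sketch below. The molecule names and example sentences are hypothetical and chosen only for illustration; a real rule-based model would usually combine many such rules.</p>

```python
import re

# Hypothetical hand-written rule: the molecule names are illustrative assumptions.
MOLECULE_RULE = re.compile(r"\b(caffeine|theobromine)\b", re.IGNORECASE)


def concerns_molecule(abstract):
    """Return True when the hand-written rule matches the text."""
    return MOLECULE_RULE.search(abstract) is not None


flagged = concerns_molecule("We measure the effect of caffeine on sleep.")
ignored = concerns_molecule("We measure the effect of music on sleep.")
```

<p>No training data is needed here: the first sentence is flagged and the second is not, purely because of the rule a human wrote.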
</p><p><b>* Sentiment analysis:</b> Also called 'opinion mining'. A basic task in sentiment analysis is classifying a given text as positive, negative, or neutral. Advanced, 'beyond polarity' sentiment classification looks, for instance, at emotional states such as 'angry', 'sad', and 'happy'. Sentiment analysis is widely applied to user materials such as reviews and survey responses, comments and posts on social media, and healthcare materials for applications that range from marketing to customer service, from stock exchange transactions to clinical medicine.
</p><p><b>* Supervised machine learning models:</b> For the creation of supervised machine learning models, humans annotate sample text with labels before feeding it to a machine to learn. Each sentence, paragraph or text is judged by at least 3 annotators: whether it is spam or not spam, positive or negative etc.
</p><p><b>* Training data:</b> Machine learning algorithms need guidance. In order to separate one thing from another, they need texts to extract patterns from. One should carefully choose the training material, and adapt it to the machine's task. It doesn't make sense to train a machine with nineteenth-century novels if its mission is to analyze tweets.
</p><p><b>* Unsupervised Machine Learning Models:</b> Unsupervised machine learning models don't need the step of annotation of the data by humans. This saves a lot of time, energy, money. Instead, they need a large amount of training data, which is not always available and can take a long cleaning time beforehand.
</p><p><b>* Word embeddings:</b> Language modelling techniques that, through multiple mathematical operations of counting and ordering, plot words into a multi-dimensional vector space. When words are embedded, they are transformed from distinct symbols into mathematical objects that can be multiplied, divided, added or subtracted.
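</p><p>To give an idea of what adding and subtracting words can mean, here is a toy sketch in Python. The three-dimensional vectors are made up by hand for this example; real embeddings are learned from large amounts of text and usually have hundreds of dimensions.</p>

```python
import math

# Hand-made toy vectors (assumption): real embeddings are learned from text.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Similarity of direction between two vectors, between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# 'king' minus 'man' plus 'woman', then look for the nearest known word.
result = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])
closest = max(vectors, key=lambda word: cosine_similarity(result, vectors[word]))
```

<p>With these invented numbers the nearest word to the result is 'queen'. With learned embeddings such analogies hold only approximately, and they also reflect the biases of the training texts.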
</p><p><b>* Wordnet:</b> Wordnet is a combination of a dictionary and a thesaurus that can be read by machines. According to Wikipedia it was created in the Cognitive Science Laboratory of Princeton University starting in 1985. The project was initially funded by the US Office of Naval Research and later also by other US government agencies including DARPA, the National Science Foundation, the Disruptive Technology Office (formerly the Advanced Research and Development Activity), and REFLEX.
</p>
</section></body>
</html>