63 lines
4.2 KiB
HTML
{% extends "en/base.html" %}
{% block title %}{% endblock %}
{% block search %}
{% endblock %}
{% block results %}
<div class="cross">
<p class="tfidf" style="margin-left: calc(50% + 1.5em);">
<code>
from math import log<br /><br />
def tfidf(query, words, corpus):<br /><br />
    # Term Frequency<br />
    tf_count = 0<br />
    for word in words:<br />
        if query == word:<br />
            tf_count += 1<br />
    tf = tf_count/len(words)<br />
<br />
    # Inverse Document Frequency<br />
    idf_count = 0<br />
    for document in corpus:<br />
        if query in document:<br />
            idf_count += 1<br />
    # assumes the common log(N/df) form of the IDF<br />
    idf = log(len(corpus)/idf_count)<br />
<br />
    tfidf_value = tf * idf<br />
<br />
    return tf_count, idf_count, tfidf_value
</code>
</p>
<br><br>
<p class="note">[Note on contrast mappings]</p>
<p class="tfidf" style="float: right;margin-left:1em;">
The TF-IDF algorithm, shown above in the programming language Python, weaves a layer of contrast into the text. Not literally, but in the form of numbers. The words with the most contrast are those that the algorithm considers the most important for that text.
<br><br>
These contrast mappings allow for reading across the manifesto and the algorithm.
<br><br>
The TF-IDF values are calculated in two steps. The algorithm first calculates the <em>Term Frequency</em> (TF) by counting the appearances of a word in the text, relative to the total number of words in the document. This relative way of counting makes it possible to compare word counts between documents of varying lengths, for example Donna Haraway's long essay <em><em>A Cyborg Manifesto</em></em> (1984) and the relatively short text of <em><em>The Call for Feminist Data</em></em>, written by Caroline Sinders (2018).
<br><br>
In the second step, the algorithm weighs this count against all the other documents in the same dataset, using the <em>Inverse Document Frequency</em> (IDF). This part of the algorithm, which is Karen Spärck Jones’ addition, introduces a subtle form of inverse relative counting across all the documents in the dataset. Instead of just counting word frequency in one document, Spärck Jones proposed to count in a relative, inter-document way.
<br><br>
This means that when a word appears in only one or a few documents, its value is greatly enlarged. As a consequence, words such as <em><em>the</em></em> or <em><em>it</em></em> are given a very low number, as they appear in all the documents. Specific words, such as <em>paranodal</em> in <em>A Feminist Server Manifesto</em>, get a very high value: this word is used only 4 times in the whole dataset, and all 4 occurrences are in this manifesto.
<br><br>
Another example is <em>SCUM</em>. Although the word <em>SCUM</em> is not the most commonly used word in the <em>S.C.U.M. Manifesto</em>, it is the word that gets the highest score: relative to all the other manifestos, <em>SCUM</em> is used mostly in this manifesto, which increases its score considerably.
</p>
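{#
The two steps described in the note can be tried out on a toy corpus. This is a minimal sketch rather than the code behind this page: the three word lists are made up for illustration, the function returns only the final score, and the IDF is assumed to be the common natural-log form log(N / df), which the note itself does not specify. The query is also assumed to occur in at least one document, otherwise df would be zero.

```python
from math import log


def tfidf(query, words, corpus):
    """Score `query` for one document (`words`) against every document in `corpus`."""
    # Term frequency: occurrences of the query relative to the document length.
    tf = words.count(query) / len(words)
    # Document frequency: the number of documents that contain the query.
    df = sum(1 for document in corpus if query in document)
    # Inverse document frequency (assumed form): rarer across the corpus, larger value.
    idf = log(len(corpus) / df)
    return tf * idf


# Hypothetical toy corpus: three very short "manifestos" as lists of words.
corpus = [
    "the server is a paranodal server".split(),
    "the cyborg is a hybrid".split(),
    "the data is feminist".split(),
]

common = tfidf("the", corpus[0], corpus)
rare = tfidf("paranodal", corpus[0], corpus)
print(common, rare)
```

With these made-up lists, "the" occurs in every document, so its IDF is log(3 / 3) = 0 and its score is 0. "paranodal" occurs in only one of the three documents, so it keeps a positive score even though it appears just once.
#}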
</div>
<div id="mappings">
<h1>{{ manifesto | prettyfilename }}</h1>
{% for sentence in mappings %}
<p class="sentence">
{% for word, tfidf in sentence %}
<strong class="query" style="font-size:{{ 50 + tfidf }}%;"> <a href="/{{ lang }}/?q={{ word | urlencode }}">{{ word }}</a> </strong>
{% endfor %}
</p>
{% endfor %}
</div>
{% endblock %}
{% block suggestions %}
{% endblock %}