Special Issue 5 - OuNuPo
XPUB, Varia and WORM invite you for an evening of book scanning, short presentations, discussions and software experiments in the context of text digitisation and processing. 28/03/18 - 19:00 at WORM
OuNuPo, Ouvroir de Numérisation Potentielle, the workshop of potential digitisation
From January until the end of March 2018 the practitioners of the Media Design Experimental Publishing Master course (XPUB) of the Piet Zwart Institute, in collaboration with Manetta Berends & Cristina Cochior (Varia) and the WORM Pirate Bay, have set sail on the vast sea of DIY book scanning, feminist research methodologies, constraint writing, algorithmic literature and the cultures of text digitisation and processing.
The term OuNuPo is derived from OuLiPo (Ouvroir de littérature potentielle), founded in 1960. OuLiPo is a mostly French speaking gathering of writers and mathematicians interested in constrained writing techniques. A famous technique is for instance the lipogram that generates texts in which one or more letters have been excluded. OuLiPo eventually led to OuXPo to expand these creative constraints to other practices (OuCiPo for film making, OuPeinPo for painting, etc). Following this expansion, XPUB launches OuNuPo, Ouvroir de Numérisation Potentielle, the workshop of potential digitisation, turning the book scanner as a platform for media design and publishing experiments.
In the past three months, the XPUB practitioners have used OuNuPo as a means to reflect on several topics: how culture is shaped by book scanning? Who has access and who is excluded from digital culture? How free software and open source hardware have bootstrapped a new culture of librarians? What happens to text when it becomes data that can be transformed, manipulated and analysed ad nauseam?
To answer these questions, the XPUB practitioners have written software and assembled a unique printed reader, informed by critical and feminist research methodologies. The text selection explores the themes of the digital transfer of cultural biases, Techno/Cyber/Xeno-Feminism, oral culture in the context of knowledge sharing, shadow libraries, database narratives, gender and future librarians. The content of the reader will be scanned by a DIY book scanner built in the past months, and processed by different software processes and performances written by the XPUB practitioners, from chat bots to concrete poetry generators and speech recognition feedback loops.
Inside the workshop of potential digitisation:
To approach the workshop of potential digitisation, the following strategy was adopted: two book scanners were built using a variation of the Archivist Book Scanner, a 2014 public domain (CC0 licensed) hardware design developed within the DIY Book Scanner community. Next to that, a unique reader was put together in the form of 6 books on scanning cultures, edited, designed and produced by the XPUB practitioners. Each book is a compilation of 5 to 10 annotated texts addressing a specific question, or topic, relevant to the practitioners. The 6 books are gathered inside a cloth, folded according to the Japanese Furoshiki art of wrapping. Finally, instead of using the book scanner as a mere text scanning and PDF creating apparatus, each XPUB practitioners wrote their own text processing software to echo, reflect upon, or explore further their reading material as a means to articulate through code the two levels of textual interpretation and dissemination: the human and the machine. Using the book scanner and the software they wrote, they will scan and make public the reader, not as one-to-one digital copy like a downloadable PDF file, but as the output of a series of software experiments.
Chapter 1 - Alice StreteTechno/Cyber/Xeno-Feminism + carlandre & overunder
output/carlandre.txt: ocr/output.txt cat $< | python3 src/carlandre.py > $(@) output/overunder: ocr/output.txt python3 src/overunder.py
The Intimate and Possibly Subversive Relationship Between Women and Machines Reader explores topics from women's introduction into the technological workforce, the connection between weaving and programming, and using technology in favour of the feminist movement. One major concept that appears throughout the reader is an almost mystical connection between women and software writing, embedded deep in women's tradition of weaving not just threads, but networks. Does software have a gender?
Echoing to her selection of texts, Alice proposes two software based transformation of her reader: carlandre and overunder. carlandre is a program that generates a pattern inspired by the concrete poetry of Carl Andre, it creates a vertical wave of words whose lengths go from ascending to descending and so on. overunder is inspired by the relationship between weaving and programming, this interpreted language written in Python translates simple weaving instructions into a digital interpretation of weaving on text.
Chapter 2 - Joca van der HorstWho is the Librarian + Reading the Structure
reading_structure: ocr/output.txt ## Analyzes OCR'ed text using a Part of Speech (POS) tagger. Outputs a string of tags (e.g. nouns, verbs, adjectives, and adverbs). Dependencies: python3's nltk, jinja2, weasyprint mkdir -p output/reading_structure cp src/reading_structure/jquery.min.js output/reading_structure cp src/reading_structure/script.js output/reading_structure cp src/reading_structure/style.css output/reading_structure cat $< | python3 src/reading_structure/reading_structure.py weasyprint -s src/reading_structure/print-noun.css output/reading_structure/index.html output/reading_structure/poster_noun.pdf weasyprint -s src/reading_structure/print-adv.css output/reading_structure/index.html output/reading_structure/poster_adv.pdf weasyprint -s src/reading_structure/print-dppt.css output/reading_structure/index.html output/reading_structure/poster_dppt.pdf weasyprint -s src/reading_structure/print-stopword.css output/reading_structure/index.html output/reading_structure/poster_stopword.pdf weasyprint -s src/reading_structure/print-neutral.css output/reading_structure/index.html output/reading_structure/poster_neutral.pdf weasyprint -s src/reading_structure/print-entity.css output/reading_structure/index.html output/reading_structure/poster_named_entities.pdf x-www-browser output/reading_structure/index.html
With Who is the Librarian: The gendered image of the librarian and the information scientist, Joca explores two frequent gender stereotypes: librarianship as a job for women and information science as a male-dominated field. The selection of texts in this reader elaborates on the origin of these stereotypes and the different social status of these professions. This could be the way to answer the question: Who do we want to be the librarian in the future?
Then moving from human interpretation to software interpretation, Joca presents a software, Reading the Structure, that attempts to make visible to human readers how machines, or to be more precise, specific software implementation of text analysis, interpret texts. Computers read a text differently than we do. One of the common methods for software to analyse a text, is to cut the sentences into loose words. Then each word can be labelled for importance, sentiment, or its function in the sentence. During this process of structuring the text, the relation with the original text fades away. Reading the Structure is a reading interface that brings the labels back in the original text. Does that makes us, mere humans, able to read like our machines do?
Chapter 3 - Zalán SzakácsFrom DIY Book Scanning to the Shadow Librarian + ACCP - Analogue Circular Communication Protocol
Zalán's reader, From DIY Book Scanning to the Shadow Librarian, traces back the beginnings of the shadow libraries starting from the Soviet era of Russia and explores its impact on contemporary academic publishing. Amongst other things, the text selection Informs the reader about activists in this field such as Aaron Swartz, the writer of Guerilla Open Access Manifesto and Alexandra Elbakyan, the founder of Sci-Hub.
Where does the message start? Where does the message end? The user is challenged by the coding tool ACCP to discover the rules behind the circular decoding system and decipher the message. Through the programming language Python and the software DrawBot, words are processed and mapped into a spatial graphical system with the 26 characters of the alphabet and the 10 numbers are arranged around a circle. With a radial stencil placed in front of the graphics, it is possible to turn the images back into words.
Chapter 4 - Natasha BertingHow Bias Spreads from the Canon to the Web + Erase / Replace
erase: tiffs hocrs python3 src/erase_leastcommon.py rm $(input-hocr) rm $(images-tiff) replace:tiffs hocrs python3 src/replace_leastcommon.py rm $(input-hocr) rm $(images-tiff)
Natasha's contribution explores the politics of selection, transparency and as Johanna Drucker said, "calls attention to the made-ness of knowledge". Her selection of texts explores how human biases and cultural blind spots are transferred from the page to the screen, as companies like Google turn books into databases, bags of words into training sets, and use them in ways that are not always clearly communicated.
The texts will be processed by the Erase / Replace scripts, which are two experiments that question who and what is included or excluded in book scanning. In each script, what is first scanned affects what is visible and what is hidden in what is scanned in a second stage, so on so forth. The scripts learn each page's vocabulary and favours the most common words. The least common word recede further and further away from view, finally disappearing all together or even replaced by the more common words. Every scan session results in a different distortion, and outputs the original scanned image, but with the text manipulated.
Ultimately these texts and scripts are tools for thinking about how knowledge is mined and presented online, how bias spreads from the Canon to the web, finding opportunities to break open this process.
Chapter 5 - Alexander RoidlScanning the Database + chatbook
chatbook: ocr/output.txt python3 src/chatbook.py oulibot: ocr/output.txt #chatbot based on the knowledge of the scans Dependencies: nltk_rake, irc, nltk python3 src/oulibot.py
In Scanning the database, Alexander offers to navigate in and out of database narratives. His reader looks at how databases are structured and formed, and how the data they hold are classified, and how such structuring and classification leads to bias. It shows how important it is to question the authoritative dimension of databases, by looking closely at what is being scanned, how it is stored, organized and selected.
In response to these questions, Alex proposes an alternative interface to such database, by creating a chat bot that enables the user / viewer to explore the content of scanned material by chatting with the books canner. By adding an explicit layer of software mediation, the experiment questions how knowledge is built and mediated in the age of machine learning.
Chapter 6 - Angeliki DiakrousiFrom Tedious Tasks to Liberating Orality + ttssr->> Reading and speech recognition in loop
ttssr-human-only: ocr/output.txt bash src/ttssr-loop-human-only.sh ocr/output.txt
Angeliki's collection of texts From Tedious Tasks to Liberating Orality- Practices of the Excluded on Sharing Knowledge, refers to oral culture in relation to programming, as a way of sharing knowledge including our individually embodied position and voice. The emphasis on the role of personal positioning is often supported by feminist theorists. Similarly, and in contrast to scanning, reading out loud is a way of distributing knowledge in a shared space with other people, and this is the core principle behind the ttssr-> Reading and speech recognition in loop software. Using speech recognition software and python scripts Angeliki proposes to the audience to participate in a system that highlights how each voice bears the personal story of an individual. In this case the involvement of a machine provides another layer of reflection of the reading process.
OuNuPo was produced as part of a collaboration between XPUB and WORM. The project was developed by the XPUB practitioners (Natasha Berting, Angeliki Diakrousi, Joca van der Horst, Alexander Roidl, Alice Strete and Zalán Szakács) with the support from Varia special guests (Manetta Berends and Cristina Cochior), the WORM Pirate Bay (Wojtek Szustak and Frederic Van de Velde), diybookscanner.eu (Mark Van den Borre) and XPUB staff and tutors (Delphine Bedel, André Castro, Aymeric Mansoux, Michael Murtaugh, Leslie Robbins and Steve Rushton).
OuNuPo-Make Code Repository
Software experiments for the OuNuPo bookscanner, part of Special Issue 5
Natasha Berting, Angeliki Diakrousi, Joca van der Horst, Alexander Roidl, Alice Strete and Zalán Szakács.
git clone https://git.xpub.nl/repos/OuNuPo-make.git
- GNU make
- Python3 NLTK
pip3 install nltk
Sitting inside a pocket(sphinx): Angeliki
Speech recognition feedback loops using the first sentence of a scanned text as input
- PocketSphinx package
sudo aptitude install pocketsphinx pocketsphinx-en-us
- PocketSphinx Python library:
sudo pip3 install PocketSphinx
- Other software packages:
sudo apt-get install gcc automake autoconf libtool bison swig python-dev libpulse-dev
- Speech Recognition Python library:
sudo pip3 install SpeechRecognition
- TermColor Python library:
sudo pip3 install termcolor
- PyAudio Python library:
sudo pip3 install pyaudio
© 2018 WTFPL – Do What the Fuck You Want to Public License. © 2018 BSD 3-Clause – Berkeley Software Distribution
Reading the Structure: Joca
Uses OCR'ed text as an input, labels each word for Part-of-Speech, stopwords and sentiment. Then it generates a reading interface where words with a specific label are hidden. Output can be saved as poster, or exported as json featuring the full data set.
- NLTK packages: tokenize.punkt, pos_tag, word_tokenize, sentiment.vader, vader_lexicon (python3; import nltk; nltk.download() and select these models)
- spaCy Python library
- spacy: en_core_web_sm model (python3 -m spacy download en_core_web_sm)
- font: PT Sans
- font: Ubuntu Mono
License: GNU AGPLv3
Permissions of this license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights. When a modified version is used to provide a service over a network, the complete source code of the modified version must be made available. See src/reading_structure/license.txt for the full license.
Erase / Replace: Natasha
Receives your scanned pages in order, then analyzes each image and its vocabulary. Finds and crops the least common words, and either erases them, or replaces them with the most common words. Outputs a PDF of increasingly distorted scan images.
For erase script run:
For replace script run:
- NLTK English Corpus:
- run NLTK downloader
python -m nltk.downloader
- select menu "Corpora"
- select "stopwords"
- run NLTK downloader
- Python Image Library (PIL):
pip3 install Pillow
- PDF generation for Python (FPDF):
pip3 install fpdf
- HTML5lib Python Library:
pip3 install html5lib
Notes & Bugs:
This script is very picky about the input images it can work with. For best results, please use high resolution images in RGB colorspace. Errors can occur when image modes do not match or tesseract cannot successfully make HOCR files.
carlandre & over/under: Alice Strete
Person who aspires to call herself a software artist sometime next year.
Copyright © 2018 Alice Strete This work is free. You can redistribute it and/or modify it under the terms of the Do What The Fuck You Want To Public License, Version 2, as published by Sam Hocevar. See http://www.wtfpl.net/ for more details.
Description: Generates concrete poetry from a text file. If you're connected to a printer located in /dev/usb/lp0 you can print the poem.
Description: Interpreted programming language written in Python3 which translates basic weaving instructions into code and applies them to text.
- over/under works with specific commands which execute specific instructions.
- When running, an interpreter will open:
- To load your text, type 'load'. This is necessary before any other instructions. Every time you load the text, the previous instructions will be discarded.
- To see the line you are currently on, type 'show'.
- To start your pattern, type 'over' or 'under', each followed by an integer, separated by a comma. e.g. over 5, under 5, over 6, under 10
- To move on to the next line of text, press enter twice.
- To see your pattern, type 'pattern'.
- To save your pattern in a text file, type 'save'.
- To leave the program, type 'quit'.
Description: Chatbot that will help you to write a poem based on the text you inserted by giving you constraints.
- irc :
pip3 install irc
- rake_nltk Python library:
pip3 install rake_nltk
pip3 install textblob
pip3 install Pillow
pip3 install numpy
pip3 install tweepy
- NLTK stopwords:
- run NLTK downloader
python -m nltk.downloader
- select menu "Corpora"
- select "stopwords"
- run NLTK downloader