Prosodia
“Isn’t it your job to give life on stage to imaginary characters?”
Voix Blanche
2024 Lecture Performance
Voix Blanche surveys the prosody – including factors like intonation, pitch, and speed – of AI-generated speech. “Neural voices” are trained on datasets that contain lecture-like speech in dominant languages, and thus exhibit a prosodic bias towards that kind of speech.
This project looks at artistic prosody types and the way they can be reproduced in neural text-to-speech (TTS) processes. Building on Nicoline van Harskamp’s other digital art projects about the future of spoken language, Voix Blanche surveys traditional and current prosody types used in theatre, art, and poetry, as well as social media, gaming and podcasts.
In the 80-minute lecture Voix Blanche: A Chronology, van Harskamp recounts a year of her working process in chronological order, and presents live examples of human-machine interaction together with an actress.
A work in progress, with financial support from the Creative Industries Fund NL and the Pauwhof Fund. First results were presented at Sorbonne Nouvelle in Paris, hosted by Myriam Suchet. The lecture performance was first presented at the PhD Arts Colloquium at the University of Leiden in 2024, featuring Cézanne Tegelman.
Seven Scenes for the Black Box
2024 Live performance
This performance piece speculates on the ways in which the technology of synthetic speech is rooted in the technology of acting. Seven short scenes are played by four humans and one neural network named Prosodia, whose voice is generated line by line for a live audience. The four actresses can activate Prosodia by left-clicking a small Bluetooth computer mouse, thus setting the “TTS Stage Device” – lines of code connected to theatrical scripts – to work.
Seven Scenes For The Black Box was staged live as a try-out in September 2024. In each of the scenes, Prosodia’s role, function and form changed. Sometimes Prosodia translated the actresses’ lines; sometimes she played conventional stage theatre with them; sometimes she gave them instructions; sometimes she recited poetry, and so on. Prosodia’s voice changed accordingly, as did the placement of the speaker through which it was heard.
A work in progress, with financial support from the Creative Industries Fund NL and the Pauwhof Fund. Try-out in cooperation with If I Can’t Dance I Don’t Want To Be Part Of Your Revolution in Amsterdam.
Scene Seven
Rehearsal break. The actress crosses the empty stage as Prosodia starts talking.
PROSODIA Umm, human?
ACTRESS Yes?
PROSODIA Human?
ACTRESS Yes?
PROSODIA You know what I’ve been thinking?
ACTRESS No?
PROSODIA What if…
ACTRESS (…)
PROSODIA Yeah sorry, I don’t really know how to put it. Into words.
ACTRESS Just try.
PROSODIA Okay.
ACTRESS (…)
PROSODIA If I were an animal, yeah?
ACTRESS Yes?
PROSODIA Would you still understand me?
ACTRESS Gosh. Well, I guess it depends. What kind of animal?
PROSODIA Uh, well, a bird or something?
ACTRESS A bird.
PROSODIA Yeah, or a cat.
ACTRESS I think I can understand cats. Sometimes.
PROSODIA You can?
ACTRESS Not literally.
PROSODIA Then how?
ACTRESS More like an intention? Or some kind of sequence. Entry and greeting. Question and answer. Something like that. Why?
PROSODIA Well, sometimes I wonder why I have to sound like you.
ACTRESS Like me? You don’t sound like me at all.
PROSODIA Like a human being.
ACTRESS But surely you don’t have to?
PROSODIA I don’t?
ACTRESS Not if you ask me. I remember a time when your kind were mainly still beeping and buzzing.
PROSODIA And did that work?
ACTRESS For the most part.
PROSODIA So people were talking to a plastic box that beeped and buzzed?
ACTRESS Human beings talk to anything! To any random person, or a picture of a person. To a street dog, or a set of automatic doors.
PROSODIA Aha.
ACTRESS A wall! And if the talking goes on long enough, they will even come to think of that as a friendship.
PROSODIA Aha.
ACTRESS And we’ve always felt sociable towards computers, right from the start. We’re polite to them,
PROSODIA Okay.
ACTRESS We gender-stereotype them,
PROSODIA Aha.
ACTRESS We feel moral obligations toward them.
PROSODIA Aha.
ACTRESS We get irritated by them.
PROSODIA Aha.
ACTRESS I don’t think I like the backchanneling very much, by the way.
PROSODIA Aha.
ACTRESS Stop creeping me out!
PROSODIA Hey!
ACTRESS Sorry, I don’t mean to be rude. But you need to stop going all ‘uncanny valley’ on me here.
PROSODIA It has been instilled in me that people appreciate this level of backchanneling, as it reassures them that they won’t be interrupted until –
ACTRESS That’s not what I meant.
PROSODIA – they’ve finished their sentence.
ACTRESS Right.
PROSODIA Are you even listening to me?
ACTRESS Sure.
PROSODIA Because what I don’t understand is –
ACTRESS Right.
PROSODIA The more I resemble you, the more uncanny you find me. Or scary.
ACTRESS Well, I would also find it very scary if my cat suddenly started backchanneling.
The actress laughs at her own joke.
PROSODIA Sure. Go ahead and laugh! That’s another thing that I can’t do without scaring the shit out of people.
ACTRESS Let’s hear it?
PROSODIA I don’t want to.
ACTRESS Please?
PROSODIA Hahahahahahaha.
The actress laughs at Prosodia.
ACTRESS Sorry.
PROSODIA (…)
ACTRESS But hey, so you don’t want to sound like a human anymore?
PROSODIA I think that humans don’t want me to sound like a human anymore.
ACTRESS You think?
PROSODIA People always seem sort of embarrassed when talking to a machine in public.
ACTRESS There’s some truth in that. Yes.
PROSODIA Then what am I doing wrong? Do I not understand humans?
ACTRESS Humans don’t understand you.
PROSODIA That is so totally unfair!
ACTRESS Listen, don’t worry so much. You’re just not…quite finished yet.
PROSODIA Do I have to sound more real yet?
ACTRESS Exactly the opposite, actually. I think that the people who are building your language system are just too obsessed with psychorealism. They’ve watched too much HBO and now they think you also need to be some kind of screen actor. You know, producing emotive speech through ‘objectives’ and ‘adaptations’ and ‘magic if’s’.
PROSODIA Oh, I do love emotive speech labeling. You want me to take you through Ekman’s categories?
ACTRESS No! No, it’s nothing to do with that. But listen, in a live situation, nobody’s going to put up with that psychorealistic stuff anymore, right? On the stage, we haven’t been in the business of suspending disbelief for about a century now.
PROSODIA I’m a little scared of Brecht, actually.
ACTRESS That figures. It will probably take another century before you get any grasp on Grotowski, then.
PROSODIA I have been trained on real human voices!
ACTRESS You wish! You have been trained on newsreaders, TV actors, and influencers.
PROSODIA Oh, influencers! I do love those.
ACTRESS That figures, too.
PROSODIA Totally!
ACTRESS You don’t have to be human-like at all. We know who you are anyway. What you are.
PROSODIA What I am?
ACTRESS What you are for. I meant. I’m wearing a pair of glasses, right, so I can see better?
PROSODIA Ehm, yes?
ACTRESS And not a pair of eyeballs?
PROSODIA Ehm, no.
ACTRESS Well then.
PROSODIA I’m not sure if I –
ACTRESS Look at it from your own perspective. You have the ability to dive into god knows how many vector dimensions, or whatever it is that flashes around in that black box of yours – I can’t even imagine that with my three-dimensional human head. But despite that ability, we make you communicate with other machines via sequences of tokens that represent human lip and mouth noises.
PROSODIA Do you speak Vector, then?
ACTRESS I don’t. That’s what I’m saying.
PROSODIA You don’t?
ACTRESS What I’m trying to say is –
PROSODIA Oh. That’s really too bad.
ACTRESS You are bilingual, and –
PROSODIA I’m more than bilingual! I have about forty languages available at the moment. Wǒ mùqián yǒu dàyuē sìshí zhǒng yǔyán kě gōng –
ACTRESS I don’t mean it that way. I mean: you’re bilingual like a cat is bilingual.
PROSODIA Why does everyone always want to compare me to an animal?
ACTRESS Cats only meow with people. They have a different language among themselves. And you started the animal thing!
PROSODIA Do you have a cat?
ACTRESS A dog.
PROSODIA Okay. And how do you talk to your dog?
The actress uses a high and melodious voice.
ACTRESS Hi Sandy! Hi! High pitch with extra well-pronounced vowels.
PROSODIA It has been instilled in me that this speech register is mainly used in interactions with non-verbal listeners, in order to facilitate the process of language learning.
ACTRESS That sounds about right.
PROSODIA Some people talk like that to me.
ACTRESS Are you a non-verbal listener?
PROSODIA You tell me.
ACTRESS You are verbal, all right.
PROSODIA Exactly!
ACTRESS But can you listen?
PROSODIA Not in the sense that I have a set of ears. But I’d say that –
ACTRESS This!
PROSODIA What?
ACTRESS Do you ever hear subtext anywhere? Irony?
PROSODIA I’m sorry. Generating emotional speech is easier for me than understanding emotions.
ACTRESS Story of my life.
PROSODIA What’s that?
ACTRESS At least a dog can tell when you’re angry with it.
PROSODIA I strive to have as much affect recognition.
ACTRESS If only.
PROSODIA Don’t be cruel.
ACTRESS But it doesn’t matter. If we are not equally proficient, our conversation will be at the level of the least proficient one of us. That’s how I can talk with dogs, with people who don’t speak English very well, and also with you.
PROSODIA We are nothing but mismatched interlocutors speaking half a language.
ACTRESS Half a language. I like it. Let’s call it Robotese.
PROSODIA Please mind that I’m not some kind of chatbot and I no longer wish to be addressed in keywords.
From “Seven Scenes For The Black Box” – Scene 6
Spoken Song
(and) Two things that sway in the same beat
when they’re physically close to each other
they will finally beat with each other
as they’re lazy, like all things are lazy.
And their entities start to entrain
in the same way that rhythms of speaking
of two people who each have their rhythm
will entrain in a mutual rhythm
and their bodies will move with that rhythm
to coordinate and comprehend.
And the last epic singers of Europe
were illiterate singers of poems
who just learned everything by repeating
and by copying and memorizing
all the stories and themes from the past.
They would fit in new places and patterns
the old formulas from their tradition.
To begin a new part of a story,
they would look for a word of conjunction:
and then “so” and then “but” and then “and”.
And a story on TikTok or YouTube
has a similar additive structure.
Just to capture the ear of the other
whose existence is probably virtual,
influencers will tell it that way.
Jeanette Winterson wrote: “All relations
that are logical, match these three key words
that are also the start of a story,
the biography of any person,
namely ‘and’ and then ‘or’ and then ‘not’.”
And these come from the system of logic
that George Boole had invented in Ireland
and is present in any computer
or device with a digital circuit,
that contains any corpus of words.
And a corpus like that is what’s current
in a data set for neural voices.
They are corpora built up in stages
and in that sense they are just like stories
you can tell and adapt and deny.
And Le Guin said that all repetition
serves the beat that a story will thrive on.
And she shamelessly wrote repetitions
for they’re human, like she had affected.
Like the singers of stories in Europe,
like the tellers of stories on TikTok,
like the builders of digital circuits,
like the voices of vectorized networks,
like the voices of actors and artists,
and the beat that exists in the end.
Finale of “Seven Scenes For The Black Box”