Live-Writing a book (7): Transcription hell

Photo by rawpixel on Unsplash

I’m writing my third book live on the Internet, and you can follow along! Today: transcribing the dictated first draft!

If you missed the previous parts of this story, the first post describes how to collect ideas. The second post is about outlining. In the third part, we expand the outline using the hero’s journey. In the fourth part, we talk about how to make the plot stronger. In the fifth part, we talk about researching the details. And in the sixth part, I explain how I outline the book.

The transcription is done (finally!) and I have the first draft of the book ready!

This took forever, compared to the other books. At the beginning of April I finished the phase outline and started dictating, and only yesterday I finished the transcription. That’s three months.

Why it took forever

Why did it take so long? There are multiple reasons, none too convincing.

  1. It is the third book of a series that I had simply worked on too much in too short a time. The first two books were written in the first three months of this year, and the third book was planned right after them, in April, and dictated throughout May. This was a bit too much, without time to properly regenerate the creative juices.
  2. In between (partly because of “Rosemary fatigue,” see the previous point), I wrote and published another whole book about Neural Networks and AI (which is related to my daily job). It was quite refreshing to not have to deal with plot and character voice for a change, and to just write down what I knew about a subject. It went incredibly smoothly, and a month later I had my book ready on Amazon. The story of that book is here.

  3. Another factor is the sheer tediousness of transcribing one’s own dictation. There’s no surprise there, because one knows everything one dictated; but also there’s no sense of accomplishment, since the dictation is very raw, and so what one transcribes doesn’t even remotely resemble a book. This can be quite demotivating. Additionally, the recording is full of pauses, false starts, and half-finished sentences that take a long time to listen through without providing any value at the time of transcription. Once I had a whole half-hour recording that was just me trying to find out what would happen next, and failing. Half an hour of transcribing and zero words added to the draft. Other times, I’d transcribe a twenty-minute digression that, in the end, I’d remove from the draft.

  4. Another problem is purely operational. Transcribing, as opposed to writing a first draft, requires the use of both a computer and one’s ears. In this respect, it is limited to particular, selected, precious situations. One can dictate almost everywhere, even while driving or waiting in a queue. One can take notes by hand everywhere one can stand for a moment. One can type everywhere one can sit down. But transcription requires setting up the computer, selecting the file, finding the right position in the file, connecting one’s earphones, and from then on being unavailable for most human communication. Transcription thus cannot be done while, say, keeping an eye on the kids (while one could well outline or write a first draft in this situation), or while sitting with others on a beach or in a living room. It is much more socially isolating than any other form of writing. Thus, it can only be done in moments where one is alone but in a position to use a computer, where the environment is quiet and undemanding, and where one has a relatively long stretch of time to make setting up the whole environment worthwhile (for me, at least half an hour of uninterrupted time). This severely limits the opportunities for transcription.

Okay, I’ll stop whining now. It took three months, so that’s what it took. Time to move on.

Book dictation stats

Here’s how some other stats of the book’s dictation worked out.

The whole first draft, finally typed up, is 23367 words without the scaffolding and outline. In a previous post I had described my phase outline. This I removed when I finished the first draft.

The recording of the first draft itself consisted of 13 audio files (of which one did not actually contain any useful text). The file lengths were (minutes:seconds):

21:43, 28:55, 33:38, 67:23, 21:39, 12:09, 17:08, 42:53, 36:45, 18:27, 21:38, 11:11, 20:21.

These add up to 353 min (5.9 hours) of dictation for 23367 words, or 66 words per minute, or 3972 words per hour. This is, of course, a great speed, which I could never achieve by typing a draft directly.
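As a quick check of these figures, here is a minimal sketch that adds up the file lengths listed above and derives the dictation speed. The variable and function names are made up for illustration, and the total is truncated to whole minutes, which is an assumption about how the numbers in the text were rounded.

```python
# Recording lengths as listed above (minutes:seconds).
LENGTHS = [
    "21:43", "28:55", "33:38", "67:23", "21:39", "12:09", "17:08",
    "42:53", "36:45", "18:27", "21:38", "11:11", "20:21",
]
WORDS = 23367  # size of the typed-up first draft


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


total_seconds = sum(to_seconds(length) for length in LENGTHS)
total_minutes = total_seconds // 60  # whole minutes (assumed rounding)
words_per_minute = WORDS / total_minutes
words_per_hour = words_per_minute * 60

print(f"{total_minutes} min = {total_minutes / 60:.1f} hours")
print(f"{words_per_minute:.0f} words/min, {words_per_hour:.0f} words/hour")
```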

Of course, not all are useful words, and there are big pauses where nothing happens. In fact, at least 1/3 of the time is empty space. The recording does not stop, because I often dictated in the car or on the street with background noise, and so the recording goes on although I’m not talking. In reality, therefore, the effective rate while actually speaking would be more like 6000 words per hour (assuming 1/3 empty time on the recording). But then, the transcription time has to be added to that. If I manage 20 words per minute, this would add roughly another 19.5 hours for the transcription, or about 3.3 hours of transcription for each hour of dictation. If one could transcribe at dictation speed, this would be reduced accordingly, but I am not fast enough.
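The estimate in this paragraph can be recomputed from the word count, the total recording time, the assumed share of silence, and the assumed typing speed. This is only a rough sketch: the constant names are invented for illustration, and the one-third silence and 20 words per minute are the guesses from the text, not measurements.

```python
WORDS = 23367               # words in the typed-up draft
RECORDED_HOURS = 353 / 60   # total recording time
SILENT_SHARE = 1 / 3        # estimated share of the recording with no speech
TYPING_WPM = 20             # assumed transcription speed in words per minute

# Speaking rate once the silent stretches are left out.
speaking_hours = RECORDED_HOURS * (1 - SILENT_SHARE)
effective_words_per_hour = WORDS / speaking_hours

# Time needed to type up all the words at the assumed speed.
transcription_hours = WORDS / TYPING_WPM / 60
hours_per_dictated_hour = transcription_hours / RECORDED_HOURS

print(f"effective speaking rate: {effective_words_per_hour:.0f} words/hour")
print(f"transcription time: {transcription_hours:.1f} hours")
print(f"transcription per hour of dictation: {hours_per_dictated_hour:.1f} hours")
```

Changing `TYPING_WPM` shows how much a faster typist would save; at dictation speed the ratio drops to one hour of transcription per hour of recording.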

The mean length of the recording files is 27 minutes, close to the median of 22 minutes, which reflects the fact that I dictated mostly while driving to the office, a 20-minute drive (with a few minutes of dictation before and after, sitting in the car). Only in a few exceptional situations did I dictate for longer or shorter times.

First draft structural problems

If I look at the story structure in the dictated raw draft and where the various plot points fall, I get the following picture:

  • 1%: Introduction. The characters are introduced. References to other volumes and the backstory.
  • 38% (should be 20%): The problem arises. The bad guy or bad thing appears.
  • 55% (should be 40%): Identification of what needs to be done.
  • 69% (should be 60%): Coming up with solutions.
  • 73% (should be 80%): Execution, climax, twist, surprises, and finally success.
  • 96%: Happy end, reward.

It is easy to see how the dictation influences structure. In the beginning, perhaps more than if I had typed the text directly, I am still looking for the story myself, trying to describe things in more detail so that I can see them in my own mind’s eye. So getting into the story takes about twice as long as it should, with the introduction being double its desired length. The second part (“problem arises”) is roughly right in length. After that, the final parts get progressively shorter (too short). The third part is only 14% of the book (assuming they should all be roughly equally spaced at 20%), the fourth part is only 4% (!); but the fifth part (“execution”) again is longer (a bit too long). The “reward” part is fine. That’s not supposed to be a long section.
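The part lengths quoted here follow from the positions of the plot points: each part runs from its own starting point to the start of the next one. A small sketch of that subtraction, with illustrative names and the percentages taken from the list above:

```python
# Where each part starts, as a percentage of the raw draft.
STARTS = [
    ("introduction", 1),
    ("problem arises", 38),
    ("identification", 55),
    ("solutions", 69),
    ("execution", 73),
    ("reward", 96),
]
END = 100  # end of the draft


def part_lengths(starts, end=END):
    """Length of each part: start of the next part minus its own start."""
    boundaries = [position for _, position in starts[1:]] + [end]
    return {
        name: next_start - start
        for (name, start), next_start in zip(starts, boundaries)
    }


lengths = part_lengths(STARTS)
for name, length in lengths.items():
    print(f"{name:15s} {length:3d}%")
```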

In re-working this into a viable first draft, I must definitely cut the introduction down to about half its present length. I could distribute some of that exposition to the other, later parts of the book. It’s okay if the in-between parts are somewhat shorter and the execution longer, but an overly slow introduction will kill any reader interest.

Dictation and Google voice typing

I wrote my Neural Networks book entirely using Google’s “voice typing” feature in Google Docs. It went reasonably well, and I had hoped that I might be able to use it again. Unfortunately, this time it didn’t work at all.

One thing is that in writing the Neural Networks book, I was dictating text that I had often presented in lectures in exactly the same way. So I could talk as if I were lecturing about the topic, and this is exactly how Google voice typing likes its input: slow but steady, articulated, in fluid sentences.

With the Rainforest book, I was thinking about the scenes while I was dictating, picturing things in my mind for the first time. So there were often minute-long breaks between sentences or even between words: pauses in which I was thinking about how to proceed with an image or a sentence. This completely freaked out Voice Typing, which started inserting random periods where I took too long to speak; this, in turn, freaked me out, as I felt harassed to talk faster in order to keep Voice Typing happy. Not a good thing.

In an attempt to improve accuracy, I tried different automated transcription services. Some seemed promising, but. After. Transcribing. My text. It. All. Looked. Like. This.

At least, this was one problem Google did not have (although it does drop in randomly capitalised words).

Here’s a sample transcript, directly from the dictated text:

“Inside wear 10 beds, 5 on each side, with a small corridor in the middle period this is enough for us that, sister Teresa said. Most of the time these bad stay empty. Pajama Mama don’t like to come to this Hospital period even if they are sick or injured they prefer to stay in their Village and to be taken care of by their own medicine man rather than to come here. It is only the Young Who once after they have gone to school with us, understand that it is important to come here.”

There’s a lot wrong here, and needlessly wrong. Google’s splendid AI empire could surely do a better job than that.

  • “wear” instead of “were” makes no sense grammatically here. Google could know this.
  • “period” is not recognised as punctuation most of the time. Sometimes it is, but the effect is too random to be reliable.
  • “Most of the time these bad stay empty.” Again, a simple grammar analysis of this sentence should be able to fix that. Clearly, we are talking about “these beds.”
  • “Pajama Mama” is a unique word in the transcription that appears nowhere else, and this already should tell Google that something is wrong. I cannot imagine why this would even be a known word to Google. Surely not many people dictate that? — Granted, “Yanomami” is not more common, but it appears in every second sentence in this text, and after correcting it a few times, Google’s voice typing should have picked it up.
  • Why is “Village” capitalised there? Makes no sense.
  • The “Young Who”: is this supposed to be a rock band revival name?

And it goes on and on like that. I count around 18 errors requiring intervention in 6 lines, or about 3 per line. This makes voice typing not worth the effort. Cleaning up such a transcript is almost as much work as typing it directly from the audio file oneself.

Here’s another parting impression of voice typing’s results:

“When it rains we will sometimes come here search sister Teresa. To this Hospital. Patience love it because it’s a distraction. For a while they are not alone and for a few hours they can listen to the lectures and they can also learn something. And here is our only patient she said period she went over to the only that that had someone inside. The boy off around 10 or 12 years perhaps period not much older than Mary and rose themselves period”

It’s easy to see how most of those errors could have been avoided by even a simple statistical, predictive algorithm, like the one Google uses to autocorrect search terms in its search engine.
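Some of the most repetitive errors in the samples above could in principle be patched by a small cleanup pass over the raw transcript. The following is only a minimal sketch of that idea, not the statistical approach just mentioned: the `CORRECTIONS` table and the `clean_transcript` function are hypothetical, and treating every spoken “period” as a full stop is a naive assumption that would also hit legitimate uses of the word.

```python
import re

# Hypothetical, hand-maintained table of mis-hearings seen in the samples.
CORRECTIONS = {
    "Pajama Mama": "Yanomami",
    "these bad": "these beds",
}


def clean_transcript(text: str) -> str:
    """Patch a few recurring voice-typing errors in a raw transcript."""
    # Turn a dictated "period" into a full stop (naive: every occurrence).
    text = re.sub(r"\s+period\b", ".", text)
    # Apply the known corrections one by one.
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    # Capitalise the first letter after sentence-ending punctuation.
    return re.sub(
        r"([.!?]\s+)([a-z])",
        lambda match: match.group(1) + match.group(2).upper(),
        text,
    )


sample = "with a small corridor in the middle period this is enough for us"
print(clean_transcript(sample))
```

A table like this only catches errors that have already been seen once, so it reduces the cleanup work rather than replacing it.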

Other transcription services

Yes, I know that there are more and better software solutions for voice transcription out there, with the Dragon family of products probably the most prominent ones. But I use Linux on the desktop and Android on the tablet, and I’m not willing to buy and install Windows just to do a transcription. Of the online transcription services, I tried all that I could find that were free or cheap and automated. They were all a lot worse than Google voice typing, which is to be expected. After all, Google is one of the technological leaders in the world of AI and voice recognition, and a small company doesn’t have much of a chance to be better than them.

Human transcription, on the other hand, would probably perform quite well, but it is too expensive for the beginning novelist. A book already requires quite a bit of an investment in production costs for two editing passes (developmental and line editing), cover, and, if it’s a children’s book, interior illustrations; plus marketing, ads etc. As it is, each book in the Rosemary series costs me more than 1000 USD to produce (and I’m essentially getting the covers for free, because I design them myself based on the interior images of the books). It would be madness to add another 500 or so for a human transcription, when the income from the books is close to zero.

To dictate or not to dictate?

So that’s how I do it. YMMV, as they say. For each book, I’m always tempted to drop the dictation/transcription step and go directly to typing the first draft — it would be so much easier, it seems. But then I remember why I dictate in the first place: because then I can tell the story quickly to myself, in one go, without stopping, without getting stuck, without going so slow that I have time to think (and overthink) what I’m writing. The dictation frees my mind to wander around, to take tangents, to back up and try different paths and perspectives in the story. True, all these different paths are a pain in the behind to transcribe later, but without them the writing itself would be much poorer, the story much more shallow, and it would be missing most of the fun bits that come to me only while I speak the book to myself.

And not to forget the fabulous speed of 4k an hour or more.

So I guess that I’ll have to stick with that method, even if it means slaving through the transcription later.

Join me next week for another post on what I learned from preparing this book (and the whole series) for publication.

If you wish to subscribe (no spam, only a notification when there are new posts on my blog), please go here.

Thanks for reading!