by Ben Smith on 5th August 2009

Along with a number of other bloggers and ‘proper’ journalists, we went to SpinVox HQ in Marlow yesterday to get ‘hands on’ and ‘see for ourselves’.  The session – initially planned to run for an hour – wrapped up after around 90 minutes with the arrival of a second group who were to see the same presentation.

Please see this post for information about The Really Mobile Project co-founder James Whatley’s position with SpinVox and his current leave-of-absence from The Really Mobile Project.  We have made a detailed disclosure statement in this post. If you are concerned this post is affected by a conflict of interest for the authors, other coverage of the session Ben and Dan attended is available from Techcrunch UK and The Register.

We’ve included far more detail below, but for the attention-deficit crowd here are the key points:

  1. The presentation and demo made no attempt to address the most serious allegations against SpinVox.
  2. The demo, even under controlled circumstances, failed to demonstrate anything more than very basic automated transcription. Most messages went to the human transcriber, whole.
  3. The system we saw showed a ‘build date’ of 29th July so may not be one that has transcribed many (if any) real customers’ messages.

SpinVox, to us, looks to have developed a very competent tool to assist human transcribers.  Nothing we were shown convinced us the automatic speech recognition added much benefit at all.

We are amazed they believed this demonstration would support their claims and even more amazed they chose only to focus on the technical.

The longer version…

The event had been billed as a technical demonstration and we felt increasingly uneasy as we chatted in the taxi there.  We had drawn up a list of questions from readers and other coverage, and very few of the key issues really related to ‘how it works’.  Far larger concerns were security, scalability, the state of the business and – crucially – the question ‘why has this happened?’  In a world of disgruntled ex-employees and aggressive competitors, why was SpinVox the one landing all the negative press in some of the largest mainstream media?

What did we see?

The presenter, CIO Rob Wheatley, placed a significant emphasis on the level of detail he was about to share with us.  It was, he said, more than had ever been released outside of an NDA – we shouldn’t take pictures of the slides or record the session. We didn’t, but I’m at a loss as to why this restriction was applied – it served only to make collecting an accurate record of what happened more difficult…

The session that followed described the problem for voice-to-text services: People produce high quality output, but are expensive and slow. Computers produce low quality output, but are cheap and fast.  It was, Rob said, SpinVox’s unique approach to blending the two that was key: “No one else can do this at this scale”.  Later he explained that there were significant benefits from specialising in the ‘voicemail domain’ – it provided a common type of speech and message.  With over 130 million messages converted to-date, this provided a huge ‘corpus’ for the system to draw on.

What followed was a high level summary of the logical components of the SpinVox solution – from the initial Digital Signal Processing, through Automatic Speech Recognition, Quality Checking, language recognition, improvement by a human operator and the presentation of the text output.  Interesting, but little you couldn’t have guessed at… almost entirely reproducible if you’ve read the public patents.

Focusing on the human-assisted element, he introduced ‘Tenzing’ – the call centre operator’s application to correct or replace text the system is not confident in.  During automatic recognition a map of probable word options is generated and the highest-probability text is passed to the human transcriber.  As the transcriber reviews the message they over-type incorrect words.  If these match lower-probability options on the map, the surrounding words are also replaced with that alternate ‘path’.  A general set of language rules also offers word and phrase suggestions based on the surrounding words as the operator types – appearing similar to predictive text.  Only as much of the message as needed assistance, we were told, would be sent to the human transcriber (hold that thought… more on that in a minute).
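To make the ‘alternate path’ idea concrete, here is a minimal sketch in Python.  It is our own illustration, not SpinVox’s code: we assume the ‘map’ can be approximated by a short list of equal-length candidate transcriptions, each with a probability, and every name in it (EXAMPLE_HYPOTHESES, best_path, apply_correction) is hypothetical.  A real recogniser would more likely hold a word lattice and would only swap the words near the correction.

```python
from typing import List, Tuple

# Hypothetical stand-in for the "map of probable word options":
# each entry is (probability, words). All hypotheses have the same
# length to keep the sketch simple; a real lattice would not.
Hypothesis = Tuple[float, List[str]]

EXAMPLE_HYPOTHESES: List[Hypothesis] = [
    (0.5, ["the", "state", "was", "grate"]),
    (0.3, ["the", "steak", "was", "great"]),
    (0.2, ["a", "steak", "is", "great"]),
]


def best_path(hypotheses: List[Hypothesis]) -> List[str]:
    """Return the words of the highest-probability hypothesis."""
    return list(max(hypotheses, key=lambda h: h[0])[1])


def apply_correction(
    hypotheses: List[Hypothesis],
    current: List[str],
    position: int,
    typed_word: str,
) -> List[str]:
    """Over-type one word and, if possible, switch to a matching path.

    If any hypothesis has `typed_word` at `position`, the most probable
    such hypothesis replaces the whole text (so surrounding words change
    too). Otherwise only the typed word is changed.
    """
    candidates = [h for h in hypotheses if h[1][position] == typed_word]
    if not candidates:
        corrected = list(current)
        corrected[position] = typed_word
        return corrected
    return list(max(candidates, key=lambda h: h[0])[1])


shown = best_path(EXAMPLE_HYPOTHESES)
fixed = apply_correction(EXAMPLE_HYPOTHESES, shown, 1, "steak")
```

With the made-up data above, the operator first sees ‘the state was grate’; over-typing ‘state’ with ‘steak’ pulls in the second candidate, so ‘grate’ becomes ‘great’ without further typing.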

What we didn’t see

After a long time on the PowerPoint, we moved on to the demo.  Using a test system, the projector displayed a continuously-updating view of what it was doing, logging each step after receiving a message.  Some of the steps were repeatedly displayed in the wrong order, which seemed odd for a demonstration trying to convince us it wasn’t faked.  On the desks, SpinVox employees showed the resulting e-mails and the Tenzing application on two laptops.

The first two messages were recorded by Rob and were processed by the system in seconds – the screens showed the processing steps as they went.  He then offered us his BlackBerry and Dan Lane recorded a message – it was transcribed in about 6 seconds, making one mistake as it replaced ‘steak’ with ‘state’.  I can’t prove it, but I do believe it was automatic – the message was completely processed in less time than the message itself lasted and the on-screen log presented matching, timed steps throughout.

Next, Milo Yiannopoulos from Techcrunch tried, leaving a fast message including his name, the word ‘Techcrunch’ and a telephone number.  After a longer delay, the message arrived at the transcriber’s laptop, unrecognisable.  The operator listened to it several times, scrolling back and forth through the entire audio of the message, eventually writing all of the message by hand and settling on ‘Ianopolis(?)’ as his name.  The outcome was passable, but entirely human transcribed.  Rob suggested that the operator would not typically attempt more than 3 ‘listens’ to a phrase.

SpinVox staff then left a variety of messages of varying complexity.  Although not as simple as Rob’s original messages, they felt realistic in content and pace… the room was mostly quiet with occasional conversation in the background – very generous testing conditions.  The vast majority of the messages I observed were passed to the human operator from then on. Of those referred, some contained correctly recognised elements, but all needed significant changes.

We observed that when messages required human intervention they were being passed to the human transcriber in their entirety (albeit, without the caller or recipient’s number) – at odds with previous claims – and asked about SpinVox’s assertion that only the parts of a message needing assistance would be referred to a person.  Several attempts to demonstrate this were made, but only in one case did it appear that an initial ‘hello’ had been omitted – the remainder of the message (and all of the real content of the message) was passed to the employee.

During the demonstration, we noticed that the Tenzing application looked similar, but not identical, to the screenshot previously leaked on Facebook.  The version number was shown in the top left-hand corner of the application:

3.0.1 U 20090729

Later we realised what the second number showed… It seems the version we saw may have been created only a week ago.  Far from conclusive proof, but a worrying suggestion that at least part of what we were being shown was not the same technology that had been converting customers’ messages…
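For anyone wanting to check the arithmetic, here is a small Python sketch.  It rests on our own assumption that the trailing number of the version string is a build date in YYYYMMDD form; the function name build_date_from_version is ours, and nothing SpinVox told us confirms the format.

```python
from datetime import date, datetime


def build_date_from_version(version: str) -> date:
    """Read the last space-separated token as a YYYYMMDD date (assumed format)."""
    stamp = version.split()[-1]
    return datetime.strptime(stamp, "%Y%m%d").date()


build = build_date_from_version("3.0.1 U 20090729")
demo_day = date(2009, 8, 4)  # the day before this post was published
age_in_days = (demo_day - build).days
```

On that reading, the build is dated 29th July 2009, six days before the session.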

The rest of the questions…

CEO Christina Domecq interrupted proceedings briefly, inviting one question which she used to confirm £15m funding from existing investors and to repeat her previous claim that she expects positive cashflow and to scale from 30 million to 100 million users within 90 days. She rejected a claim that the firm was spending £3.5m a month, but refused to say what the figure currently was.

Aside from this there was little time for questions away from the technology.  We clarified that call centre staff work on secure ‘locked down’ workstations without working USB ports or internet browsers and that staff were background-checked prior to working on SpinVox’s systems.  They’re also required to hand-in their own mobile phones before starting work.

No-one knew if the audio and images posted on Facebook, allegedly by an Egyptian call centre during a training and assessment period, were real customers’ data.  We were told it was training data, but no immediate confirmation that customers’ old messages were not used for training was available (SpinVox’s PR team are checking the answer).

SpinVox have previously claimed they retain voicemail data for 90 days before removing it and we queried how this related to the use of stored messages to ‘train’ the system. Rob told us the retention length varied depending on the deals in place with each operator. However, he was unable to tell us how long messages left for direct SpinVox customers were held – this seemed odd given the people present included the most senior staff responsible for developing and managing the system.

We confirmed that whilst SpinVox holds an ISO27001 security accreditation this covers only their own business and interactions with the 3rd party call centres – these suppliers are not themselves accredited, although SpinVox carry out their own inspections.

We were told that SpinVox actually holds 18 patents, although only 2 of these have been published, suggesting the other 16 have only recently been granted. They have a further 71 pending, but claimed they did not intend to patent all of the innovations they had developed.

A number of senior SpinVox staff claim their own voicemail messages are 100% machine converted. This is true, but only because they are using a prototype system designed to prove that messages can be transcribed without human assistance. We weren’t given any indication that SpinVox intend to offer this system to the public or any details on the quality of the transcription produced.

Our conclusion?

That the system appears to refer all but the most basic messages to transcribers and these messages are reviewed in their entirety with few exceptions.  If you see a word followed by a question mark or three underscores (meaning ‘inaudible’ or ‘incomprehensible’), these have been produced by a person.  SpinVox’s explanation that the database of British-English messages they have to ‘teach’ the system with is the smallest of all those they offer, due to the lack of UK carrier deals, may explain the high referral rate, but also indicates this is representative of live operation for UK customers.  Previous coverage quoting figures such as 97% accuracy must now be viewed in the light of significant human intervention.

The most serious allegations about SpinVox remain unanswered and we are dubious that some of the technology we saw is the same as that serving customers today.

Ben Smith & Dan Lane

