Interactive voice editing of text using new speech technologies from Yandex: Yandex.Dictation turns speech into text

  • 06.04.2020

Hello, dear readers! Before you is the most unusual article on our blog, because voice typing was used to write it. So today we will discuss how to type text with your voice.

Voice typing is a method of entering text by speaking into a microphone. The topic is especially relevant for those who work with large volumes of text, such as bloggers, as well as for people with disabilities and for those who have not yet mastered the computer keyboard.

Voice typing services

There are services that work online, and there are programs that are installed on a computer.

Online Services

The first is a free tool developed by Google for Chrome, which accordingly works only in that browser. I don't think this will be a problem, because Chrome is an excellent browser; if you are not using it yet, read our earlier article about it. The voice notepad can be installed directly in the browser, or you can use voice typing by going to the service's website.

The next service is similar to the previous one and also works only in Google Chrome. Using it is elementary: select the desired language, and the application types the text as you dictate.

The advantages of this free service are voice prompts and the ability to view alternative recognition variants. There is also a convenient editor with which you can copy the resulting material, print it on a printer, translate it into foreign languages, or send it by mail.

To type without touching the keyboard, open the "Tools" menu and then click "Voice input…".

Editing and formatting commands are currently available only for English, but the following punctuation commands are supported for Russian:

  • "dot",
  • "comma",
  • "exclamation point",
  • "question mark",
  • "new line",
  • "new paragraph".

As practice has shown, it is very convenient.

Programs

A paid program that, by voice, not only types text on the computer and inserts punctuation marks, but also offers pleasant extras: it can transcribe audio recordings, and it can be expanded with additional dictionaries (for example, legal terminology).

Supported operating systems: Windows 7 and later.

Price: from 1,690 rubles.

Another option is free and can satisfy the needs of most users. Its main attraction is that it recognizes speech in 50 languages. For convenience there are hotkeys, you can choose the audio source yourself, and you can correct the recognized text.

Pros and cons of voice typing

Pros:

  • Thanks to these applications, freelancers can earn good money doing transcription. Many tasks of this kind can be found on the Work-zilla exchange, a favorite place for newcomers to remote work. You just turn on the program and tidy up the text in Word a little later.
  • Saving time and effort.
  • Great find for people with disabilities.
  • For creative people, the above services are a lifesaver: any idea can be quickly written down simply by saying it out loud, so it is not forgotten.

Unfortunately, these services also have drawbacks:

  • If there are extraneous sounds in the room where you dictate, then the recognition of words and phrases deteriorates significantly.
  • Many online applications are only available in the Google Chrome browser.
  • After typing, you need to take time to edit and correct the text.
  • It is necessary to have a high-quality sensitive microphone.
  • It is desirable to have good diction in order to reduce the risk of errors.

Conclusion

Summing up, we can say that technology has come a long way: where everything once had to be typed by hand, it is now quite possible to get by simply by dictating information aloud. Of course, there is no guarantee of perfect recognition, but the progress is obvious.

With the development of applications that greatly facilitate the work of remote workers, you can achieve maximum productivity and complete tasks faster. We hope this article helps you become more effective in your work.

Leave your feedback about the work of various speech recognition services in the comments.

All the best!

Hello, friends! Just recently I described two useful applications: the first is a mobile photo-sharing application, and the second works in the Google Chrome browser. But, as they say, good things come in threes, so I decided to complete the trilogy and introduce you to one more useful thing - the Yandex.Dictation mobile application, which allows you to type text with your voice.

The Yandex.Dictation application is relatively new and is constantly being improved. It will be useful both for schoolchildren and for people of many professions, including bloggers. With it, you can dictate any text and put your impressions and thoughts into words, in order to later transfer it all to paper, shape it into an article, or insert a note into your microblog or personal diary. Since the application is mobile, you can use it in any suitable situation and save time.

In principle, there are many such solutions on the Internet - for example, the Speechpad service, which I have already written about. It is, of course, more popular than Yandex.Dictation, and a Google Chrome extension is available for it, but I have not been able to find a mobile version.

Of course, you will still have to finish the article and insert the necessary links on a computer, but it is still faster than typing all the text by hand. And you don't have to worry about uniqueness.

Finally, I will say that, to be honest, you need a more or less decent microphone to work with the Speechpad service. With Yandex.Dictation there are no such problems, since the microphones in modern mobile phones have excellent characteristics.

The only problem that really affects speech recognition (in any such service!) is the speaker's diction. But this is fixable: diction can be trained with elementary exercises.

Testing the new speech recognition technology from the Russian company Yandex.

Introduction:

Yandex can safely be called the second most popular search service in Russia, and the company is actively working on its own technologies, including speech recognition. Just recently Yandex introduced a new application that is still in testing, but anyone can already try it. The app is called "Yandex.Dictation", and it will definitely surprise you.

Features:

The main screen of the application shows the entire list of entries you have created, sorted by date. If there are a lot of entries, you can use the search to find the one you need. Note that an active internet connection is required to use the application.

To create a new entry, just say the phrase "Listen, Yandex" or "Yandex, record". After that you can dictate whatever your heart desires, and it will simply write down your thoughts. The main thing is not to speak too quietly, and to speak more or less clearly. The only discouraging thing so far is that if you say a short sentence but have not yet finished your thought, Yandex decides that you have already finished, and the next phrase will start with a capital letter. Unfortunately, this behavior cannot be disabled, but Yandex cannot really be criticized for it either, because the application is still being tested and will be supplemented and corrected.

On the main screen, in the sidebar, you will find a very interesting section called "Command Examples". Yes, yes: Yandex can highlight the entire written text or just a word or sentence, and can delete, copy, or read aloud what you have written, along with many other interesting commands.

Results:

In the settings, you can enable or disable sound effects if they get in the way. To summarize: Yandex.Dictation is a great application for quickly writing simple notes, it has huge potential, and, believe me, voice control is very addictive. Happy dictating!

Today our Dictation application for interactive writing and editing of text by voice appeared in the App Store and Google Play. Its main task is to demonstrate some of the new capabilities of the Yandex speech technology stack. It is these interesting and unique aspects of our speech recognition and synthesis technologies that I want to talk about in this post.

A couple of words so that you understand what will be discussed. Yandex has long provided a free mobile API that can be used, for example, for recognizing addresses and voice search queries. Over the past year, we have been able to bring its quality almost to the level at which people themselves understand such queries and remarks. And now we are taking the next step: a model for recognizing free-form speech on any topic.

In addition, our speech synthesis supports emotions in the voice. As far as we know, this is the first commercially available speech synthesis with this capability.

Read on below about all of this, as well as some other SpeechKit features: voice activation, automatic punctuation, and recognition of semantic objects in text.

Omnivorous ASR and recognition quality

The speech recognition system in SpeechKit works with different types of text, and over the last year we have been working on expanding its scope. To do this, we created a new language model, our largest so far, for recognizing short texts on any topic.

Over the past year, the relative proportion of erroneously recognized words (Word Error Rate) has decreased by 30%. For example, today SpeechKit correctly recognizes 95% of addresses and geographical object names, coming close to a human, who understands 96-98% of the words they hear. The recognition completeness of the new model for dictation of various texts is now 82%. At this level, you can build a complete solution for end users, which is what we wanted to show with Dictation.
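
To make the metric concrete, here is a minimal sketch of how Word Error Rate is computed: the edit distance between the reference and hypothesis word sequences, divided by the reference length. The function is ours for illustration, not SpeechKit code.

```python
# Minimal sketch: Word Error Rate (WER) via Levenshtein edit distance.
# This is the standard metric definition, not SpeechKit's internal code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of substituted, inserted, or deleted words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn left on tolstoy street",
                      "turn left on tolstoy straight"))  # 0.2 (1 of 5 words)
```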

Initially, SpeechKit only worked for search queries: general topics and geo-navigation. Although even then we planned to make not just an additional input tool, a "voice keyboard", but a universal interface that would completely replace any interaction with the system with a live conversation.

To do this, we had to learn to recognize any speech, texts on arbitrary topics. So we started working on a separate language model for this, several times larger than the existing geo-navigation and general search models.

The size of the model set new requirements for computing resources. For each frame, several thousand recognition variants are considered, and the more variants we manage to consider, the higher the quality. And the system must work in a stream, in real time, so all calculations need to be optimized dynamically. We experimented and searched for the right approach: we achieved acceleration, for example, by changing the linear algebra library.

But the most important and most difficult part was collecting enough correct data suitable for training streaming speech recognition. Currently, about 500 hours of hand-transcribed speech are used to train the acoustic model. That is not such a big corpus: for comparison, the popular Switchboard corpus, often used for research purposes, contains approximately 300 hours of live, spontaneous conversations. Of course, a larger corpus helps improve the quality of the trained model, but we focus on correct data preparation and accurate transcription modeling, which allows us to train with acceptable quality on a relatively small corpus.

A few words about how the recognition module works (we described this in detail some time ago). The recorded speech stream is cut into 20 ms frames, the signal spectrum is scaled, and after a series of transformations, MFCCs (mel-frequency cepstral coefficients) are obtained for each frame.
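
As a rough illustration of this front end, here is a sketch that extracts MFCCs from 20 ms frames using the librosa library. The sample rate, coefficient count, and filterbank size are typical textbook values, not SpeechKit's actual configuration.

```python
# Sketch of the front end described above: 20 ms frames -> MFCC vectors.
# Uses librosa; all numeric settings here are common defaults, assumed
# for illustration rather than taken from SpeechKit.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # mono, 16 kHz
frame = int(0.020 * sr)                          # 20 ms window = 320 samples
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,          # 13 cepstral coefficients per frame (assumption)
    n_fft=frame,        # 20 ms analysis window, as in the description
    hop_length=frame,   # non-overlapping frames
    n_mels=40,          # mel filterbank applied to the scaled spectrum
)
print(mfcc.shape)       # (13, number_of_frames)
```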

The coefficients are fed into the acoustic model, which calculates the probability distribution over approximately 4000 senones for each frame. A senone is the beginning, middle, or end of a phoneme.

The SpeechKit acoustic model is built on a combination of hidden Markov models and a deep feedforward neural network (feedforward DNN). This is an already proven solution, and in a previous article we described how abandoning Gaussian mixtures in favor of a DNN gave an almost twofold jump in quality.
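
A minimal sketch of such an acoustic model in PyTorch: a feedforward network that maps a stacked window of MFCC frames to a distribution over roughly 4000 senones. The layer sizes and the 11-frame context window are illustrative assumptions, not the production architecture.

```python
# Toy feedforward acoustic model: a context window of MFCC frames in,
# a probability distribution over ~4000 senones out. Layer sizes and
# the context width are assumptions for illustration.
import torch
import torch.nn as nn

N_MFCC, CONTEXT, N_SENONES = 13, 11, 4000   # 11-frame window is an assumption

acoustic_model = nn.Sequential(
    nn.Linear(N_MFCC * CONTEXT, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, N_SENONES),
    nn.LogSoftmax(dim=-1),                  # log P(senone | frame context)
)

frames = torch.randn(1, N_MFCC * CONTEXT)   # one stacked context window
log_posteriors = acoustic_model(frames)
print(log_posteriors.shape)                 # torch.Size([1, 4000])
```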

Then the first language model comes into play: several WFSTs (weighted finite-state transducers) turn senones into context-dependent phonemes, whole words are built from them using a pronunciation dictionary, and hundreds of hypotheses are obtained for each word.

Final processing takes place in the second language model. An RNN, a recurrent neural network, is connected to it, and this model ranks the obtained hypotheses, helping to choose the most plausible variant. A recurrent network is especially effective as a language model. When determining the context of each word, it can take into account the influence not only of the nearest words, as a feedforward network does (say, the two previous words for a trigram model), but also of more distant ones, as if "remembering" them.
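
The mechanics of this second-pass rescoring can be sketched as follows: a small recurrent language model assigns each hypothesis the sum of its per-word log-probabilities, and the highest-scoring one wins. The model below is untrained, so its scores are meaningless until fit on real text; it demonstrates only the ranking step, not SpeechKit's actual network.

```python
# Sketch of hypothesis rescoring with a recurrent LM. The model is
# untrained here, so the ranking is arbitrary until fit on real text;
# this only shows the mechanics of scoring and choosing a hypothesis.
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def sentence_log_prob(self, token_ids: list) -> float:
        x = torch.tensor([token_ids[:-1]])
        h, _ = self.rnn(self.embed(x))
        logp = torch.log_softmax(self.out(h), dim=-1)
        targets = torch.tensor([token_ids[1:]])
        # sum of log P(w_t | w_1 .. w_{t-1}) over the whole sentence
        return logp.gather(-1, targets.unsqueeze(-1)).sum().item()

vocab = {"<s>": 0, "</s>": 1, "recognize": 2, "speech": 3, "wreck": 4,
         "a": 5, "nice": 6, "beach": 7}
lm = RNNLanguageModel(len(vocab))
hypotheses = [["<s>", "recognize", "speech", "</s>"],
              ["<s>", "wreck", "a", "nice", "beach", "</s>"]]
scored = [(h, lm.sentence_log_prob([vocab[w] for w in h])) for h in hypotheses]
print(max(scored, key=lambda pair: pair[1])[0])  # highest-scoring hypothesis
```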

Recognition of long connected text is available in SpeechKit Cloud and the SpeechKit Mobile SDK: to use the new language model, select the topic "notes" in the query parameters.
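
For illustration, a hypothetical request to SpeechKit Cloud might look like the sketch below. Only the topic value "notes" comes from the text above; the endpoint URL, the other parameter names, and the audio header are assumptions, so consult the official SpeechKit Cloud documentation for the real interface.

```python
# Hypothetical sketch of a SpeechKit Cloud recognition request using the
# "notes" topic. Endpoint URL, parameter names other than "topic", and the
# Content-Type header are assumptions -- check the official docs.
import requests

with open("dictation.pcm", "rb") as f:
    audio = f.read()

response = requests.post(
    "https://asr.yandex.net/asr_xml",          # placeholder endpoint
    params={
        "key": "YOUR_API_KEY",                 # hypothetical credentials
        "uuid": "0123456789abcdef0123456789abcdef",
        "topic": "notes",                      # the new language model
        "lang": "ru-RU",
    },
    data=audio,
    headers={"Content-Type": "audio/x-pcm;bit=16;rate=16000"},
)
print(response.text)                           # XML with recognition variants
```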

Voice activation

The second key component of a voice interface is the voice activation system, which triggers the desired action in response to a key phrase. Without it, it is impossible to fully "free the hands" of the user. We developed our own voice activation module for SpeechKit. The technology is very flexible: a developer using the SpeechKit library can choose any key phrase for their application.

Contrast this with, for example, Google's approach: their developers use a deep neural network to recognize the catchphrase "Ok Google". A DNN gives high quality, but the activation system is limited to a single command, and a huge amount of data is needed for training. For example, the model for recognizing that familiar phrase was trained on the voices of more than 40,000 users who accessed their smartphones through Google Now.

With our approach, the voice activation module is, in fact, a miniature recognition system. It just works under harsher conditions. First, command recognition must happen on the device itself, without contacting the server, and the computing power of a smartphone is very limited. Power consumption is also critical: if a regular recognition module is turned on only for the time needed to process a specific request, the activation module works constantly, in standby mode, and must not drain the battery.

However, there is a concession: the activation system needs only a very small dictionary, because it is enough for it to understand a few key phrases, and the rest of the speech can simply be ignored. Therefore, the activation language model is much more compact. Most of the WFST states correspond to a certain part of our command, for example, "the beginning of the fourth phoneme". There are also "garbage" states that describe silence, extraneous noise, and all speech other than the key phrase. If a full-fledged recognition model in SpeechKit has tens of millions of states and takes up to 10 gigabytes, the voice activation model is limited to hundreds of states and fits in a few tens of kilobytes.

Therefore, a model for recognizing a new key phrase can be built without difficulty, allowing the system to scale quickly. There is one condition: the command must be long enough (preferably more than one word) and rare in everyday speech, in order to exclude false positives. "Please" is not good for voice activation, but "listen to my command" is fine.
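
The structure described above, states for each part of the command plus a garbage state, can be caricatured with a word-level state machine. The real module is a compact phoneme-level WFST; this toy sketch only shows the idea of advancing through the key phrase and falling back to garbage on everything else.

```python
# Toy illustration of the activation model's structure: a chain of states
# for each part of the key phrase, plus a "garbage" state that absorbs
# silence, noise, and all other speech. The real module is a phoneme-level
# WFST; this word-level sketch is a simplification.
KEY_PHRASE = ["listen", "to", "my", "command"]

def activated(word_stream) -> bool:
    state = 0                                   # position inside the phrase
    for word in word_stream:
        if word == KEY_PHRASE[state]:
            state += 1                          # advance along the command
            if state == len(KEY_PHRASE):
                return True                     # whole key phrase matched
        else:
            # fall back to the garbage state; retry the current word as a
            # possible new start of the phrase
            state = 1 if word == KEY_PHRASE[0] else 0
    return False

print(activated("ok please listen to my command now".split()))  # True
print(activated("please pass the salt".split()))                # False
```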

Together with the limited language model and a "light" acoustic model, recognizing the command is within the power of any smartphone. That leaves power consumption. The system has a built-in voice activity detector, which monitors the incoming audio stream for the appearance of a human voice. Other sounds are ignored, so in the background the power consumption of the activation module is limited to the microphone alone.
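
A voice activity detector can be as simple as an energy threshold over short frames, as in the toy sketch below. Yandex does not describe its detector's internals, so this is only a conceptual illustration with invented numbers.

```python
# Minimal energy-based voice activity detector, to illustrate the idea of
# gating the recognizer: frames below the threshold are treated as silence.
# SpeechKit's actual detector is not public; this is a toy sketch.
import numpy as np

def voice_active(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """frame: mono PCM samples in [-1, 1] covering ~20 ms."""
    rms = np.sqrt(np.mean(frame ** 2))    # root-mean-square energy
    return rms > threshold

sr = 16000
t = np.arange(int(0.020 * sr)) / sr
speech_like = 0.1 * np.sin(2 * np.pi * 200 * t)    # a loud 200 Hz tone
silence = 0.001 * np.random.randn(len(t))          # near-silent noise
print(voice_active(speech_like), voice_active(silence))  # True False
```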

Speech synthesis

The third main component of speech technology is speech synthesis (text-to-speech). The TTS solution in SpeechKit allows you to voice any text in a male or female voice, and even set the desired emotion. No other voice engine we know of on the market has this capability.

There are several fundamentally different speech synthesis technologies, and most modern systems use concatenative synthesis with the "unit selection" method. A pre-recorded voice sample is cut into certain constituent elements (for example, context-dependent phonemes), from which a speech base is composed. Then any desired words are assembled from individual units. The result is a plausible imitation of a human voice, but it is hard to listen to: the timbre jumps, and unnatural intonations and sharp transitions appear at the junctions between units. This is especially noticeable when voicing a long connected text. The quality of such a system can be improved by increasing the volume of the speech base, but this is long and painstaking work that requires a professional and very patient speaker. And the completeness of the base always remains the bottleneck of the system.

In SpeechKit, we decided to use statistical (parametric) speech synthesis based on hidden Markov models. The process is essentially similar to recognition, only in the opposite direction. The original text is passed to the G2P (grapheme-to-phoneme) module, where it is converted into a sequence of phonemes.

These then go into the acoustic model, which generates vectors describing the spectral characteristics of each phoneme. These numbers are passed to the vocoder, which synthesizes the sound.
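
The whole parametric pipeline (text to phonemes, phonemes to spectral parameter vectors, parameters to sound) can be sketched with stubs, as below. Every component here is a hypothetical placeholder; the real G2P module, acoustic model, and vocoder are far more sophisticated.

```python
# Skeleton of the parametric synthesis pipeline described above:
# text -> G2P -> per-phoneme acoustic parameters -> vocoder -> audio.
# All components are stubs with hypothetical behavior.
import numpy as np

G2P_RULES = {"c": "k"}   # toy grapheme-to-phoneme map; real rules are complex

def g2p(text: str) -> list:
    """Toy grapheme-to-phoneme conversion via a lookup table."""
    return [G2P_RULES.get(ch, ch) for ch in text.lower() if ch.isalpha()]

def acoustic_params(phonemes: list) -> np.ndarray:
    """Stub acoustic model: one spectral-feature vector per phoneme."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(phonemes), 25))   # e.g. 25 spectral features

def vocoder(params: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Stub vocoder: turns each feature vector into 50 ms of audio."""
    t = np.arange(int(0.05 * sr)) / sr
    return np.concatenate(
        [np.sin(2 * np.pi * 120 * t) * (1 + 0.01 * p.sum()) for p in params])

audio = vocoder(acoustic_params(g2p("dacha")))
print(audio.shape)   # samples of synthesized (toy) speech
```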

The timbre of such a voice is somewhat "computerized", but it has natural and smooth intonations. At the same time, the smoothness of the speech does not depend on the volume or length of the text being read, and the voice is easy to adjust. It is enough to specify one key in the request parameters, and the synthesis module will produce a voice with the corresponding emotional coloring. Of course, no unit selection system can do this.
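
As an illustration, a synthesis request with an emotion key might look like the following sketch. The endpoint, speaker name, and parameter values are assumptions; only the idea of a single emotion parameter comes from the text above.

```python
# Hypothetical sketch of a synthesis request with an emotion key, as the
# text describes. Endpoint, parameter names, and values are assumptions --
# consult the official SpeechKit Cloud documentation.
import requests

response = requests.get(
    "https://tts.voicetech.yandex.net/generate",   # placeholder endpoint
    params={
        "key": "YOUR_API_KEY",        # hypothetical credentials
        "text": "Привет, как дела?",
        "lang": "ru-RU",
        "speaker": "jane",            # assumed voice name
        "emotion": "good",            # the single key that sets the coloring
        "format": "wav",
    },
)
with open("speech.wav", "wb") as f:
    f.write(response.content)
```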

For the voice model to be able to reproduce various emotions, it had to be trained in the right way. So during recording, our colleague Evgenia, whose voice can be heard in SpeechKit, read her lines in turn in a neutral voice, a joyful one and, on the contrary, an annoyed one. During training, the system identified and described the parameters and characteristics of the voice corresponding to each of these states.

Not all voice modifications are based on learning. For example, SpeechKit also allows you to color the synthesized voice with the "drunk" and "ill" parameters. Our developers felt sorry for Zhenya: she did not have to get drunk before recording or run around in the cold to catch a proper cold.

For the drunk voice, speech is slowed down in a special way: each phoneme sounds about twice as slow, which produces the characteristic effect. For the ill voice, the voicing threshold rises: in effect, this models what happens to the vocal cords of a person with laryngitis. Whether a phoneme is voiced depends on whether air passes freely through the vocal tract or has to get past the vibrating vocal cords. In the "ill" mode, each phoneme is voiced with lower probability, which makes the voice hoarse and strained.
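
Both effects reduce to simple transformations of per-phoneme parameters, which the toy sketch below makes explicit: "drunk" doubles each phoneme's duration, and "ill" scales down its probability of being voiced. The data structures and numbers are invented for illustration.

```python
# Toy model of the two effects described above: "drunk" doubles each
# phoneme's duration, "ill" lowers the probability that a phoneme is
# voiced. All values here are illustrative, not SpeechKit internals.
import random

phonemes = [
    # (symbol, duration in ms, probability of being voiced)
    ("p", 60, 0.05), ("r", 70, 0.95), ("i", 90, 0.99), ("v", 65, 0.90),
]

def drunk(phs):
    """Each phoneme sounds about twice as slow."""
    return [(s, d * 2, v) for s, d, v in phs]

def ill(phs, devoicing: float = 0.5):
    """Raise the voicing threshold: phonemes are less likely to be voiced."""
    return [(s, d, v * (1 - devoicing)) for s, d, v in phs]

random.seed(1)
for sym, dur, p_voiced in ill(drunk(phonemes)):
    voiced = random.random() < p_voiced
    print(f"{sym}: {dur} ms, {'voiced' if voiced else 'hoarse/devoiced'}")
```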

The statistical method also allows the system to be expanded quickly. In the unit selection model, adding a new voice requires creating a separate speech base: the speaker must record many hours of speech while flawlessly maintaining the same intonation. In SpeechKit, to create a new voice, it is enough to record about two hours of speech, approximately 1800 special, phonetically balanced sentences.

Extraction of semantic objects

It is important not only to convert the words a person utters into letters, but also to fill them with meaning. The fourth technology, available in a limited form in SpeechKit Cloud, does not deal with voice directly: it starts working after the spoken words have been recognized. But a complete stack of speech technologies is impossible without it. This is the extraction of semantic objects from natural speech, which outputs not just recognized but already marked-up text.

SpeechKit currently implements the extraction of dates and times, full names, and addresses. The hybrid system combines context-free grammars, keyword dictionaries, statistical data from search and various Yandex services, and machine learning algorithms. For example, in the phrase "Let's go to Leo Tolstoy Street", the word "street" helps the system determine the context, after which the corresponding object is found in the Yandex.Maps database.

In Dictation, we built the voice text editing function on top of this technology. The approach to extracting entities is fundamentally new, with an emphasis on ease of configuration: you do not need to know programming to set up the system.

The system takes as input a list of different types of objects and examples of phrases from live speech that describe them. Patterns are then formed from these examples using the Pattern Mining method; they take into account the initial form, roots, and morphological variants of words. The next step is to provide examples of the use of the selected objects in different combinations, which helps the system understand the context. Based on these examples, a hidden Markov model is built, where the objects picked out in the user's utterance become the observed states, and the corresponding objects from the subject area, whose meaning is already known, become the hidden states.

For example, take two phrases: "insert 'hello friend' at the beginning" and "paste from clipboard". The system determines that in the first case, "insert" (an editing action) is followed by arbitrary text, while in the second it is followed by an object it knows ("clipboard"), and it reacts differently to these commands. In a traditional system, this would require manually writing rules or grammars; in the new Yandex technology, context analysis happens automatically.
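
The distinction the system learns can be shown with a deliberately hard-coded toy: after the editing action, a known object ("clipboard") and arbitrary literal text are routed to different interpretations. In the real system this mapping is inferred from examples via the hidden Markov model, not written by hand as it is here.

```python
# Toy illustration of the paste example: after the editing action, a known
# object ("clipboard") and arbitrary literal text get different readings.
# The real system learns this from examples; the rules below are hard-coded
# only to show the behavior (position phrases like "at the beginning" are
# not parsed in this sketch).
KNOWN_OBJECTS = {"clipboard"}

def parse_command(utterance: str) -> dict:
    words = utterance.lower().split()
    if words[0] not in {"insert", "paste"}:
        return {"action": "unknown", "utterance": utterance}
    rest = " ".join(words[1:])
    if rest.startswith("from ") and rest[5:] in KNOWN_OBJECTS:
        return {"action": "paste", "source": rest[5:]}   # known object
    return {"action": "paste", "literal_text": rest}     # arbitrary text

print(parse_command("paste from clipboard"))
print(parse_command("insert hello friend at the beginning"))
```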

Autopunctuation

When you dictate something, you expect to see punctuation marks in the resulting text, and they should appear automatically, so that you do not have to talk to the interface in telegraph style: "Dear friend - comma - how are you - question mark". That is why SpeechKit is complemented by an automatic punctuation system.

In speech, the role of punctuation marks is played by intonational pauses. So initially we tried to build a full acoustic and language model for recognizing them. Each punctuation mark was assigned a phoneme, and from the system's point of view, new "words" appeared in the recognized speech, consisting entirely of such "punctuation" phonemes, in the places where there were pauses or where intonation changed in a certain way.

A great difficulty arose with the training data: most corpora contain already normalized texts in which punctuation marks are omitted. There is also almost no punctuation in the texts of search queries. We turned to Ekho Moskvy, who manually transcribe all their broadcasts, and they allowed us to use their archive. It quickly became clear that these transcriptions were unsuitable for our purposes: they are made close to the text, but not verbatim, and therefore not fit for machine learning. The next attempt was with audiobooks, but in their case the quality was, on the contrary, too high. Well-trained voices reciting text expressively are too far from real life, and the results of training on such data could not be applied to spontaneous dictation.

The second problem was that the chosen approach hurt overall recognition quality. For each word, the language model considers several neighboring words to correctly determine the context, and the additional "punctuation" words inevitably narrowed it. Several months of experiments led nowhere.

We had to start from scratch, and we decided to insert punctuation marks at the post-processing stage instead. We began with one of the simplest methods, which, oddly enough, ended up showing quite acceptable results. Pauses between words receive one of the labels: space, period, comma, question mark, exclamation point, or colon. To predict which label corresponds to a particular pause, the conditional random fields (CRF) method is used. To determine the context, the three preceding and two following words are taken into account, and these simple rules allow signs to be placed with fairly high accuracy. But we continue to experiment with full-fledged models that will be able to correctly interpret human intonation in terms of punctuation already at the recognition stage.
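
A sketch of this pause-labeling setup, using the sklearn-crfsuite package and the context window described above (three words before the pause, two after). The toy training data stands in for a real punctuated corpus, and the feature set is our own minimal choice, not Yandex's.

```python
# Sketch of punctuation restoration with conditional random fields.
# Uses the sklearn-crfsuite package; the tiny training set and features
# are illustrative stand-ins for a real punctuated corpus.
import sklearn_crfsuite

def pause_features(words, i):
    """Features for the pause that follows words[i]."""
    feats = {"bias": 1.0}
    # words i-2..i precede the pause (three words), i+1..i+2 follow (two)
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(words):
            feats[f"w[{offset}]"] = words[j].lower()
    return feats

def featurize(sentence):
    words = sentence.split()
    return [pause_features(words, i) for i in range(len(words))]

# One label per pause: "Dear friend, how are you?" / "I am fine, thanks."
train_texts = ["dear friend how are you", "i am fine thanks"]
train_labels = [["SPACE", "COMMA", "SPACE", "SPACE", "QUESTION"],
                ["SPACE", "SPACE", "COMMA", "PERIOD"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(t) for t in train_texts], train_labels)
print(crf.predict([featurize("dear friend how are you")]))
```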

Future plans

Today, SpeechKit is actively used to solve "combat" tasks in mass services for end users. The next milestone is learning to recognize spontaneous speech in a live stream, so that you can transcribe an interview in real time or automatically take notes at a lecture, receiving already marked-up text with highlighted theses and key facts. This is a huge and very science-intensive task that no one in the world has yet managed to solve, and we don't like any other kind!

Feedback is very important for the development of SpeechKit.

Yandex has released a new application, Yandex.Dictation, that lets you evaluate the company's speech technologies. The program takes texts from dictation and executes voice commands, so the user no longer has to touch the keyboard to write a note or a short message.

Yandex.Dictation uses technologies from Yandex SpeechKit, the company's cloud-based speech platform, including voice activation, speech recognition, voice control, punctuation, and speech synthesis. Yandex SpeechKit is designed to work with Russian and Turkish; it supports short queries on any subject, geo-queries, and dictation of short texts. According to Yandex, recognition latency does not exceed one second.

All texts typed by voice are automatically saved in the application and, after signing in, in the Yandex.Disk service. Any entry can be sent via SMS or email, or published on social networks.

For the application to understand the user well, you need to dictate clearly into the microphone, separating words from one another and pronouncing the endings. If a phrase is recognized incorrectly, it can be corrected using the "Corrector" button; this helps improve recognition quality.

Yandex.Dictation allows you to edit the typed text by voice. For example, you can say "Delete the last word", "Start on a new line", or "Add a funny emoji". The application not only recognizes words but also understands their meaning, so the list of commands is not fixed. The application also pays attention to pauses in speech and places punctuation marks accordingly.