The Google Cloud Speech-to-Text API enables you to convert audio to text by applying neural network models in an easy-to-use API. The API recognizes over 80 languages and variants to support your global user base. You can transcribe the text of users dictating to an application’s microphone or enable command-and-control through voice, among many other use cases.
The Cloud Speech API provides audio transcription. It is accessible via the gl_speech() function, which takes the following arguments:
audio_source - a local file in the correct format, or a Google Cloud Storage URI. This can also be a Wave class object.
encoding - the format of the sound file. LINEAR16 is the common .wav format; other formats are also supported.
sampleRate - this needs to be set to the rate your file was recorded at.
languageCode - the language spoken, given as a language code such as "en-GB".
speechContexts - keywords you can supply to help the transcription with some context.
The API returns a list of two data.frame tibbles - transcript and timings.
Access them via the returned object with $transcript and $timings:
    return <- gl_speech(test_audio, languageCode = "en-GB")

    return$transcript
    # A tibble: 1 x 2
    #  transcript confidence
    #  <chr> <chr>
    #1 to administer medicine to animals is frequently a very difficult matter and yet sometimes it's necessary to do so 0.9711006

    return$timings
    #  startTime endTime word
    #1 0s        0.100s  to
    #2 0.100s    0.700s  administer
    #3 0.700s    0.700s  medicine
    #4 0.700s    1.200s  to
    # etc...
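The printed tibble shows confidence typed as <chr>, so it needs converting before any numeric comparison. The following is a minimal sketch that runs offline on a hand-built data.frame mimicking the transcript element shown above; the object name transcript and the 0.9 threshold are illustrative assumptions, not part of the package.

```r
# Stand-in for return$transcript, copied from the output shown above
transcript <- data.frame(
  transcript = "to administer medicine to animals is frequently a very difficult matter and yet sometimes it's necessary to do so",
  confidence = "0.9711006",
  stringsAsFactors = FALSE
)

# confidence arrives as character, so convert it before comparing
transcript$confidence <- as.numeric(transcript$confidence)

# Keep only transcriptions above an arbitrary, illustrative threshold
confident <- transcript[transcript$confidence > 0.9, ]
nrow(confident)
```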
A test audio file is installed with the package which reads:
“To administer medicine to animals is frequently a very difficult matter, and yet sometimes it’s necessary to do so”
The file is sourced from the University of Southampton’s speech detection group (http://www-mobile.ecs.soton.ac.uk/) and is fairly difficult for computers to parse, as we see below:
    library(googleLanguageR)

    ## get the sample source file
    test_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")

    ## it's not perfect but...:)
    gl_speech(test_audio)$transcript

    ## get alternative transcriptions
    gl_speech(test_audio, maxAlternatives = 2L)$transcript

    gl_speech(test_audio, languageCode = "en-GB")$transcript

    ## help it out with context for "frequently"
    gl_speech(test_audio,
              languageCode = "en-GB",
              speechContexts = list(phrases = list("is frequently a very difficult")))$transcript
The API supports timestamps on when words are recognised. These are output into a second data.frame that holds three entries: startTime, endTime and the word.
    str(result$timings)
    #'data.frame': 152 obs. of 3 variables:
    # $ startTime: chr "0s" "0.100s" "0.500s" "0.700s" ...
    # $ endTime  : chr "0.100s" "0.500s" "0.700s" "0.900s" ...
    # $ word     : chr "a" "Dream" "Within" "A" ...

    result$timings
    #  startTime endTime word
    #1 0s        0.100s  a
    #2 0.100s    0.500s  Dream
    #3 0.500s    0.700s  Within
    #4 0.700s    0.900s  A
    #5 0.900s    1s      Dream
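Because startTime and endTime come back as character strings with a trailing "s", they need converting before any arithmetic. The sketch below works on a hand-built stand-in for the first rows of the timings table shown above, so it runs without calling the API; the helper name to_seconds and the added columns are illustrative assumptions, not part of the package.

```r
# Stand-in for result$timings, copied from the first rows shown above
timings <- data.frame(
  startTime = c("0s", "0.100s", "0.500s", "0.700s", "0.900s"),
  endTime   = c("0.100s", "0.500s", "0.700s", "0.900s", "1s"),
  word      = c("a", "Dream", "Within", "A", "Dream"),
  stringsAsFactors = FALSE
)

# Strip the trailing "s" and convert to numeric seconds
to_seconds <- function(x) as.numeric(sub("s$", "", x))

timings$start    <- to_seconds(timings$startTime)
timings$end      <- to_seconds(timings$endTime)
timings$duration <- timings$end - timings$start

timings[, c("word", "start", "end", "duration")]
```

With numeric columns in place, the timings can be filtered or joined against other time-based data in the usual way.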
You can also send in other arguments that help shape the output, such as speaker diarization (labelling different speakers). To use such custom configurations, create a RecognitionConfig object. This can be done via R lists, which are converted to JSON via library(jsonlite). An example is shown below:
    ## Use a custom configuration
    my_config <- list(encoding = "LINEAR16",
                      diarizationConfig = list(
                        enableSpeakerDiarization = TRUE,
                        minSpeakerCount = 2,
                        maxSpeakerCount = 3
                      ))

    # languageCode is required, so will be added if not in your custom config
    gl_speech(my_audio, languageCode = "en-US", customConfig = my_config)
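To see how a nested R list of this kind maps onto JSON, it can be serialised directly with jsonlite. This is only a sketch: the options passed to toJSON() here are assumptions for illustration, and the package may serialise the configuration differently internally.

```r
library(jsonlite)

my_config <- list(
  encoding = "LINEAR16",
  diarizationConfig = list(
    enableSpeakerDiarization = TRUE,
    minSpeakerCount = 2,
    maxSpeakerCount = 3  # field name as in the API's speaker diarization config
  )
)

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays.
# ASSUMPTION: these options are chosen for illustration only.
config_json <- toJSON(my_config, auto_unbox = TRUE, pretty = TRUE)
config_json

# Round-trip to confirm the nested structure survives serialisation
str(fromJSON(config_json))
```

Nested R lists become nested JSON objects, which is why the configuration can be written as plain lists.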
For speech files longer than 60 seconds, or if you don’t want your results straight away, set asynch = TRUE in the call to the API.
This will return an object of class "gl_speech_op", which should be passed to the gl_speech_op() function to check the status of the task. If the task is finished, it will return an object of the same form as the non-asynchronous case.
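Checking the status repeatedly can be wrapped in a small polling helper. The sketch below is not part of the package: wait_for_speech, its arguments and the fake checker are hypothetical names, and it assumes that an unfinished task comes back as another object of class "gl_speech_op". The demonstration uses a fake checker so that it runs without credentials.

```r
# Poll an asynchronous operation until it yields a finished result.
# `check` is the function used to query the operation; in real use this
# would be gl_speech_op. ASSUMPTION: an unfinished task is returned as
# another object of class "gl_speech_op".
wait_for_speech <- function(op, check, max_tries = 10, wait_secs = 5) {
  for (i in seq_len(max_tries)) {
    res <- check(op)
    if (!inherits(res, "gl_speech_op")) {
      return(res)  # finished: same form as the synchronous result
    }
    op <- res
    Sys.sleep(wait_secs)
  }
  stop("Operation did not finish after ", max_tries, " checks")
}

# Offline demonstration: a fake checker that finishes on the third call
calls <- 0
fake_check <- function(op) {
  calls <<- calls + 1
  if (calls < 3) return(op)
  list(transcript = "done", timings = NULL)
}

fake_op <- structure(list(name = "hypothetical-operation"),
                     class = "gl_speech_op")
result <- wait_for_speech(fake_op, check = fake_check, wait_secs = 0)
names(result)
```

With a real operation, the same helper would be called with the object returned by gl_speech(..., asynch = TRUE) and check = gl_speech_op, using a wait of a few seconds between checks.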