R Wrapper for Google’s Compact Language Detector 3
Google’s Compact Language Detector 3 is a neural network model for language identification and the successor of CLD2 (available from CRAN). This version is still experimental and uses a novel algorithm with different properties and outcomes. For more information see: https://github.com/google/cld3#readme
detect_language() is vectorised and guesses the language of each string in text, or returns NA if the language could not be reliably determined.
> library(cld3)
> example(cld3)

cld3> # Vectorized best guess
cld3> detect_language(c("To be or not to be?", "Ce n'est pas grave.", "猿も木から落ちる"))
[1] "en" "fr" "ja"
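Because detect_language() returns one value per input string, undetermined results can be filtered with ordinary vector operations. The following is a minimal sketch; the variable names (texts, langs, undetermined) and the sample strings are illustrative assumptions, and which inputs actually come back as NA depends on the model.

```r
library(cld3)

# Illustrative input: the empty string is assumed to be hard to classify
texts <- c("To be or not to be?", "", "Ce n'est pas grave.")

# One guess per element; NA where the language could not be determined
langs <- detect_language(texts)

# Logical mask of the strings without a reliable guess
undetermined <- is.na(langs)

# Keep only the strings for which a language was reported
data.frame(text = texts[!undetermined], language = langs[!undetermined])
```

The mask has the same length as the input, so it can also be used to flag the undetermined strings for manual review instead of dropping them.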
detect_language_mixed() is not vectorised and detects all languages inside the entire character vector as a whole.
cld3> # Multiple languages in one text
cld3> detect_language_mixed("This piece of text is in English. Този текст е на Български.", size = 3)
  language probability reliable proportion
1       bg   0.9173891     TRUE  0.5853658
2       en   0.9999790     TRUE  0.4146341
3      und   0.0000000    FALSE  0.0000000
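Since detect_language_mixed() treats its whole input as a single text, getting a separate breakdown for each document means calling it once per element. A minimal sketch using lapply(); the names texts and results and the sample strings are illustrative assumptions, not part of the package.

```r
library(cld3)

# Illustrative documents, each analysed on its own
texts <- c(
  "This piece of text is in English. Този текст е на Български.",
  "Ce n'est pas grave."
)

# Call detect_language_mixed() once per document instead of once for the
# whole vector; size = 2 asks for at most two candidate languages each
results <- lapply(texts, detect_language_mixed, size = 2)
names(results) <- texts

# Each list element is a data frame like the one shown above
results[[1]]
```

Passing the whole vector in one call would instead report the languages found across all documents taken together.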
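Binary packages for OS-X or Windows can be installed directly from CRAN with install.packages("cld3").
Installation from source on Linux or OS-X requires Google's Protocol Buffers library. On Debian or Ubuntu install libprotobuf-dev and protobuf-compiler: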
sudo apt-get install -y libprotobuf-dev protobuf-compiler
On Fedora we need protobuf-devel:
sudo yum install protobuf-devel
On CentOS / RHEL we install protobuf-devel via EPEL:
sudo yum install epel-release
sudo yum install protobuf-devel
On OS-X use protobuf from Homebrew:
brew install protobuf