CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes (e.g. 80% English and 20% French out of 1000 bytes).
This package includes a bundled version of libcld2, so the library does not need to be installed separately.
detect_language() returns the best guess, or NA if the language could not be reliably determined.
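A minimal sketch of a basic call, assuming the cld2 package is installed. The sample sentences are illustrative only, and the exact form of the returned label (language code or full name) may depend on the package version and its arguments.

```r
library(cld2)

# Illustrative inputs, not taken from the package documentation.
english <- "To be or not to be, that is the question."
french  <- "Ce n'est pas grave, on verra bien demain matin."

# Each call returns a single best guess, or NA when detection is not reliable.
guess_en <- detect_language(english)
guess_fr <- detect_language(french)

print(guess_en)
print(guess_fr)
```

Very short strings are more likely to come back as NA, so it is worth checking the result with is.na() before using it.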
Set plain_text = FALSE if your input contains HTML:
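As a hedged illustration of HTML input, the snippet below passes a small, made-up HTML string with plain_text = FALSE so that the markup is not treated as text to classify. The HTML content is an assumption for demonstration purposes.

```r
library(cld2)

# Hypothetical HTML fragment; the markup itself should not influence the guess.
html <- paste0(
  "<html><body><p>",
  "Dit is een korte Nederlandse zin in een eenvoudige HTML-pagina.",
  "</p></body></html>"
)

# plain_text = FALSE tells the detector that the input contains HTML.
guess_html <- detect_language(html, plain_text = FALSE)
print(guess_html)
```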
Use detect_language_multi() to get detailed classification output.
This shows the top three language guesses and the proportion of text that was classified as each language. The
bytes attribute shows the total number of text bytes that were classified, and
reliable is the result of a complex calculation on whether the top-ranked language is sufficiently more probable than the second-best language.
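The detailed output described above can be inspected with a sketch like the following. It assumes the cld2 package is installed. The function name detect_language_multi() is taken from the text; as an assumption, the code falls back to detect_language_mixed(), the name used in some other releases of the package, when the first name is not available. The mixed-language sample text is made up for illustration.

```r
library(cld2)

# Resolve the detailed-classification function under either name (assumption:
# the installed release exposes one of the two).
ns <- asNamespace("cld2")
detect_multi <- if (exists("detect_language_multi", envir = ns, inherits = FALSE)) {
  get("detect_language_multi", envir = ns)
} else {
  get("detect_language_mixed", envir = ns)
}

# Illustrative mixed-language input: one English and one French sentence.
mixed <- paste(
  "The quick brown fox jumps over the lazy dog near the river bank.",
  "Le renard brun saute par-dessus le chien paresseux près de la rivière."
)

res <- detect_multi(mixed)

# str() shows the top-three classification together with the
# bytes and reliable values described above.
str(res)
```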