mastodon/app/lib/language_detector.rb

# frozen_string_literal: true

class LanguageDetector
  include Singleton

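  # Minimum number of whitespace-separated words before detection is attempted
  WORDS_THRESHOLD        = 4
  # Non-Latin scripts for which detection is attempted even on short input
  RELIABLE_CHARACTERS_RE = /[\p{Hebrew}\p{Arabic}\p{Syriac}\p{Thaana}\p{Nko}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\p{Thai}]+/m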

  def initialize
    @identifier = CLD3::NNetLanguageIdentifier.new(1, 2048)
  end

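  # Returns a language code as a symbol, or nil when the prepared text is
  # blank or when detection fails for a remote account.
  def detect(text, account)
    input_text = prepare_text(text)

    return if input_text.blank?

    detect_language_code(input_text) || default_locale(account)
  end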

  def language_names
    # Memoize: the list of supported languages does not change at runtime
    @language_names ||= CLD3::TaskContextParams::LANGUAGE_NAMES.map { |name| iso6391(name.to_s).to_sym }.uniq
  end

  private

  def prepare_text(text)
    simplify_text(text).strip
  end

  def unreliable_input?(text)
    !reliable_input?(text)
  end

  def reliable_input?(text)
    sufficient_text_length?(text) || language_specific_character_set?(text)
  end

  def sufficient_text_length?(text)
    text.split(/\s+/).size >= WORDS_THRESHOLD
  end

  # True when more than 30% of the characters in the text are matched by
  # RELIABLE_CHARACTERS_RE.
  def language_specific_character_set?(text)
    words = text.scan(RELIABLE_CHARACTERS_RE)

    if words.present?
      words.reduce(0) { |acc, elem| acc + elem.size }.to_f / text.size > 0.3
    else
      false
    end
  end

  def detect_language_code(text)
    return if unreliable_input?(text)

    result = @identifier.find_language(text)

    iso6391(result.language.to_s).to_sym if result&.reliable?
  end

  def iso6391(bcp47)
    iso639 = bcp47.split('-').first

    # CLD3 returns grandfathered language code for Hebrew
    return 'he' if iso639 == 'iw'

    ISO_639.find(iso639).alpha2
  end

  def simplify_text(text)
    new_text = remove_html(text)
    new_text.gsub!(FetchLinkCardService::URL_PATTERN, '')
    new_text.gsub!(Account::MENTION_RE, '')
    new_text.gsub!(Tag::HASHTAG_RE) do |string|
      # Drop '#', turn '_' into a space, and split camelCase and letter-digit boundaries into words
      string.gsub(/[#_]/, '#' => '', '_' => ' ').gsub(/[a-z][A-Z]|[a-zA-Z][\d]/) { |s| s.insert(1, ' ') }.downcase
    end
    new_text.gsub!(/:#{CustomEmoji::SHORTCODE_RE_FRAGMENT}:/, '')
    new_text.gsub!(/\s+/, ' ')
    new_text
  end

  def new_scrubber
    scrubber = Rails::Html::PermitScrubber.new
    scrubber.tags = %w(br p)
    scrubber
  end

  def scrubber
    @scrubber ||= new_scrubber
  end

  def remove_html(text)
    text = Loofah.fragment(text).scrub!(scrubber).to_s
    text.gsub!('<br>', "\n")
    text.gsub!('</p><p>', "\n\n")
    text.gsub!(/(^<p>|<\/p>$)/, '')
    text
  end

  def default_locale(account)
    # Remote accounts get nil, so an undetected language stays unknown
    return unless account.local?

    account.user_locale&.to_sym || I18n.default_locale
  end
end
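
The reliability gate in `reliable_input?` can be tried outside Rails. The following is a minimal sketch, assuming plain Ruby with no CLD3 or ActiveSupport loaded: the two constants are copied from the class above, `SCRIPT_RATIO` is a hypothetical name for the literal `0.3`, and `Array#sum` stands in for the original `reduce`. It only shows when detection would be attempted, not what language would be returned.

```ruby
# Standalone sketch of the reliability gate (not the Mastodon class itself).
WORDS_THRESHOLD = 4
RELIABLE_CHARACTERS_RE = /[\p{Hebrew}\p{Arabic}\p{Syriac}\p{Thaana}\p{Nko}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\p{Thai}]+/m
SCRIPT_RATIO = 0.3 # hypothetical constant name; the class uses the literal 0.3

# At least WORDS_THRESHOLD whitespace-separated words.
def sufficient_text_length?(text)
  text.split(/\s+/).size >= WORDS_THRESHOLD
end

# More than SCRIPT_RATIO of the characters belong to the listed scripts.
def language_specific_character_set?(text)
  return false if text.empty?

  matched = text.scan(RELIABLE_CHARACTERS_RE).sum(&:size)
  matched.to_f / text.size > SCRIPT_RATIO
end

def reliable_input?(text)
  sufficient_text_length?(text) || language_specific_character_set?(text)
end

puts reliable_input?('hello world')        # two Latin words
puts reliable_input?('one two three four') # reaches the word threshold
puts reliable_input?('こんにちは')           # one "word", all Hiragana
```

Under these assumptions, short Latin-script text is rejected while a single run of Hiragana passes, which matches the intent described in the commit messages for #10276 and #10376.
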
Language detection refactor (#2099) * Extract detect_language to separate class * Use default locale, not just en * Add spec to confirm that whatlanguage cant identify empty string * Allow account locale to override default in language detector * PostStatusService supplies an account to detect language 2017-04-18 22:20:12 +02:00			`# frozen_string_literal: true`

			`class LanguageDetector`
Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`include Singleton`
Language detection refactor (#2099) * Extract detect_language to separate class * Use default locale, not just en * Add spec to confirm that whatlanguage cant identify empty string * Allow account locale to override default in language detector * PostStatusService supplies an account to detect language 2017-04-18 22:20:12 +02:00
Change language detector threshold from 140 characters to 4 words (#10376) Add `lang` attribute to statuses in web UI 2019-03-26 01:23:59 +01:00			`WORDS_THRESHOLD = 4`
Fix Thai being skipped from language detection (#13989) Thai does not separate words by spaces, so I figured out it should be in 'reliable characters regexp' that denotes languages that do the same. Related #13891. 2020-06-25 22:45:01 +02:00			`RELIABLE_CHARACTERS_RE = /[\p{Hebrew}\p{Arabic}\p{Syriac}\p{Thaana}\p{Nko}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\p{Thai}]+/m`
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00
Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def initialize`
Use CLD3 (#2949) Compact Language Detector v3 (CLD3) is the successor of CLD2, which was used in the previous implementation. CLD3 includes improvements since CLD2, and supports newer compilers. On the other hand, it has additional requirements and cld3-ruby, the FFI of CLD3 for Ruby, is still new and may be still immature. Though CLD3 is named after CLD2, it is implemented with a neural network model, different from the old implementation, which is based on a Naïve Bayesian classifier. CLD3 supports newer compilers, such as GCC 6. CLD2 is not compatible with GCC 6 because it assigns negative values to variables typed unsigned. (see internal/cld_generated_cjk_uni_prop_80.cc) The support for GCC 6 and newer compilers is essential today, when some server operating systems such as Ubuntu Server 16.10 have GCC 6 by default. On the other hand, CLD3 requires C++11 support. Environments with old compilers such as Ubuntu Server 14.04 need to update the system or install a newer compiler. CLD3 needs protocol buffers as a new dependency. However, it is not considered problematic because major server operating systems, CentOS and Ubuntu Server, provide them. The FFI cld3-ruby was written by me (Akihiko Odaki) for use in Mastodon. It is still new and may be immature, but confirmed to pass existing tests. 2017-05-09 19:58:03 +02:00			`@identifier = CLD3::NNetLanguageIdentifier.new(1, 2048)`
Language detection refactor (#2099) * Extract detect_language to separate class * Use default locale, not just en * Add spec to confirm that whatlanguage cant identify empty string * Allow account locale to override default in language detector * PostStatusService supplies an account to detect language 2017-04-18 22:20:12 +02:00			`end`

Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def detect(text, account)`
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`input_text = prepare_text(text)`
Fix language detection of non-latin alphabets even at few characters (#10276) 2019-03-15 05:07:09 +01:00
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`return if input_text.blank?`
Leave unknown language as nil if account is remote (#8861) * Force use language detector if account is remote * Set unknown remote toot's language as nil 2018-10-05 19:17:46 +02:00
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`detect_language_code(input_text) || default_locale(account)`
Language detection refactor (#2099) * Extract detect_language to separate class * Use default locale, not just en * Add spec to confirm that whatlanguage cant identify empty string * Allow account locale to override default in language detector * PostStatusService supplies an account to detect language 2017-04-18 22:20:12 +02:00			`end`

Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def language_names`
Fix language detection of non-latin alphabets even at few characters (#10276) 2019-03-15 05:07:09 +01:00			`@language_names = CLD3::TaskContextParams::LANGUAGE_NAMES.map { |name| iso6391(name.to_s).to_sym }.uniq`
Remove usernames and hashtags from language detection (#3503) * Add failing specs for hashtag and username extraction in language detector * Remove usernames and hashtags from text before language detection * Handle multiple instances of special case, and reduce whitespace 2017-06-01 15:29:14 +02:00			`end`

Language detection refactor (#2099) * Extract detect_language to separate class * Use default locale, not just en * Add spec to confirm that whatlanguage cant identify empty string * Allow account locale to override default in language detector * PostStatusService supplies an account to detect language 2017-04-18 22:20:12 +02:00			`private`

Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def prepare_text(text)`
			`simplify_text(text).strip`
			`end`

Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`def unreliable_input?(text)`
Fix language detection of non-latin alphabets even at few characters (#10276) 2019-03-15 05:07:09 +01:00			`!reliable_input?(text)`
			`end`

			`def reliable_input?(text)`
			`sufficient_text_length?(text) || language_specific_character_set?(text)`
			`end`

			`def sufficient_text_length?(text)`
Change language detector threshold from 140 characters to 4 words (#10376) Add `lang` attribute to statuses in web UI 2019-03-26 01:23:59 +01:00			`text.split(/\s+/).size >= WORDS_THRESHOLD`
Fix language detection of non-latin alphabets even at few characters (#10276) 2019-03-15 05:07:09 +01:00			`end`

			`def language_specific_character_set?(text)`
			`words = text.scan(RELIABLE_CHARACTERS_RE)`

			`if words.present?`
Update ESLint and RuboCop in Code Climate (#12534) 2019-12-02 18:25:43 +01:00			`words.reduce(0) { |acc, elem| acc + elem.size }.to_f / text.size > 0.3`
Fix language detection of non-latin alphabets even at few characters (#10276) 2019-03-15 05:07:09 +01:00			`else`
			`false`
			`end`
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`end`

Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def detect_language_code(text)`
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`return if unreliable_input?(text)`
Bump cld3 from 3.2.6 to 3.3.0 (#13107) * Bump cld3 from 3.2.6 to 3.3.0 Bumps [cld3](https://github.com/akihikodaki/cld3-ruby) from 3.2.6 to 3.3.0. - [Release notes](https://github.com/akihikodaki/cld3-ruby/releases) - [Commits](https://github.com/akihikodaki/cld3-ruby/compare/v3.2.6...v3.3.0) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> * Fix compatibility with cld3 3.3.0 Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com> Co-authored-by: Eugen Rochko <eugen@zeonfederated.com> 2020-03-09 00:12:52 +01:00
Disable language detection for texts shorter than 140 characters (#8010) If the input text is blank after preparation (only mention, or only URL, or empty as in a media post), then use nil as language, since it's OK to show to everyone. Otherwise, always fall back to the server's default locale 2018-07-14 04:05:36 +02:00			`result = @identifier.find_language(text)`
Bump cld3 from 3.2.6 to 3.3.0 (#13107) * Bump cld3 from 3.2.6 to 3.3.0 Bumps [cld3](https://github.com/akihikodaki/cld3-ruby) from 3.2.6 to 3.3.0. - [Release notes](https://github.com/akihikodaki/cld3-ruby/releases) - [Commits](https://github.com/akihikodaki/cld3-ruby/compare/v3.2.6...v3.3.0) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> * Fix compatibility with cld3 3.3.0 Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com> Co-authored-by: Eugen Rochko <eugen@zeonfederated.com> 2020-03-09 00:12:52 +01:00
			`iso6391(result.language.to_s).to_sym if result&.reliable?`
Fix language filter codes (#4841) * Fix language filter codes CLD3 returns BCP-47 language identifier, filter settings expect identifiers in the ISO 639-1 format. Convert between formats, and exclude duplicate languages from filter choices (zh-CN->zh) * Fix zh name 2017-09-08 12:32:22 +02:00			`end`

			`def iso6391(bcp47)`
			`iso639 = bcp47.split('-').first`

			`# CLD3 returns grandfathered language code for Hebrew`
			`return 'he' if iso639 == 'iw'`

			`ISO_639.find(iso639).alpha2`
Language improvements, replace whatlanguage with CLD (#2753) * add failing en specs * add cld2 gem * Replace WhatLanguage with CLD 2017-05-03 16:59:31 +02:00			`end`

Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def simplify_text(text)`
Improve language filter (#5724) * Scrub text of html before detecting language. * Detect language on statuses coming from activitypub. * Fix rubocop comments. * Remove custom emoji from text before language detection 2017-11-16 13:51:38 +01:00			`new_text = remove_html(text)`
			`new_text.gsub!(FetchLinkCardService::URL_PATTERN, '')`
			`new_text.gsub!(Account::MENTION_RE, '')`
Change language detection to include hashtags as words (#11341) 2019-07-18 03:02:15 +02:00			`new_text.gsub!(Tag::HASHTAG_RE) { |string| string.gsub(/[#_]/, '#' => '', '_' => ' ').gsub(/[a-z][A-Z]|[a-zA-Z][\d]/) { |s| s.insert(1, ' ') }.downcase }`
Improve language filter (#5724) * Scrub text of html before detecting language. * Detect language on statuses coming from activitypub. * Fix rubocop comments. * Remove custom emoji from text before language detection 2017-11-16 13:51:38 +01:00			`new_text.gsub!(/:#{CustomEmoji::SHORTCODE_RE_FRAGMENT}:/, '')`
			`new_text.gsub!(/\s+/, ' ')`
			`new_text`
			`end`

			`def new_scrubber`
			`scrubber = Rails::Html::PermitScrubber.new`
			`scrubber.tags = %w(br p)`
			`scrubber`
			`end`

			`def scrubber`
			`@scrubber ||= new_scrubber`
			`end`

			`def remove_html(text)`
			`text = Loofah.fragment(text).scrub!(scrubber).to_s`
			`text.gsub!('<br>', "\n")`
			`text.gsub!('</p><p>', "\n\n")`
			`text.gsub!(/(^<p>|<\/p>$)/, '')`
			`text`
[WIP] Html lang on statuses (#2297) * Add html lang attributes around statuses * Remove urls from language detection 2017-04-22 04:26:25 +02:00			`end`

Fix filterable_languages method of SettingsHelper (#4966) 2017-09-16 14:59:41 +02:00			`def default_locale(account)`
Fix language detection of non-latin alphabets even at few characters (#10276) 2019-03-15 05:07:09 +01:00			`account.user_locale&.to_sym || I18n.default_locale if account.local?`
Language detection refactor (#2099) * Extract detect_language to separate class * Use default locale, not just en * Add spec to confirm that whatlanguage cant identify empty string * Allow account locale to override default in language detector * PostStatusService supplies an account to detect language 2017-04-18 22:20:12 +02:00			`end`
			`end`
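
The hashtag handling in `simplify_text` (commit #11341 above) turns hashtags into ordinary words instead of discarding them. Below is a rough sketch of that one transformation in plain Ruby. `SIMPLE_HASHTAG_RE` is a simplified stand-in and is not Mastodon's `Tag::HASHTAG_RE`, and `hashtags_to_words` is a hypothetical helper name; the inner `gsub` chain is the one used in the class.

```ruby
# Simplified stand-in for Tag::HASHTAG_RE (assumption: '#' followed by word characters).
SIMPLE_HASHTAG_RE = /#\w+/

# Replace each hashtag with lowercase words:
#   1. remove '#' and turn '_' into a space
#   2. insert a space at lower-to-upper and letter-to-digit boundaries
#   3. downcase the result
def hashtags_to_words(text)
  text.gsub(SIMPLE_HASHTAG_RE) do |tag|
    tag.gsub(/[#_]/, '#' => '', '_' => ' ')
       .gsub(/[a-z][A-Z]|[a-zA-Z][\d]/) { |s| s.insert(1, ' ') }
       .downcase
  end
end

puts hashtags_to_words('Happy #FollowFriday everyone')
puts hashtags_to_words('#snake_case and #FollowFriday2019')
```

Splitting the tag this way lets its words count toward `WORDS_THRESHOLD` and gives the language identifier more natural text to work with.
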