BugZero | MongoDB BugID 399222 - stemming behavior for diacritics causes incorrect ...

MongoDB - Defect ID: 399222

stemming behavior for diacritics causes incorrect results

Last updated on 10/27/2023

Overall: 6.1

Severity: 6.4

Community: 7.4

Lifecycle: 9.1

Vendor details

Priority: Major - P3
Status: Closed

Info

$text search is not diacritic-insensitive when a word contains a dieresis ( ¨ ). The dieresis is categorized as a diacritic in the Unicode 8.0 Character Database property list (see http://www.unicode.org/Public/8.0.0/ucd/PropList.txt). A search that uses a collation with strength = 1 works as expected.
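The two claims in this description can be checked outside MongoDB. The following is a minimal sketch in plain JavaScript (Node.js), not MongoDB code: it assumes that the `\p{Diacritic}` property escape reflects the same Unicode property the report cites, and that `Intl.Collator` with `sensitivity: "base"` is a reasonable stand-in for a strength 1 collation. The names `isDiacritic`, `primary`, and `matches` are illustrative only.

```javascript
// Does a single character carry the Unicode "Diacritic" binary property?
const isDiacritic = (ch) => /^\p{Diacritic}$/u.test(ch);

// "base" sensitivity compares base letters only, ignoring accents and case.
// This is used here as an analogue of a collation with strength 1.
const primary = new Intl.Collator("en", { sensitivity: "base" });

const names = ["iphone", "iphône", "iphonë", "iphônë", "ipad"];
const matches = names.filter((name) => primary.compare(name, "iphone") === 0);

// U+00A8 is the spacing dieresis, U+0308 the combining dieresis.
console.log(isDiacritic("\u00A8"), isDiacritic("\u0308"));
console.log(matches);
```

Under these assumptions, all four "iphone" spellings compare equal at the base level, which mirrors the collation query in the reproduction steps.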

Top User Comments

kyle.suarez commented on Thu, 9 Nov 2017 21:49:04 +0000:

I've taken another look at the issue here and thoroughly examined what happens with regard to stemming and diacritic stripping. As Dan mentions, the stemmer must be diacritic-sensitive because diacritics affect stemming, even in languages like English. For example, in English:

"resume" is stemmed to "resum", as you'd expect. Its conjugated forms "resumed", "resuming", etc. all have the same stem.
"résumé" is stemmed simply to itself, as it is a noun and has no simpler form.

The current text search engine is written in a way that errs on the side of "correctness". That being said, I am definitely sympathetic to the argument that "résumé" is commonly spelled as "resume" in everyday speech. However, changing the way text search works with regard to stemming and diacritic stripping will require a much larger project and detailed design.

Based on this assessment, I'm going to close this ticket as Works as Designed. For now, truly diacritic-insensitive queries should use either collation or a text index language of "none" (but will lose out on the benefit of stemming).

kyle.suarez commented on Thu, 10 Aug 2017 14:19:46 +0000:

It still seems like something is off here, though, like there is an inconsistent approach to the way we perform diacritic stripping and stemming. In any case, I've stopped investigating this ticket as the Query Team's priority is on 3.6 scheduled features. It does seem worth it, though, for someone to investigate this behavior further once we revisit the tickets on the backlog.

dan@10gen.com commented on Thu, 3 Aug 2017 17:59:37 +0000:

The stemmer is diacritic sensitive and it must be because accents have meaning in some languages. See this comment: https://github.com/mongodb/mongo/blob/r3.5.10/src/mongo/db/fts/fts_unicode_tokenizer.cpp#L96

kyle.suarez commented on Tue, 25 Jul 2017 19:51:26 +0000:

Good point... I tried

    std::cout << "Stemmed version of iphone: " << s.stem("iphone") << std::endl;
    std::cout << "Stemmed version of iphoné: " << s.stem("iphoné") << std::endl;
    std::cout << "Stemmed version of iphonë: " << s.stem("iphonë") << std::endl;

and got

    Stemmed version of iphone: iphon
    Stemmed version of iphoné: iphoné
    Stemmed version of iphonë: iphonë

I'm putting this ticket into Needs Triage, so that the query team can triage this ticket at the next planning meeting. Whoever picks up this ticket should look at the places where stemming happens and see if we can strip diacritics before it occurs.

thomas.schubert commented on Tue, 25 Jul 2017 19:46:36 +0000:

I'd argue the problem isn't that we stem "iphone" to "iphon" (the stem doesn't have to be a valid root), but that we don't stem "iphoné" to "iphon". If our stemming isn't diacritic insensitive, our queries can't be. Can we change "iphoné" to "iphone" before it reaches the stemmer so it generates the same root?

kyle.suarez commented on Tue, 25 Jul 2017 19:18:58 +0000:

After some investigation, I've found that the problem is not a diacritic problem, but a stemming problem. In your text search, the default language is English. Unfortunately, our vendored third-party stemming library, libstemmer.c, stems the word "iphone" into "iphon" when in English mode. Thus, it cuts off the "e" completely and is not included in the search. When changing the language to "none", stemming does not occur, and I find the results as usual.

    > db.text.find()
    { "_id" : ObjectId("59778fac798c05e256b74092"), "t" : "iphone" }
    { "_id" : ObjectId("59778faf798c05e256b74093"), "t" : "iphoné" }
    { "_id" : ObjectId("59778fb2798c05e256b74094"), "t" : "iphonë" }
    > db.text.find({$text: {$search: "iphone"}})
    { "_id" : ObjectId("59778fac798c05e256b74092"), "t" : "iphone" }
    > db.text.find({$text: {$search: "iphone", $language: "none"}})
    { "_id" : ObjectId("59778ef3798c05e256b74086"), "t" : "iphonë" }
    { "_id" : ObjectId("59778faf798c05e256b74093"), "t" : "iphoné" }
    { "_id" : ObjectId("59778fb2798c05e256b74094"), "t" : "iphonë" }

ian@10gen.com commented on Fri, 14 Jul 2017 14:43:28 +0000:

Reminder: kyle.suarez please review to see if you can find the underlying cause.

thomas.schubert commented on Thu, 29 Jun 2017 15:22:57 +0000:

Hi felix2626,

Thank you for reporting this issue. I've marked this ticket to be scheduled against currently planned work. Please continue to watch this ticket for updates.

Kind regards,
Thomas
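The proposal in the comments above (strip diacritics before stemming) and the "résumé" objection to it can be illustrated with a small JavaScript sketch. It does not call libstemmer: `REPORTED_STEMS` is a hypothetical lookup table holding only the two non-trivial stems quoted in the thread, and every other word is returned unchanged. `stripDiacritics`, `stemOnly`, and `stripThenStem` are likewise illustrative names, not MongoDB functions.

```javascript
// Stand-in for the real stemmer: only the stems quoted in the comments.
// Any word not listed here is returned unchanged.
const REPORTED_STEMS = new Map([
  ["iphone", "iphon"],
  ["resume", "resum"],
]);
const stem = (word) => REPORTED_STEMS.get(word) ?? word;

// Decompose to NFD, then drop combining marks (é -> e, ë -> e).
const stripDiacritics = (word) =>
  word.normalize("NFD").replace(/\p{M}/gu, "");

// Behavior reported in the thread: the stemmer sees the accented word.
const stemOnly = (word) => stem(word);

// Proposed alternative: fold accents first, then stem.
const stripThenStem = (word) => stem(stripDiacritics(word));

for (const word of ["iphone", "iphoné", "iphonë", "resume", "résumé"]) {
  console.log(word, "->", stemOnly(word), "|", stripThenStem(word));
}
```

In this sketch, folding first gives all three "iphone" spellings the same stem, but it also gives "résumé" the stem of the verb "resume", which is the loss of distinction the closing comment describes.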

Steps to Reproduce

> db.test.insertMany([
    { "_id":1, "name":"iphone" },
    { "_id":2, "name":"iphône" },
    { "_id":3, "name":"iphonë" },
    { "_id":4, "name":"iphônë" }
  ])
> db.test.ensureIndex({name: "text"})
> db.test.find({$text: {$search: "iphone"}})
{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphône" }
> db.test.find({name: "iphone"}).collation({locale: "en", strength: 1})
{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphône" }
{ "_id" : 3, "name" : "iphonë" }
{ "_id" : 4, "name" : "iphônë" }
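One way to read the $text result above is as an ordering effect. The sketch below is a toy in-memory model in JavaScript, not MongoDB's implementation. It rests on two assumptions: that terms are stemmed first and stripped of diacritics afterwards, and that the same language setting applies to both the indexed documents and the query (as with a text index whose language is "none"). `toyStem` is a hypothetical stemmer that only drops a trailing ASCII "e".

```javascript
// Hypothetical toy stemmer: drops one trailing ASCII "e" and nothing else.
// "ë" is a different character, so words ending in it are left untouched.
const toyStem = (word) => (word.endsWith("e") ? word.slice(0, -1) : word);

// Decompose to NFD, then drop combining marks.
const stripDiacritics = (word) =>
  word.normalize("NFD").replace(/\p{M}/gu, "");

// Assumed pipeline: stem first (skipped for language "none"), then strip.
const toTerm = (word, language) =>
  stripDiacritics(language === "none" ? word : toyStem(word));

const docs = [
  { _id: 1, name: "iphone" },
  { _id: 2, name: "iphône" },
  { _id: 3, name: "iphonë" },
  { _id: 4, name: "iphônë" },
];

// Return the ids whose indexed term equals the query term.
const search = (query, language) =>
  docs
    .filter((doc) => toTerm(doc.name, language) === toTerm(query, language))
    .map((doc) => doc._id);

console.log(search("iphone", "english"));
console.log(search("iphone", "none"));
```

Under these assumptions the model returns ids 1 and 2 for the stemmed case, the same two documents the $text query finds, and all four ids when stemming is skipped.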

