...
When I query a time-series collection with the collation of an index, but not the collection, the query fails. For example:

> db.createCollection('ts', {timeseries: {timeField: 't', metaField: 'm'}, collation: {locale: 'en'}})
{ "ok" : 1 }
> db.ts.createIndex({m: 1, t: 1}, {collation: {locale: 'simple'}})
{ "numIndexesBefore" : 0, "numIndexesAfter" : 1, "createdCollectionAutomatically" : false, "ok" : 1 }
> db.ts.insert({t: new Date(), m: 1})
WriteResult({ "nInserted" : 1 })
> db.ts.find().collation({locale: 'en'})
{ "t" : ISODate("2021-03-30T18:33:03.090Z"), "m" : 1, "_id" : ObjectId("60636edfd8904012f6ce383d") }
> db.ts.find().collation({locale: 'simple'})
Error: error: { "ok" : 0, "errmsg" : "Cannot override a view's default collation", "code" : 167, "codeName" : "OptionNotSupportedOnView" }

I think we should allow queries on time-series views to use a user-provided collation if one is supported by an index.
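A rough sketch of the rule proposed above, written in plain JavaScript rather than server code. The names `sameCollation` and `isCollationAllowed` and the shape of the `indexes` array are hypothetical, and collations are compared by locale only, which is an assumption made for brevity (real collation specs carry more fields).

```javascript
// Hypothetical sketch of the rule proposed in this ticket. Not server code.

// Compare two collation specs by locale only (an assumption made for brevity).
function sameCollation(a, b) {
  return a.locale === b.locale;
}

// Accept the collection default, or any collation that some index was built with.
function isCollationAllowed(queryCollation, defaultCollation, indexes) {
  if (sameCollation(queryCollation, defaultCollation)) {
    return true;
  }
  return indexes.some((ix) => sameCollation(queryCollation, ix.collation));
}

// Mirrors the reproduction: default collation 'en', one index built with 'simple'.
const indexes = [{ key: { m: 1, t: 1 }, collation: { locale: 'simple' } }];
const defaultCollation = { locale: 'en' };

const allowedDefault = isCollationAllowed({ locale: 'en' }, defaultCollation, indexes);
const allowedIndex = isCollationAllowed({ locale: 'simple' }, defaultCollation, indexes);
const allowedOther = isCollationAllowed({ locale: 'fr' }, defaultCollation, indexes);
```

Under this rule the `simple` query from the reproduction would be accepted, while a collation that matches neither the default nor any index would still be rejected.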
katherine.wu commented on Thu, 29 Apr 2021 18:06:18 +0000:
Closing this as a duplicate of SERVER-54597 after agreeing that we will allow users to query time-series collections with any collation, not just the default or simple collation.

katherine.wu commented on Thu, 1 Apr 2021 13:58:43 +0000:
I agree that it makes sense for queries on time-series collections to be able to specify a collation, especially if there is an index with that collation. Regarding the interaction of a user-provided collation with query rewrites: currently there is an optimization that attempts to map user predicates to predicates on the bucket's control.min and control.max fields. Since the buckets' control.min and control.max fields are chosen based on the underlying collection's collation, once we allow queries with a user-provided collation we'd need to make sure that we don't perform this optimization when the user-provided collation differs from the collection default and the predicate is of a type that is affected by collation (i.e. string or compound type). I filed SERVER-54597 to track similar work.

charlie.swanson commented on Thu, 1 Apr 2021 13:38:37 +0000:
louis.williams let me try to explain a little more. A time-series collection's bucketing and unpacking generally should be transparent to the user; this I understand. There are at least two cases where this should be carefully examined with respect to collation though, one I mentioned above and the other I just thought of:
1. (above) When we track each bucket's min and max value for each measurement, we are making comparisons, potentially between strings. Should these comparisons respect the collection's default collation?
2. When we are deciding to bucket measurements together, we are again making comparisons on values which are potentially strings. Should these comparisons respect the collection's default collation?
I can see an argument for or against each use case.
If you generally expect every user of the collection to approach with the same collation (for example, it should always be case insensitive or it is always French speakers), you would want the bucketing, min and max to all use that collation for predictable and understandable semantics. Our optimizations would all be able to work on strings if the min and max use the same collation that you use at query time (the default).

However, if you expect people to query the collection with different collations, then using the simple collation for all time-series internal comparisons makes more sense. You would allow any collation to be used to query the collection. The bucketing may miss some cases where a user expected them to be collapsed (maybe they inserted meta: "US" and meta: "us" interchangeably), but it would generally all still work. Here our optimizations utilizing min and max would have to be more defensive and likely wouldn't be able to optimize predicates against strings.

Is that making more sense? I have drifted beyond the original purpose of this ticket for sure. I apologize for that. The description/summary made me think of these edge cases. For this ticket it seems like we should be able to relax the assertion that you can't query the time-series collection with a different collation, but we may need to re-examine some of our query rewrites. I know something like this came up during development though, so I'm hopeful that Katherine or Hana can comment on how well we are prepared for mixed collations here. cc jacob.evans and david.storch who I think were involved in a code review about the subject also.

louis.williams commented on Thu, 1 Apr 2021 13:22:07 +0000:
charlie.swanson, Time-series collection creation is special in that we don't give users control over both the buckets collection collation and the view's collation. And as much as possible, we are trying to have time-series collections (i.e. views) behave like regular collections.
As of SERVER-55591, the collation passed when creating a time-series collection is stored as both the collection default and the view definition. There should never be a case where the time-series view collation differs from the buckets collection collation. In that ticket, writes use the collection's collation for bucketing, and reads use the view's collation (even though they are the same). I can see another option where maybe the time-series view definition doesn't store a collation at all? What would that mean for queries?

charlie.swanson commented on Wed, 31 Mar 2021 22:06:04 +0000:
There's some interesting related discussion in SERVER-27762 that I'd recommend reading. In many cases with views we are reluctant to allow mixing collations because it's plausible that the author of the view didn't want you changing the definition of the view in that way. For some views it wouldn't make any impact, but you could imagine a view which attempts to redact sensitive information which does something like "If username == secret then hide or obscure another secret thing", and we wouldn't want a user with permission on the view to affect the collation of that comparison. For that problem, something like SERVER-25954, which would allow mixing two different collations, is probably the best solution. That may seem tangential to this use case, but it is related in two ways:
1. If we went the route of SERVER-25954 to allow querying views with a different collation, then for louis.williams's example index, it would still mean that the prefix of the pipeline is using the collection's default collation and thus the index with a different collation wouldn't be applicable.
2. The collection-default collation probably should have some impact on the meaning of the time-series view definition. I'm thinking of the control.min field where we might do string comparisons. Are those currently respecting the collection-default collation?
For the specific time-series use case it is more tempting to allow overriding the default collation when querying it like a view because the user didn't create a view here. I can sympathize with an argument that a query on a time-series collection with a different collation should use that given collation everywhere. But I would still be worried about problem #2 above: some of our query rewrites wouldn't work because the collations are different, and I'm not sure how good our analysis is for that case. hana.pearlman or katherine.wu, could you comment on that aspect?

louis.williams commented on Wed, 31 Mar 2021 21:04:34 +0000:
Remove the TODO added by SERVER-55591
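
The guard katherine.wu describes for the control.min/control.max rewrite could look roughly like the following. This is a sketch in plain JavaScript, not server code: `isCollationSensitive` and `canUseMinMaxRewrite` are hypothetical names, collations are compared by locale only, and every object or array operand is conservatively treated as collation-sensitive because it may contain strings.

```javascript
// Hypothetical sketch of the guard discussed in the comments. Not server code.

// Strings are affected by collation. Objects and arrays may contain strings,
// so this sketch conservatively treats them as affected too. Dates are not.
function isCollationSensitive(value) {
  if (typeof value === 'string') return true;
  if (value instanceof Date) return false;
  return value !== null && typeof value === 'object';
}

// The bucket min/max were computed under the collection default collation, so
// the rewrite is only safe if the collations match or the operand ignores collation.
function canUseMinMaxRewrite(queryCollation, defaultCollation, operand) {
  const collationsMatch = queryCollation.locale === defaultCollation.locale;
  return collationsMatch || !isCollationSensitive(operand);
}

const defaultCollation = { locale: 'en' };
const simpleCollation = { locale: 'simple' };

const numericMismatch = canUseMinMaxRewrite(simpleCollation, defaultCollation, 42);
const stringMismatch = canUseMinMaxRewrite(simpleCollation, defaultCollation, 'US');
const stringMatch = canUseMinMaxRewrite(defaultCollation, defaultCollation, 'US');
const nestedMismatch = canUseMinMaxRewrite(simpleCollation, defaultCollation, { country: 'US' });
```

With a mismatched collation, only the numeric predicate keeps the rewrite; the string and nested-object predicates fall back to the unoptimized path.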