...
Manually tested as follows:
1. Created a 5.0 sharded cluster.
2. Created an index ({_id: "hashed"}).
3. Ran sh.reshardCollection("data.random", {"_id": "hashed"}) with the above index, and took a snapshot after 1 min.
4. Restored the above snapshot to a target 5.0 cluster @ 10/22/21 - 12:24 AM. The restore failed.

reshardCollection

Enterprise [mongos] data> sh.reshardCollection("data.random", {"_id": "hashed"} )
{ ok: 1, '$clusterTime': { clusterTime: Timestamp({ t: 1634858341, i: 56 }), signature: { hash: Binary(Buffer.from("dab217339e8ecc2cd931ac36643be7522b9aa563", "hex"), 0), keyId: Long("7021608605052829705") } }, operationTime: Timestamp({ t: 1634858341, i: 52 }) }

Enterprise [mongos] data> db.random.getShardDistribution()
Shard atlas-b7bb9t-shard-1 at atlas-b7bb9t-shard-1/atlas-b7bb9t-shard-01-00.qzt8l.mmscloudteam.com:27017,atlas-b7bb9t-shard-01-01.qzt8l.mmscloudteam.com:27017,atlas-b7bb9t-shard-01-02.qzt8l.mmscloudteam.com:27017
{ data: '76.04MiB', docs: 1216, chunks: 3, 'estimated data per chunk': '25.34MiB', 'estimated docs per chunk': 405 }
---
Shard atlas-b7bb9t-shard-0 at atlas-b7bb9t-shard-0/atlas-b7bb9t-shard-00-00.qzt8l.mmscloudteam.com:27017,atlas-b7bb9t-shard-00-01.qzt8l.mmscloudteam.com:27017,atlas-b7bb9t-shard-00-02.qzt8l.mmscloudteam.com:27017
{ data: '83.73MiB', docs: 1339, chunks: 3, 'estimated data per chunk': '27.91MiB', 'estimated docs per chunk': 446 }
---
Totals
{ data: '159.77MiB', docs: 2555, chunks: 6, 'Shard atlas-b7bb9t-shard-1': [ '47.59 % data', '47.59 % docs in cluster', '64KiB avg obj size on shard' ], 'Shard atlas-b7bb9t-shard-0': [ '52.4 % data', '52.4 % docs in cluster', '64KiB avg obj size on shard' ] }

Mongo config server fassert error

{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21364, "ctx":"ReplCoord-8","msg":"Caught up to the latest optime known via heartbeats after becoming primary","attr":{"targetOpTime":{"ts":{"$timestamp":{"t":1634862438,"i":1}},"t":3},"myLastApplied":{"ts":{"$timestamp":{"t":1634862438,"i":1}},"t":3}}}
{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21363, "ctx":"ReplCoord-8","msg":"Exited primary catch-up mode"}
{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21107, "ctx":"ReplCoord-8","msg":"Stopping replication producer"}
{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21239, "ctx":"ReplBatcher","msg":"Oplog buffer has been drained","attr":{"term":4}}
{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21239, "ctx":"ReplBatcher","msg":"Oplog buffer has been drained","attr":{"term":4}}
{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21275, "ctx":"ReplCoordExtern-0","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1634862438,"i":1}},"t":3},"attemptsRemaining":1,"error":"SocketException: Error while getting the next batch in the oplog fetcher :: caused by :: recv failed while exhausting cursor :: caused by :: short read"}}
{"t":{"$date":"2021-10-22T00:27:19.458+00:00"},"s":"I", "c":"REPL", "id":21094, "ctx":"BackgroundSync","msg":"Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source"}
{"t":{"$date":"2021-10-22T00:33:39.876+00:00"},"s":"I", "c":"CONTROL", "id":20698, "ctx":"-","msg":"***** SERVER RESTARTED *****"}
{"t":{"$date":"2021-10-22T00:33:39.889+00:00"},"s":"I", "c":"NETWORK", "id":4913010, "ctx":"-","msg":"Certificate information","attr":{"subject":"CN=*.qzt8l.mmscloudteam.com","issuer":"CN=R3,O=Let's Encrypt,C=US","thumbprint":"854A2FC27CFD4C5098A4E6DA13264DCD62F38C83","notValidBefore":{"$date":"2021-10-21T19:07:17.000Z"},"notValidAfter":{"$date":"2022-01-19T19:07:16.000Z"},"keyFile":"/etc/pki/tls/private/mongod.pem","type":"Server"}}
{"t":{"$date":"2021-10-22T00:33:39.890+00:00"},"s":"I", "c":"NETWORK", "id":4915701, "ctx":"-","msg":"Initialized wire specification","attr":{"spec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":13},"incomingInternalClient":{"minWireVersion":0,"maxWireVersion":13},"outgoing":{"minWireVersion":0,"maxWireVersion":13},"isInternalClient":true}}}
{"t":{"$date":"2021-10-22T00:33:39.891+00:00"},"s":"W", "c":"ASIO", "id":22601, "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2021-10-22T00:33:39.891+00:00"},"s":"I", "c":"NETWORK", "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2021-10-22T00:33:39.892+00:00"},"s":"W", "c":"ASIO", "id":22601, "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2021-10-22T00:33:39.895+00:00"},"s":"W", "c":"ASIO", "id":22601, "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2021-10-22T00:33:39.896+00:00"},"s":"I", "c":"REPL", "id":5123008, "ctx":"main","msg":"Successfully registered PrimaryOnlyService","attr":{"service":"TenantMigrationDonorService","ns":"config.tenantMigrationDonors"}}
{"t":{"$date":"2021-10-22T00:33:39.896+00:00"},"s":"I", "c":"REPL", "id":5123008, "ctx":"main","msg":"Successfully registered PrimaryOnlyService","attr":{"service":"TenantMigrationRecipientService","ns":"config.tenantMigrationRecipients"}}
{"t":{"$date":"2021-10-22T00:33:39.897+00:00"},"s":"I", "c":"CONTROL", "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":29201,"port":33751,"dbPath":"/srv/mongodb/atlas-4wnh1p-config-0-node-2","architecture":"64-bit","host":"atlas-4wnh1p-config-00-02.qzt8l.mmscloudteam.com"}}
{"t":{"$date":"2021-10-22T00:33:39.897+00:00"},"s":"I", "c":"CONTROL", "id":23403, "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"5.0.3","gitVersion":"657fea5a61a74d7a79df7aff8e4bcf0bc742b748","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":["enterprise"],"allocator":"tcmalloc","environment":{"distmod":"rhel70","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2021-10-22T00:33:39.897+00:00"},"s":"I", "c":"CONTROL", "id":51765, "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"CentOS Linux release 7.9.2009 (Core)","version":"Kernel 3.10.0-1160.41.1.el7.x86_64"}}}
{"t":{"$date":"2021-10-22T00:33:39.897+00:00"},"s":"I", "c":"CONTROL", "id":21951, "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"config":"/srv/mongodb/atlas-4wnh1p-config-0-node-2/automation-mongod.conf","net":{"bindIp":"0.0.0.0","compression":{"compressors":"snappy,zstd,zlib"},"maxIncomingConnections":1500,"maxIncomingConnectionsOverride":["192.168.248.0/21"],"port":33751,"tls":{"CAFile":"/etc/pki/tls/certs/atlas-bundle.crt","allowConnectionsWithoutCertificates":true,"certificateKeyFile":"/etc/pki/tls/private/mongod.pem","clusterCAFile":"/var/lib/mongodb-mms-automation/atlas-cluster-managed.crt","disabledProtocols":"TLS1_0,TLS1_1","mode":"requireTLS"}},"processManagement":{"fork":true},"security":{"javascriptEnabled":true},"setParameter":{"allowRolesFromX509Certificates":"true","authenticationMechanisms":"SCRAM-SHA-1,MONGODB-AWS,MONGODB-X509","awsSTSUrl":"https://sts.us-west-1.amazonaws.com","disableLogicalSessionCacheRefresh":"true","honorSystemUmask":"false","mongotHost":"localhost:28000","recoverFromOplogAsStandalone":"true","reportOpWriteConcernCountersInServerStatus":"true","startupRecoveryForRestore":"true","suppressNoTLSPeerCertificateWarning":"true","takeUnstableCheckpointOnShutdown":"true","tlsCATrusts":"[{sha256: \"76D7D7A042F4AAC3BFCF548DD0BD9912B1370854AEEFC4C69D88532654F11633\", roles: [{ role: \"\", db: \"\"}]}]","tlsWithholdClientCertificate":"true","ttlMonitorEnabled":"false","watchdogPeriodSeconds":"60"},"storage":{"dbPath":"/srv/mongodb/atlas-4wnh1p-config-0-node-2","engine":"wiredTiger","wiredTiger":{"engineConfig":{"configString":"cache_size=512MB"}}},"systemLog":{"destination":"file","logAppend":true,"path":"/srv/mongodb/atlas-4wnh1p-config-0-node-2/mongodb.log"}}}}
{"t":{"$date":"2021-10-22T00:33:39.899+00:00"},"s":"I", "c":"NETWORK", "id":577164, "ctx":"initandlisten","msg":"OCSP fetch/staple started"}
{"t":{"$date":"2021-10-22T00:33:39.901+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"CurlConnPool-0","msg":"Connecting","attr":{"hostAndPort":"r3.o.lencr.org:80"}}
{"t":{"$date":"2021-10-22T00:33:39.903+00:00"},"s":"I", "c":"STORAGE", "id":22315, "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=387M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),builtin_extension_config=(zstd=(compression_level=6)),file_manager=(close_idle_time=600,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],cache_size=512MB"}}
{"t":{"$date":"2021-10-22T00:33:39.938+00:00"},"s":"I", "c":"NETWORK", "id":577165, "ctx":"OCSPManagerHTTP-0","msg":"OCSP fetch/staple completion"}
{"t":{"$date":"2021-10-22T00:33:39.938+00:00"},"s":"I", "c":"NETWORK", "id":577163, "ctx":"OCSPManagerHTTP-0","msg":"OCSP response","attr":{"status":{"code":0,"codeName":"OK"}}}
{"t":{"$date":"2021-10-22T00:33:39.941+00:00"},"s":"I", "c":"NETWORK", "id":577164, "ctx":"OCSP Fetch and Staple","msg":"OCSP fetch/staple started"}
{"t":{"$date":"2021-10-22T00:33:39.995+00:00"},"s":"I", "c":"NETWORK", "id":577165, "ctx":"OCSPManagerHTTP-0","msg":"OCSP fetch/staple completion"}
{"t":{"$date":"2021-10-22T00:33:39.995+00:00"},"s":"I", "c":"NETWORK", "id":577163, "ctx":"OCSPManagerHTTP-0","msg":"OCSP response","attr":{"status":{"code":0,"codeName":"OK"}}}
{"t":{"$date":"2021-10-22T00:33:40.565+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862820:565489][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 1 through 2"}}
{"t":{"$date":"2021-10-22T00:33:40.697+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862820:697230][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 2 through 2"}}
{"t":{"$date":"2021-10-22T00:33:40.821+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862820:821687][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 1/33920 to 2/256"}}
{"t":{"$date":"2021-10-22T00:33:40.961+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862820:961925][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 1 through 2"}}
{"t":{"$date":"2021-10-22T00:33:41.045+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862821:45955][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 2 through 2"}}
{"t":{"$date":"2021-10-22T00:33:41.129+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862821:129209][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (0, 0)"}}
{"t":{"$date":"2021-10-22T00:33:41.129+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862821:129297][29201:0x7fa2d269ebc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (0, 0)"}}
{"t":{"$date":"2021-10-22T00:33:41.131+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862821:131563][29201:0x7fa2d269ebc0], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 1, snapshot max: 1 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 7"}}
{"t":{"$date":"2021-10-22T00:33:41.140+00:00"},"s":"I", "c":"STORAGE", "id":4795906, "ctx":"initandlisten","msg":"WiredTiger opened","attr":{"durationMillis":1237}}
{"t":{"$date":"2021-10-22T00:33:41.140+00:00"},"s":"I", "c":"RECOVERY", "id":23987, "ctx":"initandlisten","msg":"WiredTiger recoveryTimestamp","attr":{"recoveryTimestamp":{"$timestamp":{"t":0,"i":0}}}}
{"t":{"$date":"2021-10-22T00:33:41.144+00:00"},"s":"I", "c":"STORAGE", "id":4366406, "ctx":"initandlisten","msg":"Modifying the table logging settings for all existing WiredTiger tables","attr":{"loggingEnabled":false}}
{"t":{"$date":"2021-10-22T00:33:41.154+00:00"},"s":"I", "c":"STORAGE", "id":22262, "ctx":"initandlisten","msg":"Timestamp monitor starting"}
{"t":{"$date":"2021-10-22T00:33:41.158+00:00"},"s":"W", "c":"CONTROL", "id":22120, "ctx":"initandlisten","msg":"Access control is not enabled for the database. Read and write access to data and configuration is unrestricted","tags":["startupWarnings"]}
{"t":{"$date":"2021-10-22T00:33:41.158+00:00"},"s":"I", "c":"CONTROL", "id":23408, "ctx":"initandlisten","msg":"Starting Watchdog Monitor"}
{"t":{"$date":"2021-10-22T00:33:41.166+00:00"},"s":"I", "c":"NETWORK", "id":4915702, "ctx":"initandlisten","msg":"Updated wire specification","attr":{"oldSpec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":13},"incomingInternalClient":{"minWireVersion":0,"maxWireVersion":13},"outgoing":{"minWireVersion":0,"maxWireVersion":13},"isInternalClient":true},"newSpec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":13},"incomingInternalClient":{"minWireVersion":13,"maxWireVersion":13},"outgoing":{"minWireVersion":13,"maxWireVersion":13},"isInternalClient":true}}}
{"t":{"$date":"2021-10-22T00:33:41.166+00:00"},"s":"I", "c":"STORAGE", "id":5071100, "ctx":"initandlisten","msg":"Clearing temp directory"}
{"t":{"$date":"2021-10-22T00:33:41.166+00:00"},"s":"I", "c":"STORAGE", "id":21005, "ctx":"initandlisten","msg":"Not restarting unfinished index builds because we are in standalone mode"}
{"t":{"$date":"2021-10-22T00:33:41.170+00:00"},"s":"I", "c":"CONTROL", "id":20536, "ctx":"initandlisten","msg":"Flow Control is enabled on this deployment"}
{"t":{"$date":"2021-10-22T00:33:41.172+00:00"},"s":"I", "c":"FTDC", "id":20625, "ctx":"initandlisten","msg":"Initializing full-time diagnostic data capture","attr":{"dataDirectory":"/srv/mongodb/atlas-4wnh1p-config-0-node-2/diagnostic.data"}}
{"t":{"$date":"2021-10-22T00:33:41.175+00:00"},"s":"W", "c":"REPL", "id":5576601, "ctx":"initandlisten","msg":"Replication startup parameter 'startupRecoveryForRestore' is set and recovering from an unstable checkpoint. Assuming this is a resume of an earlier attempt to recover for restore."}
{"t":{"$date":"2021-10-22T00:33:41.178+00:00"},"s":"I", "c":"REPL", "id":21557, "ctx":"initandlisten","msg":"Removing unapplied oplog entries after oplogTruncateAfterPoint","attr":{"oplogTruncateAfterPoint":{"":{"$timestamp":{"t":1634858228,"i":1}}}}}
{"t":{"$date":"2021-10-22T00:33:41.178+00:00"},"s":"F", "c":"-", "id":23095, "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":34418,"error":"NamespaceNotFound: Can't find local.oplog.rs","file":"src/mongo/db/repl/replication_recovery.cpp","line":764}}
{"t":{"$date":"2021-10-22T00:33:41.178+00:00"},"s":"F", "c":"-", "id":23096, "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}
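For anyone triaging the fassert above: startup recovery reads the truncate point from local.replset.oplogTruncateAfterPoint and then opens local.oplog.rs, which is what location 34418 reports missing. Both are easy to inspect from mongosh, assuming the restored data files are brought up on a scratch standalone without the restore parameters so the process stays alive (a diagnostic sketch, not part of the original repro):

// mongosh, connected to a scratch standalone running on the restored data files
const local = db.getSiblingDB("local");
local.getCollectionNames().includes("oplog.rs");  // false would match the NamespaceNotFound fassert
local.replset.oplogTruncateAfterPoint.findOne();  // the truncate point startup recovery tried to apply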
max.hirschhorn@10gen.com commented on Mon, 25 Oct 2021 14:51:29 +0000:
Assigned this to Sharding EMEA to reconcile the PrimaryOnlyService and ShardRegistry initialization order. Feel free to pass back to Sharding NYC if you don't feel like there's a general solution to take here.

chuck.zhang commented on Fri, 22 Oct 2021 16:42:54 +0000:
max.hirschhorn:
"I see the cause of the original fassert in the new logs that were uploaded, thanks. The resharding coordinator on the config server primary is unexpectedly getting ShardNotFound when attempting to contact the atlas-4wnh1p-shard-0 shard. chuck.zhang, could you please confirm/answer the following for me:
1. Is atlas-4wnh1p-shard-0 the destShardName of the restored cluster?
2. What are the contents of the config.shards collection on the config server? Is there an entry for the atlas-4wnh1p-shard-0 shard?"
1. Yes, atlas-4wnh1p-shard-0 is one of the destShardNames of the restored cluster.
2. I can't connect to the config server after the restore since it's not up:
Connecting to: mongodb+srv://cluster0.qzt8l.mmscloudteam.com/config
MongoNetworkError: connect ECONNREFUSED 13.56.37.245:27016

max.hirschhorn@10gen.com commented on Fri, 22 Oct 2021 16:29:56 +0000:
"Is it by any chance a problem of replication's step-up actions executing before that method has completed?"
Yes, I think the issue could be framed that way. Would we also face a similar issue in the Sharding DDL Coordinators? Or how do those primary-only service Instances handle ShardNotFound?

kaloian.manassiev commented on Fri, 22 Oct 2021 16:18:01 +0000:
It looks like we have waitForShardRegistryReload(), with which we block opening the connections until the ShardRegistry has performed a refresh. Is it by any chance a problem of replication's step-up actions executing before that method has completed?

max.hirschhorn@10gen.com commented on Fri, 22 Oct 2021 16:05:10 +0000:
It looks like the ShardRegistry may be in an unqueryable state because it uses read concern level "majority" internally. Assuming that is expected, kaloian.manassiev, tommaso.tocci, how should other sharding components wait for the first reload of the ShardRegistry to have succeeded?
atlas-4wnh1p-config-00-02.qzt8l.mmscloudteam.com_2021-10-22T00_00_00_2021-10-22T03_30_00_mongodb.log
{"t":{"$date":"2021-10-22T00:32:44.246+00:00"},"s":"I", "c":"SHARDING", "id":22727, "ctx":"shard-registry-reload","msg":"Error running periodic reload of shard registry","attr":{"error":"ReadConcernMajorityNotAvailableYet: could not get updated shard list from config server :: caused by :: Read concern majority reads are currently not possible.","shardRegistryReloadIntervalSeconds":1}}

max.hirschhorn@10gen.com commented on Fri, 22 Oct 2021 15:53:46 +0000:
I see the cause of the original fassert in the new logs that were uploaded, thanks. The resharding coordinator on the config server primary is unexpectedly getting ShardNotFound when attempting to contact the atlas-4wnh1p-shard-0 shard. chuck.zhang, could you please confirm/answer the following for me:
1. Is atlas-4wnh1p-shard-0 the destShardName of the restored cluster?
2. What are the contents of the config.shards collection on the config server? Is there an entry for the atlas-4wnh1p-shard-0 shard?
atlas-4wnh1p-config-00-02.qzt8l.mmscloudteam.com_2021-10-22T00_00_00_2021-10-22T03_30_00_mongodb.log
{"t":{"$date":"2021-10-22T00:32:44.683+00:00"},"s":"F", "c":"RESHARD", "id":5277000, "ctx":"ReshardingCoordinatorService-2","msg":"Unrecoverable error past the point resharding was guaranteed to succeed","attr":{"error":"ShardNotFound: Failed command { _shardsvrCommitReshardCollection: \"data.random\", reshardingUUID: UUID(\"04044ed9-81d2-4ea9-83cd-66a746fa780e\"), writeConcern: { w: \"majority\" } } for database 'admin' on shard 'atlas-4wnh1p-shard-0' :: caused by :: Could not find shard atlas-4wnh1p-shard-0"}}
{"t":{"$date":"2021-10-22T00:32:44.683+00:00"},"s":"F", "c":"-", "id":23089, "ctx":"ReshardingCoordinatorService-2","msg":"Fatal assertion","attr":{"msgid":5277000,"file":"src/mongo/db/s/resharding/resharding_coordinator_service.cpp","line":1286}}
{"t":{"$date":"2021-10-22T00:32:44.683+00:00"},"s":"F", "c":"-", "id":23090, "ctx":"ReshardingCoordinatorService-2","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2021-10-22T00:32:44.683+00:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"ReshardingCoordinatorService-2","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
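For reference, the config.shards question above can be answered directly from mongosh once a config server or mongos is reachable. A minimal check along these lines (the collection name config.shards is standard; the shard name is the one from this ticket):

// mongosh, connected to the config server or a mongos
const cfg = db.getSiblingDB("config");
cfg.shards.find();                                   // full shard membership as the restored cluster sees it
cfg.shards.findOne({ _id: "atlas-4wnh1p-shard-0" }); // null here would line up with the ShardNotFound error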
chuck.zhang commented on Fri, 22 Oct 2021 15:10:14 +0000:
max.hirschhorn, sorry for not being clear on the ask. I wasn't aware the mongod logs were missing. I've reattached the mongod logs, which cover the time range of the restore. The file atlas-4wnh1p-config-00-02.qzt8l.mmscloudteam.com_2021-10-22T00_00_00_2021-10-22T03_30_00_mongodb seems quite related to the root cause.

max.hirschhorn@10gen.com commented on Fri, 22 Oct 2021 14:44:43 +0000:
chuck.zhang, I didn't realize you were asking specifically about the restart attempt at 2021-10-22T00:33:24.672. Is there a reason mongodb.log doesn't cover that time range when logAppend: true is set in the automation configuration? It looks like there was an fassert() at that time too. Unfortunately the logs from mongod in the automation log are truncated, so I couldn't say whether the original fassert was location code 34418 or not. I suspect it probably was, though. Would it be possible to rerun the test/procedure with the maxLogSizeKB server parameter set to a larger value to avoid this log truncation? --setParameter maxLogSizeKB=20
automation-agent-verbose.log
{"t":{"$date":"2021-10-22T00:33:24.614+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862804:614497][30560:0x7f06b853abc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (0, 0)"}}
{"t":{"$date":"2021-10-22T00:33:24.614+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1634862804:614561][30560:0x7f06b853abc0], txn-recover: [WT_VERB_RECOVERY | WT_V...6, "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"} : exit status 14 (warning: log line attempted (13.9k) over max size (10.0k), printing beginning and end)
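As an aside, maxLogSizeKB can likely also be raised on an already-running process rather than only on the command line; this is a hedged sketch assuming the parameter is runtime-settable on this build (the value 20 follows the suggestion above):

// mongosh; runtime equivalent of --setParameter maxLogSizeKB=20
db.adminCommand({ setParameter: 1, maxLogSizeKB: 20 }); // assumes runtime setParameter is allowed for maxLogSizeKB
db.adminCommand({ getParameter: 1, maxLogSizeKB: 1 });  // confirm the value in effect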
chuck.zhang commented on Fri, 22 Oct 2021 14:28:40 +0000:
max.hirschhorn, thanks for looking. Quoting your earlier comment:
"Pulling out the error from the attached log file for ease of viewing and searchability. Is there some part of the restore procedure which would explain why the oplog collection doesn't exist on the config server node? At first glance I'm not sure how this is connected to taking a backup while there's a resharding operation."
mongodb.log
{"t":{"$date":"2021-10-22T02:47:12.044+00:00"},"s":"W", "c":"REPL", "id":5576601, "ctx":"initandlisten","msg":"Replication startup parameter 'startupRecoveryForRestore' is set and recovering from an unstable checkpoint. Assuming this is a resume of an earlier attempt to recover for restore."}
{"t":{"$date":"2021-10-22T02:47:12.047+00:00"},"s":"I", "c":"REPL", "id":21557, "ctx":"initandlisten","msg":"Removing unapplied oplog entries after oplogTruncateAfterPoint","attr":{"oplogTruncateAfterPoint":{"":{"$timestamp":{"t":1634858228,"i":1}}}}}
{"t":{"$date":"2021-10-22T02:47:12.047+00:00"},"s":"F", "c":"-", "id":23095, "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":34418,"error":"NamespaceNotFound: Can't find local.oplog.rs","file":"src/mongo/db/repl/replication_recovery.cpp","line":764}}
{"t":{"$date":"2021-10-22T02:47:12.047+00:00"},"s":"F", "c":"-", "id":23096, "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}
The logs above are actually from an hour and a half after the backup restore was initiated. It seems mongod took two hours to print its first log lines, and they only span about a minute. Do you see anything suspicious in the log below that is causing mongod to fail to restart?
[2021-10-22T00:33:24.672+0000] [.error] [src/action/start.go:func1:93] [103] [00:33:24.672] Error starting mongod : [00:33:24.672] Error running start command.
cmd=[Args=[/var/lib/mongodb-mms-automation/mongodb-linux-x86_64-5.0.3-ent/bin/mongod -f /srv/mongodb/atlas-4wnh1p-config-0-node-1/automation-mongod.conf]], stip=[args={"net":{"bindIp":"0.0.0.0","compression":{"compressors":"snappy,zstd,zlib"},"maxIncomingConnections":1500,"maxIncomingConnectionsOverride":["192.168.248.0/21"],"port":41220,"tls":{"CAFile":"/etc/pki/tls/certs/atlas-bundle.crt","allowConnectionsWithoutCertificates":true,"certificateKeyFile":"/etc/pki/tls/private/mongod.pem","clusterCAFile":"/var/lib/mongodb-mms-automation/atlas-cluster-managed.crt","disabledProtocols":"TLS1_0,TLS1_1","mode":"requireTLS"}},"security":{"javascriptEnabled":true},"setParameter":{"allowRolesFromX509Certificates":"true","authenticationMechanisms":"SCRAM-SHA-1,MONGODB-AWS,MONGODB-X509","awsSTSUrl":"https://sts.us-west-1.amazonaws.com","disableLogicalSessionCacheRefresh":"true","honorSystemUmask":"false","mongotHost":"localhost:28000","recoverFromOplogAsStandalone":"true","reportOpWriteConcernCountersInServerStatus":"true","startupRecoveryForRestore":"true","suppressNoTLSPeerCertificateWarning":"true","takeUnstableCheckpointOnShutdown":"true","tlsCATrusts":"[{sha256: \"76D7D7A042F4AAC3BFCF548DD0BD9912B1370854AEEFC4C69D88532654F11633\", roles: [{ role: \"\", db: \"\"}]}]","tlsWithholdClientCertificate":"true","ttlMonitorEnabled":"false","watchdogPeriodSeconds":"60"},"storage":{"dbPath":"/srv/mongodb/atlas-4wnh1p-config-0-node-1","wiredTiger":{"engineConfig":{"configString":"cache_size=512MB"}}},"systemLog":{"destination":"file","logAppend":true,"path":"/srv/mongodb/atlas-4wnh1p-config-0-node-1/mongodb.log"}}[],confPath=/srv/mongodb/atlas-4wnh1p-config-0-node-1/automation-mongod.conf,version=5.0.3-ent-657fea5a61a74d7a79df7aff8e4bcf0bc742b748(enterprise),isKmipRotateMasterKey=false,useOldConfFile=false]
- Mongo Logs:
.qzt8l.mmscloudteam.com"}}
{"t":{"$date":"2021-10-22T00:33:23.470+00:00"},"s":"I", "c":"CONTROL", "id":23403, "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"5.0.3","gitVersion":"657fea5a61a74d7a79df7aff8e4bcf0bc742b748","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":["enterprise"],"allocator":"tcmalloc","environment":{"distmod":"rhel70","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2021-10-22T00:33:23.470+00:00"},"s":"I", "c":"CONTROL", "id":51765, "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"CentOS Linux release 7.9.2009 (Core)","version":"Kernel 3.10.0-1160.41.1.el7.x86_64"}}}
line","attr":{"options":{"config":"/srv/mongodb/atlas-4wnh1p-config-0-node-1/automation-mongod.conf","net":{"bindIp":"0.0.0.0","compression":{"compressors":"snappy,zstd,zlib"},"maxIncomingConnections":1500,"maxIncomingConnectionsOverride":["192.168.248.0/21"],"port":41220,"tls":{"CAFile":"/etc/pki/tls/certs/atlas-bundle.crt","allowConnectionsWithoutCertificates":true,"certificateKeyFile":"/etc/pki/tls/private/mongod.pem","clusterCAFile":"/var/lib/mongodb-mms-automation/atlas-cluster-managed.crt","disabledProtocols":"TLS1_0,TLS1_1","mode":"requireTLS"}},"processManagement":{"fork":true},"security":{"javascriptEnabled":true},"setParameter":{"allowRolesFromX509Certificates":"true","authenticationMechanisms":"SCRAM-SHA-1,MONGODB-AWS,MONGODB-X509","awsSTSUrl":"https://sts.us-west-1.amazonaws.com","disableLogicalSessionCacheRefresh":"true","honorSystemUmask":"false","mongotHost":"localhost:28000","recoverFromOplogAsStandalone":"true","reportOpWriteConcernCountersInServerStatus":"true","startupRecoveryForRestore":"true","suppressNoTLSPeerCertificateWarning":"true","takeUnstableCheckpointOnShutdown":"true","tlsCATrusts":"[{sha256: \"76D7D7A042F4AAC3BFCF548DD0BD9912B1370854AEEFC4C69D88532654F11633\", roles: [{ role: \"\", db: \"\"}]}]","tlsWithholdClientCertificate":"true","ttlMonitorEnabled":"false","watchdogPeriodSeconds":"60"},"storage":{"dbPath":"/srv/mongodb/atlas-4wnh1p-config-0-node-1","engine":"wiredTiger","wiredTiger":{"engineConfig":{"configString":"cache_size=512MB"}}},"systemLog":{"destination":"file","logAppend":true,"path":"/srv/mongodb/atlas-4wnh1p-config-0-node-1/mongodb.log"}}}} It looks like we've seen this issue on non-5.0 MongoDB versions as well. Notes from server team: "Seems like it ignored the catalog file you restored and made a new one. Are you sure all the files were restored properly before starting mongod?" https://jira.mongodb.org/browse/CLOUDP-78834?focusedCommentId=3909062&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3909062 As commented in the above link, we haven't seen any issues causing by this so far since it's an existing behavior before reshard collection was introduced. max.hirschhorn@10gen.com commented on Fri, 22 Oct 2021 11:59:41 +0000: Pulling out the error from the attached log file for ease of viewing and searchability. Is there some part of the restore procedure which would explain why the oplog collection doesn't exist on the config server node? At first glance I'm not sure how this is connected to taking a backup while there's a resharding operation. mongodb.log Unable to find source-code formatter for language: text. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml {"t":{"$date":"2021-10-22T02:47:12.044+00:00"},"s":"W", "c":"REPL", "id":5576601, "ctx":"initandlisten","msg":"Replication startup parameter 'startupRecoveryForRestore' is set and recovering from an unstable checkpoint. 
max.hirschhorn@10gen.com commented on Fri, 22 Oct 2021 11:59:41 +0000:
Pulling out the error from the attached log file for ease of viewing and searchability. Is there some part of the restore procedure which would explain why the oplog collection doesn't exist on the config server node? At first glance I'm not sure how this is connected to taking a backup while there's a resharding operation.
mongodb.log
{"t":{"$date":"2021-10-22T02:47:12.044+00:00"},"s":"W", "c":"REPL", "id":5576601, "ctx":"initandlisten","msg":"Replication startup parameter 'startupRecoveryForRestore' is set and recovering from an unstable checkpoint. Assuming this is a resume of an earlier attempt to recover for restore."}
{"t":{"$date":"2021-10-22T02:47:12.047+00:00"},"s":"I", "c":"REPL", "id":21557, "ctx":"initandlisten","msg":"Removing unapplied oplog entries after oplogTruncateAfterPoint","attr":{"oplogTruncateAfterPoint":{"":{"$timestamp":{"t":1634858228,"i":1}}}}}
{"t":{"$date":"2021-10-22T02:47:12.047+00:00"},"s":"F", "c":"-", "id":23095, "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":34418,"error":"NamespaceNotFound: Can't find local.oplog.rs","file":"src/mongo/db/repl/replication_recovery.cpp","line":764}}
{"t":{"$date":"2021-10-22T02:47:12.047+00:00"},"s":"F", "c":"-", "id":23096, "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}
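Since the trigger for this ticket is cutting a snapshot while a resharding operation is in flight, it can help to check the coordinator's durable state on the config server before taking a backup. A diagnostic sketch, assuming the 5.0 layout where coordinator documents live in config.reshardingOperations and resharded collections carry reshardingFields in config.collections:

// mongosh, against the config server: is a resharding operation in progress?
const cfg = db.getSiblingDB("config");
cfg.reshardingOperations.find();                               // coordinator state documents, if any
cfg.collections.find({ reshardingFields: { $exists: true } }); // collections carrying resharding metadata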