QKC cluster errors stop the cluster instance on Docker

How much RAM do you have on your cluster server?

8 GB of the 12 GB RAM is available; only about 4 GB is used in total.
4-core CPU, load around 2.5 to 3.5.

Since the last recovery, where I downloaded the latest data and restarted the Docker image with it, the cluster has now been running without errors for almost 36 hours.

I had to stop the cluster again as rewards weren’t coming in. Restarting produced these error lines:

SLAVE_S2: E0620 12:49:51.347818 protocol.py:168] Traceback (most recent call last):

SLAVE_S2: File “/code/pyquarkchain/quarkchain/protocol.py”, line 166, in __internal_handle_metadata_and_raw_data

SLAVE_S2: await self.handle_metadata_and_raw_data(metadata, raw_data)

SLAVE_S2: File “/code/pyquarkchain/quarkchain/protocol.py”, line 152, in handle_metadata_and_raw_data

SLAVE_S2: await self.__handle_rpc_request(op, cmd, rpc_id, metadata)

SLAVE_S2: File “/code/pyquarkchain/quarkchain/protocol.py”, line 127, in __handle_rpc_request

SLAVE_S2: resp = await handler(self, request)

SLAVE_S2: File “slave.py”, line 749, in handle_add_xshard_tx_list_request

SLAVE_S2: req.minor_block_hash, req.tx_list

SLAVE_S2: File “/code/pyquarkchain/quarkchain/cluster/shard_state.py”, line 1325, in add_cross_shard_tx_list_by_minor_block_hash

SLAVE_S2: self.db.put_minor_block_xshard_tx_list(h, tx_list)

SLAVE_S2: File “/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py”, line 478, in put_minor_block_xshard_tx_list

SLAVE_S2: self.db.put(b"xShard_" + h, tx_list.serialize())

SLAVE_S2: File “/code/pyquarkchain/quarkchain/db.py”, line 91, in put

SLAVE_S2: return self._db.put(key, value)

SLAVE_S2: File “rocksdb/_rocksdb.pyx”, line 1472, in rocksdb._rocksdb.DB.put

SLAVE_S2: File “rocksdb/_rocksdb.pyx”, line 75, in rocksdb._rocksdb.check_status

SLAVE_S2: rocksdb.errors.Corruption: b’Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349’

SLAVE_S2:

SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b’S0’

SLAVE_S0: E0620 12:49:51.352953 miner.py:339] Traceback (most recent call last):

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/miner.py”, line 333, in submit_work

SLAVE_S0: await self.add_block_async_func(block)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/shard.py”, line 544, in __add_block

SLAVE_S0: await self.handle_new_block(block)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/shard.py”, line 708, in handle_new_block

SLAVE_S0: await self.add_block(block)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/shard.py”, line 770, in add_block

SLAVE_S0: await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)

SLAVE_S0: File “slave.py”, line 1105, in broadcast_xshard_tx_list

SLAVE_S0: responses = await asyncio.gather(*rpc_futures)

SLAVE_S0: RuntimeError: S0<->S2: connection abort

and further on:

MASTER: E0620 12:50:11.319340 master.py:120] Traceback (most recent call last):

MASTER: File “master.py”, line 118, in sync

MASTER: await self.__run_sync()

MASTER: File “master.py”, line 262, in __run_sync

MASTER: await self.__add_block(block)

MASTER: File “master.py”, line 286, in __add_block

MASTER: await self.master_server.add_root_block(root_block)

MASTER: File “master.py”, line 1280, in add_root_block

MASTER: result_list = await asyncio.gather(*future_list)

MASTER: RuntimeError: master_slave_b’S2’: connection abort

MASTER:

MASTER: I0620 12:50:11.320236 simple_network.py:177] Closing peer 5ec483ebffad228d24f62423c984153d801a7ce8bfdba5019ea721d054ead893 with the following reason: master_slave_b’S2’: connection abort

MASTER: I0620 12:50:11.321521 master.py:423] [R] done sync task 628800 1015aeb71bf08dae59fc2412b398f7b3315a1de296fc01bb74eadcf85322b9fc

MASTER: I0620 12:50:11.322227 master.py:428] [R] synchronizer finished!

MASTER: E0620 12:50:11.326245 p2p_manager.py:146] Unknown exception from <Node(0x5ec4@116.203.214.61)>, message: OperationCancelled(‘Cancellation requested by QuarkServer:p2pserver token’)

MASTER: E0620 12:50:11.333310 p2p_manager.py:147] Traceback (most recent call last):

MASTER: File “/code/pyquarkchain/quarkchain/p2p/p2p_manager.py”, line 131, in _run

MASTER: metadata, raw_data = await self.secure_peer.read_metadata_and_raw_data()

MASTER: File “/code/pyquarkchain/quarkchain/p2p/p2p_manager.py”, line 292, in read_metadata_and_raw_data

MASTER: data = await self.quark_peer.read_raw_bytes(timeout=None)

MASTER: File “/code/pyquarkchain/quarkchain/p2p/p2p_manager.py”, line 115, in read_raw_bytes

MASTER: frame_data = await self.read(frame_size + MAC_LEN, timeout=timeout)

MASTER: File “/code/pyquarkchain/quarkchain/p2p/peer.py”, line 344, in read

MASTER: return await self.wait(self.reader.readexactly(n), timeout=timeout)

MASTER: File “/code/pyquarkchain/quarkchain/p2p/cancellable.py”, line 18, in wait

MASTER: return await self.wait_first(awaitable, token=token, timeout=timeout)

MASTER: File “/code/pyquarkchain/quarkchain/p2p/cancellable.py”, line 42, in wait_first

MASTER: return await token_chain.cancellable_wait(*awaitables, timeout=timeout)

MASTER: File “/code/pyquarkchain/quarkchain/p2p/cancel_token/token.py”, line 144, in cancellable_wait

MASTER: “Cancellation requested by {} token”.format(self.triggered_token)

MASTER: quarkchain.p2p.cancel_token.exceptions.OperationCancelled: Cancellation requested by QuarkServer:p2pserver token

MASTER:

SLAVE_S1: I0620 12:50:11.337485 shard_state.py:1493] [1/0] shard tip reset from 3640728 to 3640746 by root block 627356

MASTER: I0620 12:50:11.337944 peer.py:1125] QuarkPeer <Node(0x5ec4@116.203.214.61)> finished, removing from pool

MASTER: I0620 12:50:11.350669 discovery.py:500] stopping discovery

MASTER: I0620 12:50:11.352723 peer.py:943] Stopping all peers …

MASTER: I0620 12:50:11.452553 p2p_server.py:178] Closing server…

SLAVE_S1: I0620 12:50:11.453733 shard_state.py:1493] [5/0] shard tip reset from 3637361 to 3637379 by root block 627356

MASTER: I0620 12:50:11.455222 master.py:1864] Master server is shutdown

SLAVE_S2 is dead. Shutting down the cluster…

Have you ever received rewards? It would be strange for rewards to stop coming in unless the node hasn’t finished syncing.
Please try the stats tool to find out whether it has finished syncing.

Sure, I have received rewards. I used the stats tool when they stopped coming in, and that’s how I noticed in the first place that syncing was still ongoing but no more blocks were being added, so it was stuck. After restarting, this error occurred. I have now solved it again by downloading the data bootstrap from Saturday and restarting the cluster with that data; the chain is synced and rewards are coming in again. So the only solution when such an error occurs is downloading the latest data and starting from there…
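For reference, the recovery amounts to something like the sketch below. This is a rough, hypothetical helper, not an official tool: the snapshot URL is a placeholder for whatever bootstrap archive QuarkChain currently publishes, the data path matches my Docker volume mount, the archive layout may differ, and the container must be stopped before touching the directory.

# Hypothetical recovery sketch (not an official tool): replace the host data
# directory that is mounted into the container with a fresh bootstrap archive.
import shutil
import tarfile
import urllib.request

SNAPSHOT_URL = "https://example.com/qkc-mainnet-data.tar.gz"  # placeholder, not the real URL
DATA_DIR = "/QKC/data"  # host path mounted into the container in my docker run command

# Download the bootstrap archive to a temporary file.
archive, _ = urllib.request.urlretrieve(SNAPSHOT_URL, "/tmp/qkc-snapshot.tar.gz")

# Drop the corrupted databases, then unpack the snapshot in their place.
# Adjust the extraction target depending on how the archive is laid out.
shutil.rmtree(DATA_DIR, ignore_errors=True)
with tarfile.open(archive) as tar:
    tar.extractall(DATA_DIR)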

Well, @junjiah may have some ideas about this.

And yesterday evening it happened again:

SLAVE_S0: I0622 20:29:23.712678 shard.py:703] [0/0] got new block with height 3691243

SLAVE_S2: E0622 20:29:23.745532 protocol.py:168] Traceback (most recent call last):

SLAVE_S2: File “/code/pyquarkchain/quarkchain/protocol.py”, line 166, in __internal_handle_metadata_and_raw_data

SLAVE_S2: await self.handle_metadata_and_raw_data(metadata, raw_data)

SLAVE_S2: File “/code/pyquarkchain/quarkchain/protocol.py”, line 152, in handle_metadata_and_raw_data

SLAVE_S2: await self.__handle_rpc_request(op, cmd, rpc_id, metadata)

SLAVE_S2: File “/code/pyquarkchain/quarkchain/protocol.py”, line 127, in __handle_rpc_request

SLAVE_S2: resp = await handler(self, request)

SLAVE_S2: File “slave.py”, line 749, in handle_add_xshard_tx_list_request

SLAVE_S2: req.minor_block_hash, req.tx_list

SLAVE_S2: File “/code/pyquarkchain/quarkchain/cluster/shard_state.py”, line 1325, in add_cross_shard_tx_list_by_minor_block_hash

SLAVE_S2: self.db.put_minor_block_xshard_tx_list(h, tx_list)

SLAVE_S2: File “/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py”, line 478, in put_minor_block_xshard_tx_list

SLAVE_S2: self.db.put(b"xShard_" + h, tx_list.serialize())

SLAVE_S2: File “/code/pyquarkchain/quarkchain/db.py”, line 91, in put

SLAVE_S2: return self._db.put(key, value)

SLAVE_S2: File “rocksdb/_rocksdb.pyx”, line 1472, in rocksdb._rocksdb.DB.put

SLAVE_S2: File “rocksdb/_rocksdb.pyx”, line 75, in rocksdb._rocksdb.check_status

SLAVE_S2: rocksdb.errors.Corruption: b’Corruption: block checksum mismatch: expected 823391264, got 1138716802 in ./qkc-data/mainnet/shard-131073.db/006482.sst offset 59586622 size 3341’

SLAVE_S2:

SLAVE_S2: I0622 20:29:23.745861 slave.py:710] Closing connection with slave b’S0’

SLAVE_S0: E0622 20:29:23.749307 protocol.py:181] Traceback (most recent call last):

SLAVE_S0: File “/code/pyquarkchain/quarkchain/protocol.py”, line 175, in loop_once

SLAVE_S0: metadata, raw_data = await self.read_metadata_and_raw_data()

SLAVE_S0: File “/code/pyquarkchain/quarkchain/protocol.py”, line 269, in read_metadata_and_raw_data

SLAVE_S0: size_bytes = await self.__read_fully(4, allow_eof=True)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/protocol.py”, line 254, in __read_fully

SLAVE_S0: bs = await self.reader.read(n)

SLAVE_S0: File “/usr/local/lib/python3.7/asyncio/streams.py”, line 640, in read

SLAVE_S0: await self._wait_for_data(‘read’)

SLAVE_S0: File “/usr/local/lib/python3.7/asyncio/streams.py”, line 473, in _wait_for_data

SLAVE_S0: await self._waiter

SLAVE_S0: File “/usr/local/lib/python3.7/asyncio/selector_events.py”, line 804, in _read_ready__data_received

SLAVE_S0: data = self._sock.recv(self.max_size)

SLAVE_S0: ConnectionResetError: [Errno 104] Connection reset by peer

SLAVE_S0:

SLAVE_S0: I0622 20:29:23.749708 slave.py:710] Closing connection with slave b’S2’

SLAVE_S0: E0622 20:29:23.750336 protocol.py:168] Traceback (most recent call last):

SLAVE_S0: File “/code/pyquarkchain/quarkchain/protocol.py”, line 166, in __internal_handle_metadata_and_raw_data

SLAVE_S0: await self.handle_metadata_and_raw_data(metadata, raw_data)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/protocol.py”, line 147, in handle_metadata_and_raw_data

SLAVE_S0: await self.__handle_request(op, cmd)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/protocol.py”, line 123, in __handle_request

SLAVE_S0: await handler(self, op, request, 0)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/shard.py”, line 197, in handle_new_block_minor_command

SLAVE_S0: await self.shard.handle_new_block(cmd.block)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/shard.py”, line 708, in handle_new_block

SLAVE_S0: await self.add_block(block)

SLAVE_S0: File “/code/pyquarkchain/quarkchain/cluster/shard.py”, line 770, in add_block

SLAVE_S0: await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)

SLAVE_S0: File “slave.py”, line 1105, in broadcast_xshard_tx_list

SLAVE_S0: responses = await asyncio.gather(*rpc_futures)

SLAVE_S0: RuntimeError: S0<->S2: connection abort

SLAVE_S0:

SLAVE_S0: E0622 20:29:23.750675 shard.py:68] Closing shard connection with error slave_master_vconn_11408676719790611844: error processing request: S0<->S2: connection abort

It seems the key reason is RocksDB:

SLAVE_S2: File “/code/pyquarkchain/quarkchain/db.py”, line 91, in put

SLAVE_S2: return self._db.put(key, value)

SLAVE_S2: File “rocksdb/_rocksdb.pyx”, line 1472, in rocksdb._rocksdb.DB.put

SLAVE_S2: File “rocksdb/_rocksdb.pyx”, line 75, in rocksdb._rocksdb.check_status

SLAVE_S2: rocksdb.errors.Corruption: b’Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349’

SLAVE_S2:

SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b’S0’
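If you want to confirm which shard database is damaged before restoring anything, a scan along these lines might help. It is only a sketch, assuming the python-rocksdb bindings that pyquarkchain uses are installed and that the shard databases live under ./qkc-data/mainnet/ as in the traceback; iterating every key forces the block checksum verification that otherwise only fails when a key happens to be read. Run it only while the cluster is stopped.

# Hypothetical integrity scan for the shard RocksDB directories (not part of pyquarkchain).
import glob
import rocksdb  # python-rocksdb, the same bindings pyquarkchain uses

def scan(path):
    """Open one shard DB read-only and touch every key so checksum
    mismatches surface as rocksdb.errors.Corruption."""
    db = rocksdb.DB(path, rocksdb.Options(create_if_missing=False), read_only=True)
    it = db.iterkeys()
    it.seek_to_first()
    count = 0
    for _ in it:
        count += 1
    return count

for path in sorted(glob.glob("./qkc-data/mainnet/shard-*.db")):
    try:
        print(path, "OK,", scan(path), "keys")
    except rocksdb.errors.Corruption as e:
        print(path, "CORRUPTED:", e)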

Can you give us the commands you use to start the Docker image and the cluster? I wonder if a dependency is broken. You are probably the only one who regularly sees this issue.

Hi,

I first export the config file:

export QKC_CONFIG=`pwd`/mainnet/singularity/cluster_config_template.json

After that I start the cluster:

python3 quarkchain/cluster/cluster.py --cluster_config $QKC_CONFIG

You said you are using our Docker images, right? Can you also provide the command you use to start the container?

docker run -v /QKC/data:/code/pyquarkchain/quarkchain/cluster/qkc-data/mainnet -it --restart=always -p 38291:38291 -p 38391:38391 -p 38491:38491 -p 38291:38291/udp quarkchaindocker/pyquarkchain:mainnet1.4.2
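Since --restart=always already brings the container back when the cluster process exits, the part worth watching is the corruption message itself. A small hypothetical helper like the one below could tail the container logs and report which shard database is affected; it assumes the container is started with an explicit name (e.g. adding --name qkc-node to the command above), which my command does not currently set.

# Hypothetical log watchdog (not part of pyquarkchain): tails the container
# logs and reports which shard database hit a RocksDB corruption error.
import re
import subprocess

CONTAINER = "qkc-node"  # placeholder name; requires --name qkc-node on docker run

proc = subprocess.Popen(
    ["docker", "logs", "-f", "--tail", "0", CONTAINER],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
for line in proc.stdout:
    m = re.search(r"Corruption: block checksum mismatch.*?(shard-\d+\.db)", line)
    if m:
        print("corruption detected in", m.group(1), "- restore this shard from a snapshot")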

Interesting problem. I have set up several nodes but cannot reproduce the issue (although I don’t use these nodes to mine). The error indicates that RocksDB’s data files are corrupted; I have also heard that several other projects hit this corruption issue (e.g., geth).

I am going to set up a node with mining to see if it is reproducible. Also, did you start the node with the snapshot from us, or without a snapshot (so that it downloads all the blocks from peers)?

I start with the snapshot from you

OK. Another possible source of the issue may be the snapshot itself. I will also give that a try.

After running pyqkc with the snapshot for about 12 hours, no issue has been found yet.

Same on my side. I have been running the cluster for several days now without any issues.

In the past I was also able to run it for several months without a crash, both on a Digital Ocean VPS and on my home computer.

@MenMenM any update?

No update for now, as I was waiting to see whether you had found something.

Looks like the error cannot be reproduced, so I will have to change my setup, since the errors happen too often. Using a VPS server to run the mining cluster is maybe a good tip.

I have the same problem here. Did you fix it?
Also, is there any way to auto-restart it?
Maybe someone can share best practices for setting this up for now. Thank you in advance.