QKC cluster errors stop cluster instance on Docker

Running a cluster, I get these errors:

SLAVE_S2: W0616 07:20:26.066809 shard.py:827] minor block to sync 47333514e5040b28e6b044172c2592dc11c4fef9aeb1aed31038945cc5323408 is already committed
MASTER: E0616 07:20:26.071119 master.py:120] Traceback (most recent call last):
MASTER:   File "master.py", line 118, in sync
MASTER:     await self.__run_sync()
MASTER:   File "master.py", line 262, in __run_sync
MASTER:     await self.__add_block(block)
MASTER:   File "master.py", line 285, in __add_block
MASTER:     await self.__sync_minor_blocks(root_block.minor_block_header_list)
MASTER:   File "master.py", line 333, in __sync_minor_blocks
MASTER:     m_header.get_hash().hex(), m_header.branch.to_str()
MASTER: RuntimeError: minor block 96089d07d4192f921e53d2748e3542a6f0ad4b103dc5aa98d85714ce5f8d1bf2 from 2/0 is still unavailable in master after root block sync
MASTER:
MASTER: I0616 07:20:26.071496 simple_network.py:177] Closing peer 45250555ed31ecac9e3e68fd03bc8b7c540de425c54badca2dd4e771a16e7b0e with the following reason: minor block 96089d07d4192f921e53d2748e3542a6f0ad4b103dc5aa98d85714ce5f8d1bf2 from 2/0 is still unavailable in master after root block sync
MASTER: I0616 07:20:26.071752 p2p_manager.py:301] destroying proxy slave connections for <Node(0x0d62@159.89.142.41)>

They happen for a while and then the cluster shuts down.

Mining is still going on from all the clients ("MASTER: I0616 07:22:26.429503 jsonrpc.py:531] {"id":1,"jsonrpc":"2.0","method":"getWork","params":["0x1"]}"), but no rewards are earned in the end.
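That is the exact getWork request the miners keep sending, so the endpoint can also be poked by hand to confirm the master is still answering. A minimal sketch, assuming the mining JSON-RPC is on port 38391 as in the docker run command further down (adjust host/port to your setup):

# replay the getWork call from the log above against the local node
# (port 38391 is an assumption based on my port mapping; change it if yours differs)
import json
import urllib.request

payload = {"id": 1, "jsonrpc": "2.0", "method": "getWork", "params": ["0x1"]}
req = urllib.request.Request(
    "http://127.0.0.1:38391",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(json.load(resp))  # a healthy node answers with work for the given shard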

Any idea how to solve this? I know I can download the latest daily data and then start a new Docker instance from that data; this solves the problem for a while, until it happens again. There should be an easier and less time-consuming way, since downloading all the data takes hours.

Hi, can you tell us how to reproduce it?

I run a QKC cluster inside a Docker container, and after it has run for some time without any errors this happens. Here is the wiki on how to run a cluster in Docker: https://github.com/QuarkChain/pyquarkchain/wiki/Start-Clusters-on-the-QuarkChain

So I do nothing different at all, just clients connecting to the cluster to mine.

Hi, which version of the image are you using? How often does the error happen? Did you modify the pyquarkchain code? What miner do you use?


QuarkChain dev here. Have you made any modifications to the code?

We want a reliable way to reproduce this problem but can't find one. Can you help us write up how to reproduce this bug reliably?

I used mainnet 1.4.2. The only things changed are the reward address, set to an address of mine :wink:, and the host, set to 0.0.0.0 instead of 127.0.0.1 to make it possible to connect to the dockerized QuarkChain instance. The miner I use is ethminer (https://github.com/QuarkChain/ethminer.git), compiled from source with CUDA (some miners use CUDA 10.2, some CUDA 11.0). Everything runs smoothly until the first error appears after a while (sometimes a couple of hours, sometimes a day), and then the error keeps occurring, which sometimes crashes the running QuarkChain instance. From the moment of the first error no rewards are received, although the miners keep mining and display no errors on the client side, and in the QuarkChain instance the getWork requests keep coming.
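To be precise about "the only things changed": this is roughly the edit I make to the config template before starting. A sketch only; the field names COINBASE_ADDRESS and JSON_RPC_HOST are written from memory and should be checked against the actual cluster_config_template.json.

# sketch of my two modifications to the stock config
# (field names are assumptions -- verify them in cluster_config_template.json)
import json

path = "mainnet/singularity/cluster_config_template.json"
with open(path) as f:
    cfg = json.load(f)

# 1) point the block rewards at my own address (placeholder value here)
for chain in cfg["QUARKCHAIN"]["CHAINS"]:
    chain["COINBASE_ADDRESS"] = "0x" + "00" * 24

# 2) listen on all interfaces instead of localhost so miners can reach the container
cfg["JSON_RPC_HOST"] = "0.0.0.0"

with open(path, "w") as f:
    json.dump(cfg, f, indent=4)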

How much RAM do you have on your cluster server?

8 GB of RAM available out of 12 GB; only about 4 GB used in total.
4-core CPU, load about 2.5-3.5.

Since the last recovery, downloading the latest data and restarting the Docker image with it, the cluster has now been running without errors for almost 36 hours.

Had to stop the cluster again as rewards weren't coming in. Restarting produced these error lines:

SLAVE_S2: E0620 12:49:51.347818 protocol.py:168] Traceback (most recent call last):
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S2:     await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 152, in handle_metadata_and_raw_data
SLAVE_S2:     await self.__handle_rpc_request(op, cmd, rpc_id, metadata)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 127, in __handle_rpc_request
SLAVE_S2:     resp = await handler(self, request)
SLAVE_S2:   File "slave.py", line 749, in handle_add_xshard_tx_list_request
SLAVE_S2:     req.minor_block_hash, req.tx_list
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_state.py", line 1325, in add_cross_shard_tx_list_by_minor_block_hash
SLAVE_S2:     self.db.put_minor_block_xshard_tx_list(h, tx_list)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py", line 478, in put_minor_block_xshard_tx_list
SLAVE_S2:     self.db.put(b"xShard_" + h, tx_list.serialize())
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2:     return self._db.put(key, value)
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349'
SLAVE_S2:
SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b'S0'
SLAVE_S0: E0620 12:49:51.352953 miner.py:339] Traceback (most recent call last):
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/miner.py", line 333, in submit_work
SLAVE_S0:     await self.add_block_async_func(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 544, in __add_block
SLAVE_S0:     await self.handle_new_block(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 708, in handle_new_block
SLAVE_S0:     await self.add_block(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 770, in add_block
SLAVE_S0:     await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)
SLAVE_S0:   File "slave.py", line 1105, in broadcast_xshard_tx_list
SLAVE_S0:     responses = await asyncio.gather(*rpc_futures)
SLAVE_S0: RuntimeError: S0<->S2: connection abort

and further on

MASTER: E0620 12:50:11.319340 master.py:120] Traceback (most recent call last):
MASTER:   File "master.py", line 118, in sync
MASTER:     await self.__run_sync()
MASTER:   File "master.py", line 262, in __run_sync
MASTER:     await self.__add_block(block)
MASTER:   File "master.py", line 286, in __add_block
MASTER:     await self.master_server.add_root_block(root_block)
MASTER:   File "master.py", line 1280, in add_root_block
MASTER:     result_list = await asyncio.gather(*future_list)
MASTER: RuntimeError: master_slave_b'S2': connection abort
MASTER:
MASTER: I0620 12:50:11.320236 simple_network.py:177] Closing peer 5ec483ebffad228d24f62423c984153d801a7ce8bfdba5019ea721d054ead893 with the following reason: master_slave_b'S2': connection abort
MASTER: I0620 12:50:11.321521 master.py:423] [R] done sync task 628800 1015aeb71bf08dae59fc2412b398f7b3315a1de296fc01bb74eadcf85322b9fc
MASTER: I0620 12:50:11.322227 master.py:428] [R] synchronizer finished!
MASTER: E0620 12:50:11.326245 p2p_manager.py:146] Unknown exception from <Node(0x5ec4@116.203.214.61)>, message: OperationCancelled('Cancellation requested by QuarkServer:p2pserver token')
MASTER: E0620 12:50:11.333310 p2p_manager.py:147] Traceback (most recent call last):
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 131, in _run
MASTER:     metadata, raw_data = await self.secure_peer.read_metadata_and_raw_data()
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 292, in read_metadata_and_raw_data
MASTER:     data = await self.quark_peer.read_raw_bytes(timeout=None)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 115, in read_raw_bytes
MASTER:     frame_data = await self.read(frame_size + MAC_LEN, timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/peer.py", line 344, in read
MASTER:     return await self.wait(self.reader.readexactly(n), timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/cancellable.py", line 18, in wait
MASTER:     return await self.wait_first(awaitable, token=token, timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/cancellable.py", line 42, in wait_first
MASTER:     return await token_chain.cancellable_wait(*awaitables, timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/cancel_token/token.py", line 144, in cancellable_wait
MASTER:     "Cancellation requested by {} token".format(self.triggered_token)
MASTER: quarkchain.p2p.cancel_token.exceptions.OperationCancelled: Cancellation requested by QuarkServer:p2pserver token
MASTER:
SLAVE_S1: I0620 12:50:11.337485 shard_state.py:1493] [1/0] shard tip reset from 3640728 to 3640746 by root block 627356
MASTER: I0620 12:50:11.337944 peer.py:1125] QuarkPeer <Node(0x5ec4@116.203.214.61)> finished, removing from pool
MASTER: I0620 12:50:11.350669 discovery.py:500] stopping discovery
MASTER: I0620 12:50:11.352723 peer.py:943] Stopping all peers ...
MASTER: I0620 12:50:11.452553 p2p_server.py:178] Closing server...
SLAVE_S1: I0620 12:50:11.453733 shard_state.py:1493] [5/0] shard tip reset from 3637361 to 3637379 by root block 627356
MASTER: I0620 12:50:11.455222 master.py:1864] Master server is shutdown
SLAVE_S2 is dead. Shutting down the cluster...

Have you ever received rewards? It would be strange for you not to receive rewards unless the node hasn't finished syncing.
Please try the stats tool to find out whether it has finished syncing.
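If the stats tool is inconvenient, a quick check against the JSON-RPC can also tell you whether the root chain tip is still moving. A rough sketch; the method name getRootBlockByHeight, the null-height-returns-tip behaviour, and the port 38391 are from memory, so fall back to the stats tool if it does not work as written.

# ask the local node for its current root block and compare its height
# against the public explorer or another node; if it stops growing, the node is stuck
# (method name, null-for-tip behaviour and port 38391 are assumptions)
import json
import urllib.request

payload = {"id": 1, "jsonrpc": "2.0", "method": "getRootBlockByHeight", "params": [None]}
req = urllib.request.Request(
    "http://127.0.0.1:38391",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    tip = json.load(resp)["result"]
print("local root tip height:", tip["height"])  # likely a hex quantity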

Sure, I have received rewards. I used the stats tool when they stopped coming, which is how I noticed in the first place that syncing was ongoing but no more blocks were being added, so it was stuck. After restarting, this error occurred. I have now solved it again by downloading the data bootstrap from Saturday and restarting the cluster with that data; the chain is synced and rewards are coming in again. So the only solution when such an error occurs is downloading the latest data and starting from there.

Well, @junjiah may have some ideas about this.

And yesterday evening it happened again:

SLAVE_S0: I0622 20:29:23.712678 shard.py:703] [0/0] got new block with height 3691243
SLAVE_S2: E0622 20:29:23.745532 protocol.py:168] Traceback (most recent call last):
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S2:     await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 152, in handle_metadata_and_raw_data
SLAVE_S2:     await self.__handle_rpc_request(op, cmd, rpc_id, metadata)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 127, in __handle_rpc_request
SLAVE_S2:     resp = await handler(self, request)
SLAVE_S2:   File "slave.py", line 749, in handle_add_xshard_tx_list_request
SLAVE_S2:     req.minor_block_hash, req.tx_list
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_state.py", line 1325, in add_cross_shard_tx_list_by_minor_block_hash
SLAVE_S2:     self.db.put_minor_block_xshard_tx_list(h, tx_list)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py", line 478, in put_minor_block_xshard_tx_list
SLAVE_S2:     self.db.put(b"xShard_" + h, tx_list.serialize())
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2:     return self._db.put(key, value)
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 823391264, got 1138716802 in ./qkc-data/mainnet/shard-131073.db/006482.sst offset 59586622 size 3341'
SLAVE_S2:
SLAVE_S2: I0622 20:29:23.745861 slave.py:710] Closing connection with slave b'S0'
SLAVE_S0: E0622 20:29:23.749307 protocol.py:181] Traceback (most recent call last):
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 175, in loop_once
SLAVE_S0:     metadata, raw_data = await self.read_metadata_and_raw_data()
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 269, in read_metadata_and_raw_data
SLAVE_S0:     size_bytes = await self.__read_fully(4, allow_eof=True)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 254, in __read_fully
SLAVE_S0:     bs = await self.reader.read(n)
SLAVE_S0:   File "/usr/local/lib/python3.7/asyncio/streams.py", line 640, in read
SLAVE_S0:     await self._wait_for_data('read')
SLAVE_S0:   File "/usr/local/lib/python3.7/asyncio/streams.py", line 473, in _wait_for_data
SLAVE_S0:     await self._waiter
SLAVE_S0:   File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 804, in _read_ready__data_received
SLAVE_S0:     data = self._sock.recv(self.max_size)
SLAVE_S0: ConnectionResetError: [Errno 104] Connection reset by peer
SLAVE_S0:
SLAVE_S0: I0622 20:29:23.749708 slave.py:710] Closing connection with slave b'S2'
SLAVE_S0: E0622 20:29:23.750336 protocol.py:168] Traceback (most recent call last):
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S0:     await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 147, in handle_metadata_and_raw_data
SLAVE_S0:     await self.__handle_request(op, cmd)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 123, in __handle_request
SLAVE_S0:     await handler(self, op, request, 0)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 197, in handle_new_block_minor_command
SLAVE_S0:     await self.shard.handle_new_block(cmd.block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 708, in handle_new_block
SLAVE_S0:     await self.add_block(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 770, in add_block
SLAVE_S0:     await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)
SLAVE_S0:   File "slave.py", line 1105, in broadcast_xshard_tx_list
SLAVE_S0:     responses = await asyncio.gather(*rpc_futures)
SLAVE_S0: RuntimeError: S0<->S2: connection abort
SLAVE_S0:
SLAVE_S0: E0622 20:29:23.750675 shard.py:68] Closing shard connection with error slave_master_vconn_11408676719790611844: error processing request: S0<->S2: connection abort

Seems the key reason is rocksdb:

SLAVE_S2:   File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2:     return self._db.put(key, value)
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349'
SLAVE_S2:
SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b'S0'
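If you want to see which shard databases are actually damaged before wiping the data, you can walk each DB read-only with the same python-rocksdb binding the cluster uses (it is the one in your traceback); reading every block forces the checksums to be verified, so a bad .sst file surfaces as the same Corruption error. A rough sketch, assuming the qkc-data layout from your volume mount and that the DBs are not being written to while you scan:

# open every shard DB under qkc-data read-only and iterate over all keys;
# a corrupt .sst raises rocksdb.errors.Corruption for that DB
# (paths are an assumption based on the volume mount; adjust as needed)
import glob
import rocksdb

for path in sorted(glob.glob("./qkc-data/mainnet/*.db")):
    try:
        db = rocksdb.DB(path, rocksdb.Options(create_if_missing=False), read_only=True)
        it = db.iterkeys()
        it.seek_to_first()
        n = sum(1 for _ in it)  # reads (and checksums) every data block
        print(path, "OK,", n, "keys")
    except rocksdb.errors.Corruption as exc:
        print(path, "CORRUPT:", exc)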

Can you give us your commands for starting the Docker image and the cluster? I wonder if a dependency is broken; you are probably the only one who regularly sees this issue.

Hi,

I first export the config file path:

export QKC_CONFIG=$(pwd)/mainnet/singularity/cluster_config_template.json

After that, I start the cluster:

python3 quarkchain/cluster/cluster.py --cluster_config $QKC_CONFIG

You said you are using our Docker images, right? Can you also provide the command you use to start the container?

docker run -v /QKC/data:/code/pyquarkchain/quarkchain/cluster/qkc-data/mainnet -it --restart=always -p 38291:38291 -p 38391:38391 -p 38491:38491 -p 38291:38291/udp quarkchaindocker/pyquarkchain:mainnet1.4.2

Interesting problem. I have set up several nodes but cannot reproduce the issue (although I don't use these nodes to mine). The error indicates that RocksDB's data files are corrupted; I have also heard that several other projects hit this corruption issue (e.g., geth).

I am going to set up a node with mining to see if it is reproducible. Also, do you start the node with the snapshot from us, or do you start it without a snapshot (and thus download all the blocks from peers)?

I start with the snapshot from you.