QKC cluster errors stop cluster instance on Docker

Running a cluster, I get these errors:

SLAVE_S2: W0616 07:20:26.066809 shard.py:827] minor block to sync 47333514e5040b28e6b044172c2592dc11c4fef9aeb1aed31038945cc5323408 is already committed
MASTER: E0616 07:20:26.071119 master.py:120] Traceback (most recent call last):
MASTER:   File "master.py", line 118, in sync
MASTER:     await self.__run_sync()
MASTER:   File "master.py", line 262, in __run_sync
MASTER:     await self.__add_block(block)
MASTER:   File "master.py", line 285, in __add_block
MASTER:     await self.__sync_minor_blocks(root_block.minor_block_header_list)
MASTER:   File "master.py", line 333, in __sync_minor_blocks
MASTER:     m_header.get_hash().hex(), m_header.branch.to_str()
MASTER: RuntimeError: minor block 96089d07d4192f921e53d2748e3542a6f0ad4b103dc5aa98d85714ce5f8d1bf2 from 2/0 is still unavailable in master after root block sync
MASTER:
MASTER: I0616 07:20:26.071496 simple_network.py:177] Closing peer 45250555ed31ecac9e3e68fd03bc8b7c540de425c54badca2dd4e771a16e7b0e with the following reason: minor block 96089d07d4192f921e53d2748e3542a6f0ad4b103dc5aa98d85714ce5f8d1bf2 from 2/0 is still unavailable in master after root block sync
MASTER: I0616 07:20:26.071752 p2p_manager.py:301] destroying proxy slave connections for <Node(0x0d62@159.89.142.41)>

They happen for a while and then the cluster shuts down.

Mining is still going on from all the clients ("MASTER: I0616 07:22:26.429503 jsonrpc.py:531] {"id":1,"jsonrpc":"2.0","method":"getWork","params":["0x1"]}"), but no rewards are earned in the end.
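That is the exact getWork request the miners keep sending, so the endpoint can also be poked by hand to confirm the master is still answering. A minimal sketch, assuming the mining JSON-RPC is on port 38391 as in the docker run command further down (adjust host/port to your setup):

# replay the getWork call from the log above against the local node
# (port 38391 is an assumption based on my port mapping; change it if yours differs)
import json
import urllib.request

payload = {"id": 1, "jsonrpc": "2.0", "method": "getWork", "params": ["0x1"]}
req = urllib.request.Request(
    "http://127.0.0.1:38391",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(json.load(resp))  # a healthy node answers with work for the given shard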

Any idea how to solve this? I know I can download the latest daily data and then start a new Docker instance from that data; this solves the problem for a while, until it happens again. There should be an easier and less time-consuming way, since downloading all the data takes hours.

Hi, can you tell us how to reproduce it?

I run a QKC cluster inside a Docker container, and after it has run for some time without any errors this happens. Here is the wiki on how to run a cluster in Docker: https://github.com/QuarkChain/pyquarkchain/wiki/Start-Clusters-on-the-QuarkChain

So I do nothing different at all, just clients connecting to the cluster to mine.

Hi, which version of the image are you using? How often does the error happen? Did you modify the pyquarkchain code? What miner do you use?


QuarkChain dev here. Have you made any modifications to the code?

We want a reliable way to reproduce this problem but can't find one. Can you help us write up how to reproduce this bug reliably?

I used mainnet 1.4.2. The only things changed are the reward address, set to an address of mine :wink:, and the host, set to 0.0.0.0 instead of 127.0.0.1 to make it possible to connect to the dockerized QuarkChain instance. The miner I use is ethminer (https://github.com/QuarkChain/ethminer.git), compiled from source with CUDA (some miners use CUDA 10.2, some CUDA 11.0). Everything runs smoothly until the first error appears after a while (sometimes a couple of hours, sometimes a day), and then the error keeps occurring, which sometimes crashes the running QuarkChain instance. From the moment of the first error no rewards are received, although the miners keep mining and display no errors on the client side, and in the QuarkChain instance the getWork requests keep coming.
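To be precise about "the only things changed": this is roughly the edit I make to the config template before starting. A sketch only; the field names COINBASE_ADDRESS and JSON_RPC_HOST are written from memory and should be checked against the actual cluster_config_template.json.

# sketch of my two modifications to the stock config
# (field names are assumptions -- verify them in cluster_config_template.json)
import json

path = "mainnet/singularity/cluster_config_template.json"
with open(path) as f:
    cfg = json.load(f)

# 1) point the block rewards at my own address (placeholder value here)
for chain in cfg["QUARKCHAIN"]["CHAINS"]:
    chain["COINBASE_ADDRESS"] = "0x" + "00" * 24

# 2) listen on all interfaces instead of localhost so miners can reach the container
cfg["JSON_RPC_HOST"] = "0.0.0.0"

with open(path, "w") as f:
    json.dump(cfg, f, indent=4)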

How much RAM do you have on your cluster server?

8 GB of RAM available out of 12 GB; only about 4 GB used in total.
4-core CPU, load about 2.5-3.5.

Since the last recovery, downloading the latest data and restarting the Docker image with it, the cluster has now been running without errors for almost 36 hours.

Had to stop the cluster again as rewards weren't coming in. Restarting produced these error lines:

SLAVE_S2: E0620 12:49:51.347818 protocol.py:168] Traceback (most recent call last):
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S2:     await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 152, in handle_metadata_and_raw_data
SLAVE_S2:     await self.__handle_rpc_request(op, cmd, rpc_id, metadata)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 127, in __handle_rpc_request
SLAVE_S2:     resp = await handler(self, request)
SLAVE_S2:   File "slave.py", line 749, in handle_add_xshard_tx_list_request
SLAVE_S2:     req.minor_block_hash, req.tx_list
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_state.py", line 1325, in add_cross_shard_tx_list_by_minor_block_hash
SLAVE_S2:     self.db.put_minor_block_xshard_tx_list(h, tx_list)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py", line 478, in put_minor_block_xshard_tx_list
SLAVE_S2:     self.db.put(b"xShard_" + h, tx_list.serialize())
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2:     return self._db.put(key, value)
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349'
SLAVE_S2:
SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b'S0'
SLAVE_S0: E0620 12:49:51.352953 miner.py:339] Traceback (most recent call last):
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/miner.py", line 333, in submit_work
SLAVE_S0:     await self.add_block_async_func(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 544, in __add_block
SLAVE_S0:     await self.handle_new_block(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 708, in handle_new_block
SLAVE_S0:     await self.add_block(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 770, in add_block
SLAVE_S0:     await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)
SLAVE_S0:   File "slave.py", line 1105, in broadcast_xshard_tx_list
SLAVE_S0:     responses = await asyncio.gather(*rpc_futures)
SLAVE_S0: RuntimeError: S0<->S2: connection abort

and further on

MASTER: E0620 12:50:11.319340 master.py:120] Traceback (most recent call last):
MASTER:   File "master.py", line 118, in sync
MASTER:     await self.__run_sync()
MASTER:   File "master.py", line 262, in __run_sync
MASTER:     await self.__add_block(block)
MASTER:   File "master.py", line 286, in __add_block
MASTER:     await self.master_server.add_root_block(root_block)
MASTER:   File "master.py", line 1280, in add_root_block
MASTER:     result_list = await asyncio.gather(*future_list)
MASTER: RuntimeError: master_slave_b'S2': connection abort
MASTER:
MASTER: I0620 12:50:11.320236 simple_network.py:177] Closing peer 5ec483ebffad228d24f62423c984153d801a7ce8bfdba5019ea721d054ead893 with the following reason: master_slave_b'S2': connection abort
MASTER: I0620 12:50:11.321521 master.py:423] [R] done sync task 628800 1015aeb71bf08dae59fc2412b398f7b3315a1de296fc01bb74eadcf85322b9fc
MASTER: I0620 12:50:11.322227 master.py:428] [R] synchronizer finished!
MASTER: E0620 12:50:11.326245 p2p_manager.py:146] Unknown exception from <Node(0x5ec4@116.203.214.61)>, message: OperationCancelled('Cancellation requested by QuarkServer:p2pserver token')
MASTER: E0620 12:50:11.333310 p2p_manager.py:147] Traceback (most recent call last):
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 131, in _run
MASTER:     metadata, raw_data = await self.secure_peer.read_metadata_and_raw_data()
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 292, in read_metadata_and_raw_data
MASTER:     data = await self.quark_peer.read_raw_bytes(timeout=None)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 115, in read_raw_bytes
MASTER:     frame_data = await self.read(frame_size + MAC_LEN, timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/peer.py", line 344, in read
MASTER:     return await self.wait(self.reader.readexactly(n), timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/cancellable.py", line 18, in wait
MASTER:     return await self.wait_first(awaitable, token=token, timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/cancellable.py", line 42, in wait_first
MASTER:     return await token_chain.cancellable_wait(*awaitables, timeout=timeout)
MASTER:   File "/code/pyquarkchain/quarkchain/p2p/cancel_token/token.py", line 144, in cancellable_wait
MASTER:     "Cancellation requested by {} token".format(self.triggered_token)
MASTER: quarkchain.p2p.cancel_token.exceptions.OperationCancelled: Cancellation requested by QuarkServer:p2pserver token
MASTER:
SLAVE_S1: I0620 12:50:11.337485 shard_state.py:1493] [1/0] shard tip reset from 3640728 to 3640746 by root block 627356
MASTER: I0620 12:50:11.337944 peer.py:1125] QuarkPeer <Node(0x5ec4@116.203.214.61)> finished, removing from pool
MASTER: I0620 12:50:11.350669 discovery.py:500] stopping discovery
MASTER: I0620 12:50:11.352723 peer.py:943] Stopping all peers ...
MASTER: I0620 12:50:11.452553 p2p_server.py:178] Closing server...
SLAVE_S1: I0620 12:50:11.453733 shard_state.py:1493] [5/0] shard tip reset from 3637361 to 3637379 by root block 627356
MASTER: I0620 12:50:11.455222 master.py:1864] Master server is shutdown
SLAVE_S2 is dead. Shutting down the cluster...

Have you ever received rewards? It would be strange for you not to receive rewards unless the node hasn't finished syncing.
Please try the stats tool to find out whether it has finished syncing.
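If the stats tool is inconvenient, a quick check against the JSON-RPC can also tell you whether the root chain tip is still moving. A rough sketch; the method name getRootBlockByHeight, the null-height-returns-tip behaviour, and the port 38391 are from memory, so fall back to the stats tool if it does not work as written.

# ask the local node for its current root block and compare its height
# against the public explorer or another node; if it stops growing, the node is stuck
# (method name, null-for-tip behaviour and port 38391 are assumptions)
import json
import urllib.request

payload = {"id": 1, "jsonrpc": "2.0", "method": "getRootBlockByHeight", "params": [None]}
req = urllib.request.Request(
    "http://127.0.0.1:38391",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    tip = json.load(resp)["result"]
print("local root tip height:", tip["height"])  # likely a hex quantity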

Sure, I have received rewards. I used the stats tool when they stopped coming, which is how I noticed in the first place that syncing was ongoing but no more blocks were being added, so it was stuck. After restarting, this error occurred. I have now solved it again by downloading the data bootstrap from Saturday and restarting the cluster with that data; the chain is synced and rewards are coming in again. So the only solution when such an error occurs is downloading the latest data and starting from there.

Well, @junjiah may have some ideas about this.

And yesterday evening it happened again:

SLAVE_S0: I0622 20:29:23.712678 shard.py:703] [0/0] got new block with height 3691243
SLAVE_S2: E0622 20:29:23.745532 protocol.py:168] Traceback (most recent call last):
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S2:     await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 152, in handle_metadata_and_raw_data
SLAVE_S2:     await self.__handle_rpc_request(op, cmd, rpc_id, metadata)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/protocol.py", line 127, in __handle_rpc_request
SLAVE_S2:     resp = await handler(self, request)
SLAVE_S2:   File "slave.py", line 749, in handle_add_xshard_tx_list_request
SLAVE_S2:     req.minor_block_hash, req.tx_list
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_state.py", line 1325, in add_cross_shard_tx_list_by_minor_block_hash
SLAVE_S2:     self.db.put_minor_block_xshard_tx_list(h, tx_list)
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py", line 478, in put_minor_block_xshard_tx_list
SLAVE_S2:     self.db.put(b"xShard_" + h, tx_list.serialize())
SLAVE_S2:   File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2:     return self._db.put(key, value)
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 823391264, got 1138716802 in ./qkc-data/mainnet/shard-131073.db/006482.sst offset 59586622 size 3341'
SLAVE_S2:
SLAVE_S2: I0622 20:29:23.745861 slave.py:710] Closing connection with slave b'S0'
SLAVE_S0: E0622 20:29:23.749307 protocol.py:181] Traceback (most recent call last):
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 175, in loop_once
SLAVE_S0:     metadata, raw_data = await self.read_metadata_and_raw_data()
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 269, in read_metadata_and_raw_data
SLAVE_S0:     size_bytes = await self.__read_fully(4, allow_eof=True)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 254, in __read_fully
SLAVE_S0:     bs = await self.reader.read(n)
SLAVE_S0:   File "/usr/local/lib/python3.7/asyncio/streams.py", line 640, in read
SLAVE_S0:     await self._wait_for_data('read')
SLAVE_S0:   File "/usr/local/lib/python3.7/asyncio/streams.py", line 473, in _wait_for_data
SLAVE_S0:     await self._waiter
SLAVE_S0:   File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 804, in _read_ready__data_received
SLAVE_S0:     data = self._sock.recv(self.max_size)
SLAVE_S0: ConnectionResetError: [Errno 104] Connection reset by peer
SLAVE_S0:
SLAVE_S0: I0622 20:29:23.749708 slave.py:710] Closing connection with slave b'S2'
SLAVE_S0: E0622 20:29:23.750336 protocol.py:168] Traceback (most recent call last):
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S0:     await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 147, in handle_metadata_and_raw_data
SLAVE_S0:     await self.__handle_request(op, cmd)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/protocol.py", line 123, in __handle_request
SLAVE_S0:     await handler(self, op, request, 0)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 197, in handle_new_block_minor_command
SLAVE_S0:     await self.shard.handle_new_block(cmd.block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 708, in handle_new_block
SLAVE_S0:     await self.add_block(block)
SLAVE_S0:   File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 770, in add_block
SLAVE_S0:     await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)
SLAVE_S0:   File "slave.py", line 1105, in broadcast_xshard_tx_list
SLAVE_S0:     responses = await asyncio.gather(*rpc_futures)
SLAVE_S0: RuntimeError: S0<->S2: connection abort
SLAVE_S0:
SLAVE_S0: E0622 20:29:23.750675 shard.py:68] Closing shard connection with error slave_master_vconn_11408676719790611844: error processing request: S0<->S2: connection abort

Seems the key reason is rocksdb:

SLAVE_S2:   File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2:     return self._db.put(key, value)
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2:   File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349'
SLAVE_S2:
SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b'S0'
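If you want to see which shard databases are actually damaged before wiping the data, you can walk each DB read-only with the same python-rocksdb binding the cluster uses (it is the one in your traceback); reading every block forces the checksums to be verified, so a bad .sst file surfaces as the same Corruption error. A rough sketch, assuming the qkc-data layout from your volume mount and that the DBs are not being written to while you scan:

# open every shard DB under qkc-data read-only and iterate over all keys;
# a corrupt .sst raises rocksdb.errors.Corruption for that DB
# (paths are an assumption based on the volume mount; adjust as needed)
import glob
import rocksdb

for path in sorted(glob.glob("./qkc-data/mainnet/*.db")):
    try:
        db = rocksdb.DB(path, rocksdb.Options(create_if_missing=False), read_only=True)
        it = db.iterkeys()
        it.seek_to_first()
        n = sum(1 for _ in it)  # reads (and checksums) every data block
        print(path, "OK,", n, "keys")
    except rocksdb.errors.Corruption as exc:
        print(path, "CORRUPT:", exc)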

Can you give us your commands for starting the Docker image and the cluster? I wonder if a dependency is broken; you are probably the only one who regularly sees this issue.

Hi,

I first export the config file path:

export QKC_CONFIG=$(pwd)/mainnet/singularity/cluster_config_template.json

After that, I start the cluster:

python3 quarkchain/cluster/cluster.py --cluster_config $QKC_CONFIG

You said you are using our Docker images, right? Can you also provide the command you use to start the container?

docker run -v /QKC/data:/code/pyquarkchain/quarkchain/cluster/qkc-data/mainnet -it --restart=always -p 38291:38291 -p 38391:38391 -p 38491:38491 -p 38291:38291/udp quarkchaindocker/pyquarkchain:mainnet1.4.2

Interesting problem. I have set up several nodes but cannot reproduce the issue (although I don't use these nodes to mine). The error indicates that RocksDB's data files are corrupted; I have also heard that several other projects hit this corruption issue (e.g., geth).

I am going to set up a node with mining to see if it is reproducible. Also, do you start the node with the snapshot from us, or do you start it without a snapshot (and thus download all the blocks from peers)?

I start with the snapshot from you.