How much RAM do you have on your cluster server?
8 GB of the 12 GB RAM is available; only about 4 GB is used in total.
4-core CPU, load about 2.5-3.5.
Since the last recovery, done by downloading the latest data and restarting the Docker image with that data, the cluster has now been running without errors for almost 36 hours.
I had to stop the cluster again as rewards weren't coming in. Restarting produced these error lines:
SLAVE_S2: E0620 12:49:51.347818 protocol.py:168] Traceback (most recent call last):
SLAVE_S2: File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S2: await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S2: File "/code/pyquarkchain/quarkchain/protocol.py", line 152, in handle_metadata_and_raw_data
SLAVE_S2: await self.__handle_rpc_request(op, cmd, rpc_id, metadata)
SLAVE_S2: File "/code/pyquarkchain/quarkchain/protocol.py", line 127, in __handle_rpc_request
SLAVE_S2: resp = await handler(self, request)
SLAVE_S2: File "slave.py", line 749, in handle_add_xshard_tx_list_request
SLAVE_S2: req.minor_block_hash, req.tx_list
SLAVE_S2: File "/code/pyquarkchain/quarkchain/cluster/shard_state.py", line 1325, in add_cross_shard_tx_list_by_minor_block_hash
SLAVE_S2: self.db.put_minor_block_xshard_tx_list(h, tx_list)
SLAVE_S2: File "/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py", line 478, in put_minor_block_xshard_tx_list
SLAVE_S2: self.db.put(b"xShard_" + h, tx_list.serialize())
SLAVE_S2: File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2: return self._db.put(key, value)
SLAVE_S2: File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2: File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349'
SLAVE_S2:
SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b'S0'
SLAVE_S0: E0620 12:49:51.352953 miner.py:339] Traceback (most recent call last):
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/miner.py", line 333, in submit_work
SLAVE_S0: await self.add_block_async_func(block)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 544, in __add_block
SLAVE_S0: await self.handle_new_block(block)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 708, in handle_new_block
SLAVE_S0: await self.add_block(block)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 770, in add_block
SLAVE_S0: await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)
SLAVE_S0: File "slave.py", line 1105, in broadcast_xshard_tx_list
SLAVE_S0: responses = await asyncio.gather(*rpc_futures)
SLAVE_S0: RuntimeError: S0<->S2: connection abort
And further on:
MASTER: E0620 12:50:11.319340 master.py:120] Traceback (most recent call last):
MASTER: File "master.py", line 118, in sync
MASTER: await self.__run_sync()
MASTER: File "master.py", line 262, in __run_sync
MASTER: await self.__add_block(block)
MASTER: File "master.py", line 286, in __add_block
MASTER: await self.master_server.add_root_block(root_block)
MASTER: File "master.py", line 1280, in add_root_block
MASTER: result_list = await asyncio.gather(*future_list)
MASTER: RuntimeError: master_slave_b'S2': connection abort
MASTER:
MASTER: I0620 12:50:11.320236 simple_network.py:177] Closing peer 5ec483ebffad228d24f62423c984153d801a7ce8bfdba5019ea721d054ead893 with the following reason: master_slave_b'S2': connection abort
MASTER: I0620 12:50:11.321521 master.py:423] [R] done sync task 628800 1015aeb71bf08dae59fc2412b398f7b3315a1de296fc01bb74eadcf85322b9fc
MASTER: I0620 12:50:11.322227 master.py:428] [R] synchronizer finished!
MASTER: E0620 12:50:11.326245 p2p_manager.py:146] Unknown exception from <Node(0x5ec4@116.203.214.61)>, message: OperationCancelled('Cancellation requested by QuarkServer:p2pserver token')
MASTER: E0620 12:50:11.333310 p2p_manager.py:147] Traceback (most recent call last):
MASTER: File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 131, in _run
MASTER: metadata, raw_data = await self.secure_peer.read_metadata_and_raw_data()
MASTER: File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 292, in read_metadata_and_raw_data
MASTER: data = await self.quark_peer.read_raw_bytes(timeout=None)
MASTER: File "/code/pyquarkchain/quarkchain/p2p/p2p_manager.py", line 115, in read_raw_bytes
MASTER: frame_data = await self.read(frame_size + MAC_LEN, timeout=timeout)
MASTER: File "/code/pyquarkchain/quarkchain/p2p/peer.py", line 344, in read
MASTER: return await self.wait(self.reader.readexactly(n), timeout=timeout)
MASTER: File "/code/pyquarkchain/quarkchain/p2p/cancellable.py", line 18, in wait
MASTER: return await self.wait_first(awaitable, token=token, timeout=timeout)
MASTER: File "/code/pyquarkchain/quarkchain/p2p/cancellable.py", line 42, in wait_first
MASTER: return await token_chain.cancellable_wait(*awaitables, timeout=timeout)
MASTER: File "/code/pyquarkchain/quarkchain/p2p/cancel_token/token.py", line 144, in cancellable_wait
MASTER: "Cancellation requested by {} token".format(self.triggered_token)
MASTER: quarkchain.p2p.cancel_token.exceptions.OperationCancelled: Cancellation requested by QuarkServer:p2pserver token
MASTER:
SLAVE_S1: I0620 12:50:11.337485 shard_state.py:1493] [1/0] shard tip reset from 3640728 to 3640746 by root block 627356
MASTER: I0620 12:50:11.337944 peer.py:1125] QuarkPeer <Node(0x5ec4@116.203.214.61)> finished, removing from pool
MASTER: I0620 12:50:11.350669 discovery.py:500] stopping discovery
MASTER: I0620 12:50:11.352723 peer.py:943] Stopping all peers ...
MASTER: I0620 12:50:11.452553 p2p_server.py:178] Closing server...
SLAVE_S1: I0620 12:50:11.453733 shard_state.py:1493] [5/0] shard tip reset from 3637361 to 3637379 by root block 627356
MASTER: I0620 12:50:11.455222 master.py:1864] Master server is shutdown
SLAVE_S2 is dead. Shutting down the cluster...
Have you ever received rewards? It's strange that you aren't receiving rewards, unless the node hasn't finished syncing.
Please try the stats tool to find out whether it has finished syncing.
Sure, I have received rewards. I used the stats tool when they stopped coming in, and that's how I noticed in the first place that syncing was ongoing but no new blocks were being added, so it was stuck. After restarting, this error occurred. I have now solved it again by downloading the data bootstrap from Saturday and restarting the cluster with that data; the chain is synced and rewards are coming in again. So the only solution when such an error occurs is to download the latest data and start from there...
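For anyone hitting the same corruption, my recovery procedure is roughly the following (a sketch only; the snapshot URL is a placeholder, use whatever data bootstrap link the team currently publishes):

# stop and remove the stuck container
docker stop <container-id> && docker rm <container-id>
# wipe the corrupted databases from the host data directory
rm -rf /QKC/data/*
# download and unpack the latest data bootstrap (placeholder URL)
wget https://example.com/qkc-mainnet-data.tar.gz
tar -xzf qkc-mainnet-data.tar.gz -C /QKC/data
# then start the container again with the usual docker run command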
And yesterday evening it happened again:
SLAVE_S0: I0622 20:29:23.712678 shard.py:703] [0/0] got new block with height 3691243
SLAVE_S2: E0622 20:29:23.745532 protocol.py:168] Traceback (most recent call last):
SLAVE_S2: File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S2: await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S2: File "/code/pyquarkchain/quarkchain/protocol.py", line 152, in handle_metadata_and_raw_data
SLAVE_S2: await self.__handle_rpc_request(op, cmd, rpc_id, metadata)
SLAVE_S2: File "/code/pyquarkchain/quarkchain/protocol.py", line 127, in __handle_rpc_request
SLAVE_S2: resp = await handler(self, request)
SLAVE_S2: File "slave.py", line 749, in handle_add_xshard_tx_list_request
SLAVE_S2: req.minor_block_hash, req.tx_list
SLAVE_S2: File "/code/pyquarkchain/quarkchain/cluster/shard_state.py", line 1325, in add_cross_shard_tx_list_by_minor_block_hash
SLAVE_S2: self.db.put_minor_block_xshard_tx_list(h, tx_list)
SLAVE_S2: File "/code/pyquarkchain/quarkchain/cluster/shard_db_operator.py", line 478, in put_minor_block_xshard_tx_list
SLAVE_S2: self.db.put(b"xShard_" + h, tx_list.serialize())
SLAVE_S2: File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2: return self._db.put(key, value)
SLAVE_S2: File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2: File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 823391264, got 1138716802 in ./qkc-data/mainnet/shard-131073.db/006482.sst offset 59586622 size 3341'
SLAVE_S2:
SLAVE_S2: I0622 20:29:23.745861 slave.py:710] Closing connection with slave b'S0'
SLAVE_S0: E0622 20:29:23.749307 protocol.py:181] Traceback (most recent call last):
SLAVE_S0: File "/code/pyquarkchain/quarkchain/protocol.py", line 175, in loop_once
SLAVE_S0: metadata, raw_data = await self.read_metadata_and_raw_data()
SLAVE_S0: File "/code/pyquarkchain/quarkchain/protocol.py", line 269, in read_metadata_and_raw_data
SLAVE_S0: size_bytes = await self.__read_fully(4, allow_eof=True)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/protocol.py", line 254, in __read_fully
SLAVE_S0: bs = await self.reader.read(n)
SLAVE_S0: File "/usr/local/lib/python3.7/asyncio/streams.py", line 640, in read
SLAVE_S0: await self._wait_for_data('read')
SLAVE_S0: File "/usr/local/lib/python3.7/asyncio/streams.py", line 473, in _wait_for_data
SLAVE_S0: await self._waiter
SLAVE_S0: File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 804, in _read_ready__data_received
SLAVE_S0: data = self._sock.recv(self.max_size)
SLAVE_S0: ConnectionResetError: [Errno 104] Connection reset by peer
SLAVE_S0:
SLAVE_S0: I0622 20:29:23.749708 slave.py:710] Closing connection with slave b'S2'
SLAVE_S0: E0622 20:29:23.750336 protocol.py:168] Traceback (most recent call last):
SLAVE_S0: File "/code/pyquarkchain/quarkchain/protocol.py", line 166, in __internal_handle_metadata_and_raw_data
SLAVE_S0: await self.handle_metadata_and_raw_data(metadata, raw_data)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/protocol.py", line 147, in handle_metadata_and_raw_data
SLAVE_S0: await self.__handle_request(op, cmd)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/protocol.py", line 123, in __handle_request
SLAVE_S0: await handler(self, op, request, 0)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 197, in handle_new_block_minor_command
SLAVE_S0: await self.shard.handle_new_block(cmd.block)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 708, in handle_new_block
SLAVE_S0: await self.add_block(block)
SLAVE_S0: File "/code/pyquarkchain/quarkchain/cluster/shard.py", line 770, in add_block
SLAVE_S0: await self.slave.broadcast_xshard_tx_list(block, xshard_list, prev_root_height)
SLAVE_S0: File "slave.py", line 1105, in broadcast_xshard_tx_list
SLAVE_S0: responses = await asyncio.gather(*rpc_futures)
SLAVE_S0: RuntimeError: S0<->S2: connection abort
SLAVE_S0:
SLAVE_S0: E0622 20:29:23.750675 shard.py:68] Closing shard connection with error slave_master_vconn_11408676719790611844: error processing request: S0<->S2: connection abort
Seems the key reason is RocksDB:
SLAVE_S2: File "/code/pyquarkchain/quarkchain/db.py", line 91, in put
SLAVE_S2: return self._db.put(key, value)
SLAVE_S2: File "rocksdb/_rocksdb.pyx", line 1472, in rocksdb._rocksdb.DB.put
SLAVE_S2: File "rocksdb/_rocksdb.pyx", line 75, in rocksdb._rocksdb.check_status
SLAVE_S2: rocksdb.errors.Corruption: b'Corruption: block checksum mismatch: expected 1300880301, got 2310159692 in ./qkc-data/mainnet/shard-393217.db/006376.sst offset 31076911 size 2349'
SLAVE_S2:
SLAVE_S2: I0620 12:49:51.348179 slave.py:710] Closing connection with slave b'S0'
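If you still have the corrupted data directory around, RocksDB's bundled tools can confirm the damage. A quick check could look like this (paths taken from the traceback above; whether the tools are installed depends on your rocksdb build):

# verify the checksums of the SST file named in the error
sst_dump --file=./qkc-data/mainnet/shard-393217.db/006376.sst --command=check
# or attempt a best-effort repair of the whole shard DB (can lose data)
ldb --db=./qkc-data/mainnet/shard-393217.db repair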
Can you give us your command for starting the Docker image and the cluster? I wonder if a dependency is broken. You are probably the only one who regularly sees this issue.
Hi,
First I export the config file path:
export QKC_CONFIG=`pwd`/mainnet/singularity/cluster_config_template.json
After that I start the cluster:
python3 quarkchain/cluster/cluster.py --cluster_config $QKC_CONFIG
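For convenience I keep both steps in a small script (a hypothetical start_cluster.sh, assuming it is run from the repository root, /code/pyquarkchain, so the relative config path resolves):

#!/bin/sh
# resolve the cluster config relative to the repo root, then launch the cluster
export QKC_CONFIG=`pwd`/mainnet/singularity/cluster_config_template.json
python3 quarkchain/cluster/cluster.py --cluster_config $QKC_CONFIG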
You said you are using our Docker images, right? Can you also provide the command you use to start the container?
docker run -v /QKC/data:/code/pyquarkchain/quarkchain/cluster/qkc-data/mainnet -it --restart=always -p 38291:38291 -p 38391:38391 -p 38491:38491 -p 38291:38291/udp quarkchaindocker/pyquarkchain:mainnet1.4.2
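For readability, the same command split across lines with comments (the port roles are my reading of the default mainnet config):

# -v mounts /QKC/data on the host so chain data survives container restarts;
# 38291 (TCP and UDP) is the P2P/discovery port, 38391 and 38491 the public
# and private JSON-RPC ports, as I understand the defaults
docker run \
  -v /QKC/data:/code/pyquarkchain/quarkchain/cluster/qkc-data/mainnet \
  -it --restart=always \
  -p 38291:38291 -p 38391:38391 -p 38491:38491 -p 38291:38291/udp \
  quarkchaindocker/pyquarkchain:mainnet1.4.2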
Interesting problem. I have set up several nodes but cannot reproduce the issue (although I don't use these nodes to mine). The error indicates that RocksDB's data files are corrupted; I have also heard that several other projects (e.g., geth) have hit similar corruption issues.
I am going to set up a node with mining to see if it is reproducible. Also, do you start the node with the snapshot from us, or without a snapshot (so that it downloads all the blocks from peers)?
I start with the snapshot from you.
OK. The issue may also come from the snapshot. I will give that a try as well.
I have been running pyqkc with the snapshot for about 12 hours; no issue found yet.
Same on my side. I have been running the cluster for several days now without any issues.
In the past I was also able to run it for several months without a crash, both on a DigitalOcean VPS and on my home computer.
No update for now; I was waiting to see whether you had found something.
Looks like the error cannot be reproduced, so I will have to change my setup, since the errors happen too often. Running the mining cluster on a VPS, as suggested, may be a good option.
Same issue here. Did you fix it?
Also, is there any way to auto-restart the cluster?
Maybe someone can share best practices for setting this up. Thanks in advance.
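For now I am considering a simple watchdog along these lines (an untested sketch; the container name qkc-node is made up, and note that the --restart=always flag in the docker run command above should already restart the container after a plain exit):

#!/bin/sh
# hypothetical watchdog: restart the qkc-node container whenever it is not running
while true; do
  if [ -z "$(docker ps -q -f name=qkc-node)" ]; then
    docker start qkc-node
  fi
  sleep 60
done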