Introduction
This proposal outlines the implementation of a --hotsync feature in the Concordium Node. The objective of this feature is to significantly reduce the time required for new node synchronization by leveraging regular database dumps. The proposal details the necessary script modifications, improved error handling, snapshot integrity verification, and monitoring, as well as critical considerations for securing and optimizing the system.
Motivation
The current node synchronization process can be time-consuming, especially when starting from the genesis block. By utilizing the --hotsync option, nodes can synchronize from the latest database dump, drastically reducing synchronization time and enhancing overall network efficiency.
Conceptual Approach
- Regular Database Dumps:
- Generate hourly database dumps of the blockchain state.
- Validate these dumps for integrity and upload them to a publicly accessible location (e.g., AWS S3).
- Hotsync Flag:
- Introduce a --hotsync flag in the node startup process.
- When enabled, the node downloads the latest database dump, initializes its state, and continues syncing from that point onward.
- Script and Configuration Modifications:
- Modify existing scripts and configurations to accommodate the new --hotsync feature.
- Hourly Database Dumps by the Concordium Tech Team:
- Concordium’s tech team must establish a reliable process for generating, validating, and uploading hourly database dumps to ensure this feature operates effectively.
Performance Benchmarks
The --hotsync feature offers a significant performance improvement by allowing nodes to synchronize from the latest available database dump, rather than starting from the genesis block. Below are speculative performance benchmarks based on the assumption that the database dump is 50GB in size.
Current Synchronization Process (From Genesis Block):
- Time to Synchronize: Syncing a node from the genesis block can take several days, depending on the node’s processing power, network speed, and the amount of historical data the blockchain contains.
- Sync Speed: The process is slow due to the need to verify every block, transaction, and state update from the beginning of the blockchain.
- Block Size and Network Latency: The cumulative size of all blocks from genesis to the current state is much larger than 50GB, making it a resource-intensive and time-consuming task.
- Impact: This leads to slow onboarding of new nodes, preventing quick network expansion and causing delays in node deployment.
Expected Synchronization with --hotsync:
- Time to Download 50GB Database Dump:
- On a 10 Gbps Connection: Approximately 40 seconds to download the 50GB dump.
- On a 1 Gbps Connection: Approximately 7-8 minutes to download the 50GB dump.
- On a 100 Mbps Connection: Approximately 70-80 minutes to download the 50GB dump.
- On a 10 Mbps Connection: Approximately 11-12 hours to download the 50GB dump.
- Syncing After the Dump:
- Once the dump is restored, the node will only need to sync the blocks from the time the dump was taken to the latest block, which is expected to be a much smaller range of data (depending on blockchain activity).
- Assuming minimal blockchain activity, this final sync should take a few minutes to a few hours, depending on the network connection and block generation speed.
Potential Performance Improvement:
- Reduction in Sync Time: The --hotsync feature is expected to reduce node sync times from several days to hours or less (depending on network speed).
- On a 10 Gbps Connection: A node could fully sync in under 1 hour, including dump download time.
- On a 1 Gbps Connection: A node could fully sync in under 2 hours, including dump download time.
- On a 100 Mbps Connection: Full sync could occur within 4-6 hours.
- On a 10 Mbps Connection: Full sync may take 12-15 hours, still significantly faster than syncing from the genesis block.
These performance estimates assume ideal conditions for both the network and node hardware. Actual performance will depend on factors like network congestion, node specifications, and the volume of blockchain activity after the dump.
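For reference, the download estimates above follow from simple arithmetic (dump size in bits divided by link speed), assuming the full link bandwidth is available and ignoring protocol overhead:

```bash
# Rough download-time estimate for a 50GB dump (decimal units, no protocol overhead)
# time_seconds = size_bytes * 8 / link_bits_per_second
echo $(( 50 * 10**9 * 8 / 10**9 ))          # 1 Gbps   -> 400 s  (~7 minutes)
echo $(( 50 * 10**9 * 8 / (100 * 10**6) ))  # 100 Mbps -> 4000 s (~67 minutes)
```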
Implementation Steps
- Node Startup Logic
- File: concordium-node/src/bin/concordium-node.rs
- Implement logic to handle the --hotsync flag, including error handling, retry mechanisms, and checksum verification to ensure dump integrity.
- Build Script for Database Exporter
- File: concordium-node/scripts/db-exporter/build.sh
- Update the build script to reflect changes for creating and managing database exports, ensuring compatibility with the --hotsync feature.
- Database Exporter Publish Script
- File: concordium-node/scripts/db-exporter/database-exporter-publish.sh
- Enhance this script to generate checksum files for dump integrity verification and upload them along with the database dumps (a sketch follows this list).
- Start Script
- File: concordium-node/scripts/start.sh
- Modify to include logic for handling the --hotsync flag and dump URL during node startup, ensuring backward compatibility with existing setups (see the second sketch after this list).
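To make the publish-script change concrete, here is a minimal sketch of how a dump could be compressed, checksummed, and uploaded. The file names, bucket, and use of the AWS CLI are placeholders for illustration, not the existing Concordium tooling:

```bash
#!/usr/bin/env bash
# Hypothetical addition to database-exporter-publish.sh:
# compress the dump, generate a SHA-256 checksum, and upload both artifacts.
set -euo pipefail

DUMP_FILE="$1"                           # dump file name, assumed to be in the current directory
BUCKET="s3://example-concordium-dumps"   # placeholder bucket

# Compress the dump to reduce storage and transfer costs.
gzip --keep --force "${DUMP_FILE}"

# Produce a checksum file that nodes verify before restoring the dump.
sha256sum "${DUMP_FILE}.gz" > "${DUMP_FILE}.gz.sha256"

# Upload the dump, its checksum, and a small pointer file so nodes can discover the newest dump.
aws s3 cp "${DUMP_FILE}.gz"        "${BUCKET}/"
aws s3 cp "${DUMP_FILE}.gz.sha256" "${BUCKET}/"
echo "${DUMP_FILE}.gz" > latest.txt
aws s3 cp latest.txt "${BUCKET}/latest.txt"
```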
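And a sketch of the download, verify, and restore flow that start.sh could follow when --hotsync is enabled; the node's own Rust-side flag handling would mirror the same retry and checksum-verification steps. The environment variables, URL, paths, and binary location are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical --hotsync handling in start.sh: fetch the latest dump, verify it,
# restore it, then start the node, which syncs forward from the restored state.
set -euo pipefail

HOTSYNC="${HOTSYNC:-0}"                                      # set to 1 to enable hotsync
DUMP_BASE_URL="${DUMP_BASE_URL:-https://dumps.example.com}"  # placeholder URL
DATA_DIR="${DATA_DIR:-/var/lib/concordium/data}"             # placeholder data directory

if [[ "${HOTSYNC}" == "1" ]]; then
    latest="$(curl -fsSL "${DUMP_BASE_URL}/latest.txt")"

    # Resume partial downloads (-C -) and retry transient failures (--retry).
    curl -fL -C - --retry 5 --retry-delay 10 -o "/tmp/${latest}" "${DUMP_BASE_URL}/${latest}"
    curl -fsSL -o "/tmp/${latest}.sha256" "${DUMP_BASE_URL}/${latest}.sha256"

    # Abort if the checksum does not match; never restore an unverified dump.
    (cd /tmp && sha256sum -c "${latest}.sha256")

    # Restore the dump before the node starts syncing the remaining blocks.
    mkdir -p "${DATA_DIR}"
    gunzip -c "/tmp/${latest}" > "${DATA_DIR}/${latest%.gz}"
fi

exec /concordium-node "$@"   # placeholder binary path; the node continues syncing normally
```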
Testing Plan and Strategy
To ensure the successful implementation and reliability of the --hotsync feature, a robust testing plan is essential. The testing strategy should cover a variety of conditions to simulate real-world scenarios, ensure data integrity, and verify performance improvements.
1. Functional Testing:
- Objective: Ensure the --hotsync feature works as expected under normal conditions.
- Test Cases:
- Sync a node using the --hotsync flag and verify that it correctly downloads and restores the latest database dump.
- Confirm that the node correctly syncs the remaining blocks after restoring the dump.
- Validate the integrity of the node’s state after syncing is complete (using checksum and post-sync verification).
2. Performance Testing:
- Objective: Measure the time saved by using --hotsync compared to traditional node synchronization.
- Test Cases:
- Compare the sync time for nodes syncing from the genesis block versus nodes using --hotsync under various network conditions (e.g., 1 Gbps, 100 Mbps, and 10 Mbps).
- Measure the time taken to download the 50GB dump and sync the remaining blocks.
- Track CPU, memory, and network usage during the sync process to identify any potential bottlenecks.
3. Stress Testing:
- Objective: Ensure that the system can handle multiple nodes simultaneously syncing via --hotsync.
- Test Cases:
- Simulate the scenario where multiple nodes attempt to download the dump simultaneously and evaluate the impact on bandwidth and server performance.
- Evaluate how the system behaves when there is a significant backlog of blocks to sync after restoring the dump.
4. Error Handling and Resilience:
- Objective: Test the robustness of the system when facing network issues, corrupted dumps, or incomplete downloads.
- Test Cases:
- Simulate an interrupted or failed download and verify that the node retries the download or resumes from the last checkpoint.
- Use an invalid or corrupted dump and verify that the node correctly identifies the issue and aborts the process.
- Test how the node behaves when the connection to the dump server is slow or unstable.
5. Backward Compatibility Testing:
- Objective: Ensure that nodes not using the --hotsync flag can still sync from the genesis block as expected.
- Test Cases:
- Run nodes without the --hotsync flag and confirm that traditional syncing from genesis works without any issues.
- Verify that the introduction of --hotsync does not negatively affect nodes using the standard sync process.
Additional Considerations and Risk Mitigation
The --hotsync feature presents many advantages but also raises several concerns that must be addressed to ensure secure, scalable, and reliable operation.
1. Security of Public Database Dumps
- Checksum Verification: To safeguard the integrity of the database dumps, checksum verification will be mandatory. Nodes will verify the checksum of the dump before applying it, ensuring that only validated dumps are used.
- Signed URLs for Access: For added security, Concordium may consider using signed URLs or authenticated access to restrict who can download the dumps, preventing malicious actors from misusing the feature.
2. Decentralization and Storage Reliability
- Decentralized Storage: Although AWS S3 is reliable, using decentralized storage systems such as IPFS or Filecoin could better align with blockchain’s decentralization principles. This option should be considered as the network scales.
- Multi-region CDN Deployment: By distributing the dumps across multiple regions through a CDN like AWS CloudFront, nodes can access them faster and more reliably.
3. Performance Impact of Hourly Dumps
- Dedicated Nodes for Dump Generation: Generating database dumps every hour can consume considerable resources. Using dedicated nodes or servers to generate and upload dumps will mitigate the risk of slowing down the main network.
- Scheduling Low-Traffic Periods: Whenever possible, the generation of dumps should occur during periods of low network activity to minimize the impact on overall network performance.
4. Data Accuracy and Synchronization Consistency
- Incremental Sync After Dump: After restoring from the database dump, nodes will continue syncing any missing blocks from the network to ensure they are fully synchronized and up-to-date.
- Frequent Dump Updates: The hourly dumps provide a reasonable balance between minimizing sync time and ensuring the accuracy of the data. If needed, the frequency can be adjusted to match network demands.
5. Scalability and Storage Management
- Retention Policy for Dumps: Implement a configurable retention policy to retain only the last 24-48 dumps (see the sketch at the end of this section). This will minimize long-term storage requirements while ensuring sufficient historical data for new node syncs.
- Efficient Storage Use: Compress database dumps (e.g., using gzip) to reduce storage space usage and manage hosting costs more efficiently.
6. Handling Network Variability
- Retry Logic and Incremental Downloads: Network issues may arise during large dump downloads. Incorporate retry mechanisms and allow incremental downloads so nodes can handle interruptions effectively.
- Bandwidth Throttling: Offer throttling options for nodes with limited bandwidth to ensure they can handle large downloads without being overwhelmed.
7. Node Vulnerability During Hotsync
- Post-Hotsync Validation: Perform post-hotsync validation checks to ensure the node is in a valid state before rejoining the network. Nodes should be restricted from interacting with the network until synchronization is complete.
- Network Access Control: To further protect nodes during synchronization, restrict external access until the hotsync process and validation checks are complete.
8. Legal and Compliance Considerations
- Data Privacy and Jurisdiction: Ensure compliance with local data privacy regulations when distributing database dumps. Where necessary, apply geo-fencing or encryption to safeguard sensitive data in regions with stricter legal requirements.
- Compliance Auditing: Periodically review storage and distribution methods to maintain compliance with applicable legal standards.
9. Backwards Compatibility
- Optional and Backward-Compatible: The --hotsync feature is optional, meaning nodes that prefer the traditional synchronization method can continue using it without disruption.
10. CDN Overload and Hosting Costs
- Load Balancing and Mirroring: Use load balancing and mirroring to distribute the demand for database dumps across multiple servers and regions. This will prevent CDN overload and ensure consistent performance.
- Monitoring Usage: Implement monitoring and alerting to track dump access rates and performance. This ensures that the hosting infrastructure can handle peak traffic and usage patterns efficiently.
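As an example of the retention policy from point 5, a single S3 lifecycle rule could expire hourly dumps two days after upload (roughly the last 48 dumps kept). The bucket name and prefix below are placeholders:

```bash
# Hypothetical retention policy: expire hourly dumps two days after upload (~48 dumps kept).
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-concordium-dumps \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-dumps",
        "Filter": { "Prefix": "dumps/" },
        "Status": "Enabled",
        "Expiration": { "Days": 2 }
      }
    ]
  }'
```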
Benefits of --hotsync
- Reduced Sync Times: Synchronizing from the latest validated dump allows new nodes to quickly catch up with the network.
- Improved Reliability: Validated database dumps ensure that nodes sync from a consistent and accurate state.
- Flexibility and Scalability: The feature simplifies the deployment of new nodes in large-scale networks.
- Network Resilience: With robust error handling and retry logic, nodes can recover from network issues quickly.
Summary
Implementing the --hotsync feature involves several key modifications to the Concordium Node project. By introducing validated, hourly database dumps, the network can achieve faster, more efficient node synchronization. Addressing potential security, scalability, and performance concerns through proper risk mitigation will ensure this feature delivers both reliability and flexibility to the Concordium network.
I am sharing the 4 conceptual files in a zip here:
I would like to invite the community, as well as Concordium's tech and science teams, to give feedback or open a discussion on specific points to identify any overlooked aspects.
For any additional questions or further clarification, please do not hesitate to reach out.