Proposal for Implementing --hotsync Feature in Concordium Node

Introduction

This proposal outlines the implementation of a --hotsync feature in the Concordium Node. The objective of this feature is to significantly reduce the time required for new node synchronization by leveraging regular database dumps. The proposal details necessary modifications to scripts, improved error handling, snapshot integrity verification, and monitoring, as well as critical considerations for securing and optimizing the system.

Motivation

The current node synchronization process can be time-consuming, especially when starting from the genesis block. By utilizing the --hotsync option, nodes can synchronize from the latest database dump, drastically reducing synchronization time and enhancing overall network efficiency.


Conceptual Approach

  1. Regular Database Dumps:
  • Generate hourly database dumps of the blockchain state.
  • Validate these dumps for integrity and upload them to a publicly accessible location (e.g., AWS S3).
  2. Hotsync Flag:
  • Introduce a --hotsync flag in the node startup process.
  • When enabled, the node downloads the latest database dump, initializes its state, and continues syncing from that point onward.
  3. Script and Configuration Modifications:
  • Modify existing scripts and configurations to accommodate the new --hotsync feature.
  4. Hourly Database Dumps by the Concordium Tech Team:
  • Concordium’s tech team must establish a reliable process for generating, validating, and uploading hourly database dumps to ensure this feature operates effectively.

Performance Benchmarks

The --hotsync feature offers a significant performance improvement by allowing nodes to synchronize from the latest available database dump, rather than starting from the genesis block. Below are speculative performance benchmarks based on the assumption that the size of the database dump is 50GB.

Current Synchronization Process (From Genesis Block):

  • Time to Synchronize: Syncing a node from the genesis block can take several days, depending on the node’s processing power, network speed, and the amount of historical data the blockchain contains.
    • Sync Speed: The process is slow due to the need to verify every block, transaction, and state update from the beginning of the blockchain.
    • Block Size and Network Latency: The cumulative size of all blocks from genesis to the current state is much larger than 50GB, making it a resource-intensive and time-consuming task.
    • Impact: This leads to slow onboarding of new nodes, preventing quick network expansion and causing delays in node deployment.

Expected Synchronization with --hotsync:

  • Time to Download 50GB Database Dump:
    • On a 10 Gbps Connection: Approximately 40 seconds to download the 50GB dump.
    • On a 1 Gbps Connection: Approximately 7-8 minutes to download the 50GB dump.
    • On a 100 Mbps Connection: Approximately 70-80 minutes to download the 50GB dump.
    • On a 10 Mbps Connection: Approximately 11-12 hours to download the 50GB dump.
  • Syncing After the Dump:
    • Once the dump is restored, the node will only need to sync the blocks from the time the dump was taken to the latest block, which is expected to be a much smaller range of data (depending on blockchain activity).
    • Assuming minimal blockchain activity, this final sync should take a few minutes to a few hours, depending on the network connection and block generation speed.
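
For reference, the download estimates above follow from a simple bandwidth calculation that ignores protocol overhead and congestion: download time ≈ (50 GB × 8 bits per byte) / link speed. For example, 400 Gb / 1 Gbps ≈ 400 s ≈ 7 minutes, and 400 Gb / 100 Mbps ≈ 4,000 s ≈ 67 minutes.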

Potential Performance Improvement:

  • Reduction in Sync Time: The --hotsync feature is expected to reduce node sync times from several days to hours or less (depending on network speed).
    • On a 10 Gbps Connection: A node could fully sync in under 1 hour, including dump download time.
    • On a 1 Gbps Connection: A node could fully sync in under 2 hours, including dump download time.
    • On a 100 Mbps Connection: Full sync could occur within 4-6 hours.
    • On a 10 Mbps Connection: Full sync may take 12-15 hours, still significantly faster than syncing from the genesis block.

These performance estimates assume ideal conditions for both the network and node hardware. Actual performance will depend on factors like network congestion, node specifications, and the volume of blockchain activity after the dump.


Implementation Steps

  1. Node Startup Logic
  • File: concordium-node/src/bin/concordium-node.rs
  • Implement logic to handle the --hotsync flag, including error handling, retry mechanisms, and checksum verification to ensure dump integrity (a rough sketch follows this list).
  2. Build Script for Database Exporter
  • File: concordium-node/scripts/db-exporter/build.sh
  • Update the build script to reflect changes for creating and managing database exports, ensuring compatibility with the --hotsync feature.
  3. Database Exporter Publish Script
  • File: concordium-node/scripts/db-exporter/database-exporter-publish.sh
  • Enhance this script to generate checksum files for dump integrity verification and upload them along with the database dumps.
  4. Start Script
  • File: concordium-node/scripts/start.sh
  • Modify to include logic for handling --hotsync flags and URLs during node startup, ensuring backward compatibility with existing setups.
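
To make step 1 concrete, here is a minimal, hypothetical Rust sketch of the intended flow (download, verify, restore, then continue the normal sync). The helper functions are placeholders introduced for this proposal and do not exist in concordium-node today.

// A minimal, hypothetical sketch of the proposed --hotsync startup path.
// The helper names below are placeholders for this proposal, not existing node APIs.
use std::error::Error;
use std::path::{Path, PathBuf};

fn download_dump(url: &str) -> Result<PathBuf, Box<dyn Error>> {
    // Placeholder: download the dump (with retries) and return its local path.
    unimplemented!("download {url}")
}

fn checksum_matches(dump: &Path, expected_sha256: &str) -> Result<bool, Box<dyn Error>> {
    // Placeholder: hash the file and compare it with the published checksum.
    unimplemented!("verify {} against {expected_sha256}", dump.display())
}

fn restore_dump(dump: &Path) -> Result<(), Box<dyn Error>> {
    // Placeholder: import the dump into the node's database directory.
    unimplemented!("restore {}", dump.display())
}

/// Proposed ordering: download, verify, restore, then continue the normal sync.
fn maybe_hotsync(hotsync_url: Option<&str>, expected_sha256: &str) -> Result<(), Box<dyn Error>> {
    if let Some(url) = hotsync_url {
        let dump = download_dump(url)?;
        if !checksum_matches(&dump, expected_sha256)? {
            return Err("checksum mismatch: refusing to restore the dump".into());
        }
        restore_dump(&dump)?;
    }
    Ok(()) // node startup continues and syncs the blocks produced after the dump
}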

Testing Plan and Strategy

To ensure the successful implementation and reliability of the --hotsync feature, a robust testing plan is essential. The testing strategy should cover a variety of conditions to simulate real-world scenarios, ensure data integrity, and verify performance improvements.

1. Functional Testing:

  • Objective: Ensure the --hotsync feature works as expected under normal conditions.
  • Test Cases:
    • Sync a node using the --hotsync flag and verify that it correctly downloads and restores the latest database dump.
    • Confirm that the node correctly syncs the remaining blocks after restoring the dump.
    • Validate the integrity of the node’s state after syncing is complete (using checksum and post-sync verification).

2. Performance Testing:

  • Objective: Measure the time saved by using --hotsync compared to traditional node synchronization.
  • Test Cases:
    • Compare the sync time for nodes syncing from the genesis block versus nodes using --hotsync under various network conditions (e.g., 1 Gbps, 100 Mbps, and 10 Mbps).
    • Measure the time taken to download the 50GB dump and sync the remaining blocks.
    • Track CPU, memory, and network usage during the sync process to identify any potential bottlenecks.

3. Stress Testing:

  • Objective: Ensure that the system can handle multiple nodes simultaneously syncing via --hotsync.
  • Test Cases:
    • Simulate the scenario where multiple nodes attempt to download the dump simultaneously and evaluate the impact on bandwidth and server performance.
    • Evaluate how the system behaves when there is a significant backlog of blocks to sync after restoring the dump.

4. Error Handling and Resilience:

  • Objective: Test the robustness of the system when facing network issues, corrupted dumps, or incomplete downloads.
  • Test Cases:
    • Simulate an interrupted or failed download and verify that the node retries the download or resumes from the last checkpoint.
    • Use an invalid or corrupted dump and verify that the node correctly identifies the issue and aborts the process.
    • Test how the node behaves when the connection to the dump server is slow or unstable.

5. Backward Compatibility Testing:

  • Objective: Ensure that nodes not using the --hotsync flag can still sync from the genesis block as expected.
  • Test Cases:
    • Run nodes without the --hotsync flag and confirm that traditional syncing from genesis works without any issues.
    • Verify that the introduction of --hotsync does not negatively affect nodes using the standard sync process.

Additional Considerations and Risk Mitigation

The --hotsync feature presents many advantages but also raises several concerns that must be addressed to ensure secure, scalable, and reliable operation.

1. Security of Public Database Dumps

  • Checksum Verification: To safeguard the integrity of the database dumps, checksum verification will be mandatory. Nodes will verify the checksum of the dump before applying it, ensuring that only validated dumps are used.
  • Signed URLs for Access: For added security, Concordium may consider using signed URLs or authenticated access to restrict who can download the dumps, preventing malicious actors from misusing the feature.
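
For illustration, below is a minimal sketch of how a node could verify a downloaded dump against a published SHA-256 checksum. It assumes the sha2 and hex crates and a hex-encoded checksum file hosted next to each dump; it is not existing node code.

use sha2::{Digest, Sha256};
use std::fs::File;
use std::io::{self, Read};

/// Compute the SHA-256 of a downloaded dump and compare it with the published
/// hex-encoded checksum. Returns Ok(true) only on an exact match.
fn dump_checksum_matches(dump_path: &str, expected_hex: &str) -> io::Result<bool> {
    let mut file = File::open(dump_path)?;
    let mut hasher = Sha256::new();
    let mut buf = [0u8; 8192];
    loop {
        // Hash the file in chunks so large dumps are never fully loaded into memory.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hex::encode(hasher.finalize()) == expected_hex.trim().to_lowercase())
}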

2. Decentralization and Storage Reliability

  • Decentralized Storage: Although AWS S3 is reliable, using decentralized storage systems such as IPFS or Filecoin could better align with blockchain’s decentralization principles. This option should be considered as the network scales.
  • Multi-region CDN Deployment: By distributing the dumps across multiple regions through a CDN like AWS CloudFront, nodes can access them faster and more reliably.

3. Performance Impact of Hourly Dumps

  • Dedicated Nodes for Dump Generation: Generating database dumps every hour can consume considerable resources. Using dedicated nodes or servers to generate and upload dumps will mitigate the risk of slowing down the main network.
  • Scheduling Low-Traffic Periods: Whenever possible, the generation of dumps should occur during periods of low network activity to minimize the impact on overall network performance.

4. Data Accuracy and Synchronization Consistency

  • Incremental Sync After Dump: After restoring from the database dump, nodes will continue syncing any missing blocks from the network to ensure they are fully synchronized and up-to-date.
  • Frequent Dump Updates: The hourly dumps provide a reasonable balance between minimizing sync time and ensuring the accuracy of the data. If needed, the frequency can be adjusted to match network demands.

5. Scalability and Storage Management

  • Retention Policy for Dumps: Implement a configurable retention policy to retain only the last 24-48 dumps. This will minimize long-term storage requirements while ensuring sufficient historical data for new node syncs.
  • Efficient Storage Use: Compress database dumps (e.g., using gzip) to reduce storage space usage and manage hosting costs more efficiently.
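
As a rough illustration of the retention idea above, the sketch below deletes all but the newest N dump files in a dedicated directory. The directory layout and file naming are assumptions for this example, not an existing exporter feature.

use std::{fs, io, path::{Path, PathBuf}, time::SystemTime};

/// Keep only the newest `keep` dump files in `dir`, deleting the rest.
/// Assumes the directory contains only dump files (e.g. *.sql.gz).
fn prune_old_dumps(dir: &Path, keep: usize) -> io::Result<()> {
    let mut dumps: Vec<(SystemTime, PathBuf)> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok())
        .filter(|entry| entry.path().is_file())
        .filter_map(|entry| {
            let modified = entry.metadata().ok()?.modified().ok()?;
            Some((modified, entry.path()))
        })
        .collect();
    // Sort newest first; everything beyond the retention count is removed.
    dumps.sort_by(|a, b| b.0.cmp(&a.0));
    for (_, path) in dumps.into_iter().skip(keep) {
        fs::remove_file(path)?;
    }
    Ok(())
}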

6. Handling Network Variability

  • Retry Logic and Incremental Downloads: Network issues may arise during large dump downloads. Incorporate retry mechanisms and allow incremental downloads so nodes can handle interruptions effectively.
  • Bandwidth Throttling: Offer throttling options for nodes with limited bandwidth to ensure they can handle large downloads without being overwhelmed.
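
A minimal sketch of the retry idea above, assuming the tokio/reqwest-based async approach proposed later in this thread; a production version would also resume partial downloads rather than restart them, and would stream to disk instead of buffering in memory.

use std::time::Duration;

/// Download `url`, retrying failed attempts with a simple exponential backoff.
async fn download_with_retries(url: &str, max_attempts: u32) -> Result<Vec<u8>, reqwest::Error> {
    let mut delay = Duration::from_secs(1);
    let mut attempt = 1;
    loop {
        match reqwest::get(url).await {
            Ok(response) => match response.error_for_status() {
                Ok(ok) => return Ok(ok.bytes().await?.to_vec()),
                Err(err) if attempt >= max_attempts => return Err(err),
                Err(_) => {} // HTTP error status: fall through and retry
            },
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {} // transport error: fall through and retry
        }
        tokio::time::sleep(delay).await;
        delay *= 2; // exponential backoff between attempts
        attempt += 1;
    }
}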

7. Node Vulnerability During Hotsync

  • Post-Hotsync Validation: Perform post-hotsync validation checks to ensure the node is in a valid state before rejoining the network. Nodes should be restricted from interacting with the network until synchronization is complete.
  • Network Access Control: To further protect nodes during synchronization, restrict external access until the hotsync process and validation checks are complete.

8. Legal and Compliance Considerations

  • Data Privacy and Jurisdiction: Ensure compliance with local data privacy regulations when distributing database dumps. Where necessary, apply geo-fencing or encryption to safeguard sensitive data in regions with stricter legal requirements.
  • Compliance Auditing: Periodically review storage and distribution methods to maintain compliance with applicable legal standards.

9. Backwards Compatibility

  • Optional and Backward-Compatible: The --hotsync feature is optional, meaning nodes that prefer the traditional synchronization method can continue using it without disruption.

10. CDN Overload and Hosting Costs

  • Load Balancing and Mirroring: Use load balancing and mirroring to distribute the demand for database dumps across multiple servers and regions. This will prevent CDN overload and ensure consistent performance.
  • Monitoring Usage: Implement monitoring and alerting to track dump access rates and performance. This ensures that the hosting infrastructure can handle peak traffic and usage patterns efficiently.

Benefits of --hotsync

  • Reduced Sync Times: Synchronizing from the latest validated dump allows new nodes to quickly catch up with the network.
  • Improved Reliability: Validated database dumps ensure that nodes sync from a consistent and accurate state.
  • Flexibility and Scalability: The feature simplifies the deployment of new nodes in large-scale networks.
  • Network Resilience: With robust error handling and retry logic, nodes can recover from network issues quickly.

Summary

Implementing the --hotsync feature involves several key modifications to the Concordium Node project. By introducing validated, hourly database dumps, the network can achieve faster, more efficient node synchronization. Addressing potential security, scalability, and performance concerns through proper risk mitigation will ensure this feature delivers both reliability and flexibility to the Concordium network.

I am sharing the 4 conceptual files in a zip here:

I would like to invite the community and tech and science teams to give feedback or to open a discussion on specific points to identify any overlooked aspects.

For any additional questions or further clarification, please do not hesitate to reach out.


Summary of File Changes for Implementing the --hotsync Feature

Below is a breakdown of the main file modifications (as shared via the Google Drive link) required to implement the --hotsync feature, which will significantly reduce node sync times by allowing nodes to download and sync from the latest database dump:


1. Start Script (start.sh):

  • Purpose: Manages node startup and handles the --hotsync flag.
  • Key Changes:
    • Added logic to check for the --hotsync flag.
    • If the --hotsync flag is present, the node downloads and syncs from the latest database dump via the HOTSYNC_URL.
    • Default fallback to the standard sync method if the flag is not provided.
  • Proposed Improvement:
    • Logging: Add detailed logs to track whether the --hotsync feature is triggered or if the standard sync process is used. This will help monitor and debug the process.

2. Database Exporter Publish Script (database-exporter-publish.sh):

  • Purpose: Handles the export of the database dump, validation, checksum generation, and uploading the dump to S3. It also invalidates the CloudFront cache to ensure the latest dump is delivered.
  • Key Changes:
    • Automates the export of the database, validates it, generates a checksum, and uploads both the dump and checksum to S3.
    • Includes logic to retry the export and validation process in case of failure.
    • Triggers CloudFront cache invalidation to make sure the latest dump is accessible.
  • Proposed Improvement:
    • Detailed Error Logs: Add logs to track each step of the export and validation process, such as when checksum generation happens or retries are triggered. This will provide transparency and help identify failures early.
    • Decentralized File Storage: Consider adding support for decentralized file storage platforms like IPFS or Filecoin to reduce reliance on centralized services like AWS S3. This would align with the principles of decentralization in the blockchain community and offer a more resilient, distributed storage solution.

3. Build Script (build.sh):

  • Purpose: Automates the setup of the database exporter, including creating necessary directories, configuration files, and systemd service timers.
  • Key Changes:
    • Sets up the database-exporter service, ensuring automated hourly exports of the blockchain state.
    • Configures systemd to run the exporter at regular intervals and ensures it has the required S3 and CloudFront details.
  • Proposed Improvement:
    • Post-Install Logging: Add logs during the post-installation phase, particularly for configuring the S3 bucket and CloudFront details. This will allow quick verification of the setup.
    • Support for Decentralized Storage: Include configurations for decentralized file storage (IPFS, Filecoin) to offer users an alternative to traditional cloud storage platforms. This would further align the --hotsync feature with the decentralized ethos of blockchain technology.

4. Node Startup Logic (concordium-node.rs):

  • Purpose: Implements the core logic for handling the --hotsync flag, including downloading, validating, and restoring the database dump.
  • Key Changes:
    • Added logic to handle the --hotsync flag, download the dump from the provided URL, and restore the database.
    • Includes checksum validation to ensure data integrity before restoring the dump.
    • Added retry logic for network failures and corrupted dump files.
  • Proposed Improvement:
    • Extended Error Handling and Logs: Add logs around potential failure points, such as checksum validation or download failures. This will help monitor and debug issues related to the dump’s integrity or network problems.
    • Decentralized Storage Integration: Allow nodes to fetch the database dump from decentralized storage solutions like IPFS or Filecoin in addition to centralized URLs. This offers flexibility and resilience in case centralized storage services are unavailable or underperforming.

Overall Proposed Improvements for Community, Tech, and Science Teams:

  1. Enhanced Logging and Monitoring:
  • All scripts would benefit from enhanced logging to capture more detailed information on the various steps and actions taken (e.g., when --hotsync is triggered, errors during checksum validation, or retries in case of network issues). This will provide transparency, improve debugging, and allow for better monitoring of the sync process.
  2. Performance Metrics and Monitoring:
  • Consider adding performance metrics tracking, such as the time taken to download the dump, restore it, and sync the remaining blocks. These metrics can provide insights into network performance and identify any bottlenecks in the syncing process.
  3. Testing for Resilience:
  • The feature should be tested under various real-world conditions, such as interrupted downloads, slow network connections, or dump file corruption. This will ensure that the --hotsync feature performs reliably under different scenarios.
  4. Support for Decentralized File Storage:
  • As part of decentralizing Concordium’s infrastructure, the community should consider integrating decentralized storage solutions like IPFS or Filecoin. This would reduce reliance on centralized cloud providers and align with the decentralized ethos of blockchain technology. It also offers a more resilient approach, as these decentralized networks have built-in redundancy and censorship resistance.

Proposal to Introduce Asynchronous Handling for Network Requests and Database Operations in the Concordium Node

Following on from the previous discussion on the --hotsync feature, we can take additional steps to optimize the Concordium Node by introducing asynchronous handling for network requests and database operations, especially in the context of --hotsync operations. This will further enhance the performance and scalability of the node, allowing it to handle more tasks concurrently without blocking other critical operations.


Recommended Changes and Enhancements

1. Start Script (start.sh)

Objective: To manage node startup more efficiently, the --hotsync feature can be enhanced by adding asynchronous handling for downloading and processing database dumps.

  • Changes:
    • Add asynchronous invocation for the --hotsync feature, allowing the process to run concurrently without blocking other node operations.
if [ -n "$HOTSYNC_URL" ]; then
    /concordium-node --hotsync --hotsync-url="$HOTSYNC_URL" $ARGS &
else
    /concordium-node $ARGS &
fi
  • Proposed Improvement:
    • Adding more detailed logs to track the --hotsync process will help monitor and debug the system when the feature is triggered, as well as log performance data for analysis.

2. Node Startup Logic (concordium-node.rs)

Objective: To handle network requests and database operations asynchronously using the Rust tokio library.

  • Changes:
    • Implement asynchronous handling for downloading database dumps and processing them for --hotsync via tokio. This ensures that network I/O and database operations are non-blocking, improving responsiveness.
use tokio::fs::File;
use tokio::io::AsyncWriteExt;
use reqwest::get;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder: in the node this would come from the --hotsync-url flag.
    if let Some(url) = get_hotsync_url() {
        hotsync(url).await?;
    }
    Ok(())
}

fn get_hotsync_url() -> Option<String> {
    // Placeholder implementation: read the URL from the environment.
    std::env::var("HOTSYNC_URL").ok()
}

async fn hotsync(url: String) -> Result<(), Box<dyn std::error::Error>> {
    // Download the dump and write it to a temporary file without blocking the runtime.
    let response = get(&url).await?;
    let bytes = response.bytes().await?;
    let mut file = File::create("/tmp/latest_db_dump.sql").await?;
    file.write_all(&bytes).await?;
    Ok(())
}
  • Proposed Improvement:
    • Introducing extended error handling for situations like corrupted dumps, failed downloads, or slow network speeds. Adding detailed logs for errors will help monitor failures and retry logic.

3. Build Script (Cargo.toml)

Objective: Enable asynchronous operations by adding the necessary dependencies for async handling in Rust.

  • Changes:
    • Update the Cargo.toml to include libraries for async operations, such as tokio and reqwest:
[dependencies]
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["json"] }
  • Proposed Improvement:
    • Ensure that the dependencies are updated to handle async file handling, networking, and error recovery seamlessly.

Combined Optimization Overview

Node Deployment and Sync Time Reduction

For nodes using the --hotsync feature, sync times can be drastically reduced from several days to just a few hours or minutes, depending on network speed. Here’s an estimation of potential improvements:

  • 10 Gbps Connection: ~20-35 minutes
  • 1 Gbps Connection: ~2-3 hours
  • 100 Mbps Connection: ~4-6 hours
  • 10 Mbps Connection: ~12-15 hours

This offers a significant reduction from the current synchronization times, which can take several days when syncing from the genesis block.

Transaction Processing Speed (TPS) Improvement

With the addition of asynchronous processing, the Concordium Node’s transaction throughput can improve considerably:

  • Current TPS: ~2000 TPS
  • Expected TPS with Async Processing: ~4000-6000 TPS

The introduction of async operations allows nodes to handle more transactions concurrently, leading to better resource utilization and faster processing times.


Performance Impact of Asynchronous Processing

By implementing asynchronous handling for network requests and database operations, the Concordium Node is expected to benefit in several key areas:

1. Reduced Latency and Improved Responsiveness

  • Latency Reduction: Async processing can reduce the time spent waiting for I/O operations by up to 50%, making the node more responsive.
  • Improved Responsiveness: The node can handle multiple operations concurrently, improving overall system performance.

2. Increased Throughput

  • Throughput Increase: The ability to process tasks concurrently can improve throughput by 2-3 times, allowing the node to manage more operations per unit of time.

3. Enhanced Scalability

  • Scalability Factor: With async handling, the node will utilize system resources more efficiently, increasing scalability by 2-3 times.

4. Faster Node Synchronization with --hotsync

  • Sync Time Reduction: Sync times for nodes using the --hotsync flag can be reduced by 50-70%, depending on the network speed and system resources.

5. Robust Error Handling and Resilience

  • Improved Error Handling: Adding retry mechanisms for network requests will improve resilience, reducing network operation failures by up to 70-90%.

Testing Plan for Asynchronous Processing

  1. Functional Testing:
  • Ensure that async network requests work as expected, handling database dumps without blocking other operations.
  2. Performance Testing:
  • Compare the system’s throughput and latency before and after implementing async processing, measuring performance improvements.
  3. Stress Testing:
  • Simulate multiple nodes downloading dumps simultaneously to assess the system’s scalability and performance under load.
  4. Error Handling Testing:
  • Test how the system reacts to interrupted downloads, slow connections, or corrupted dumps, ensuring robust recovery mechanisms.

Summary

Implementing asynchronous processing for network requests and database operations will significantly enhance the performance, responsiveness, and scalability of the Concordium Node. By reducing sync times, increasing throughput, and improving error resilience, this approach will complement the existing --hotsync feature proposal, making the node more robust and efficient.

I invite the community, along with the tech and science teams, to review these changes and provide feedback or raise any potential concerns. Let’s open a discussion on how we can further refine these enhancements and ensure the Concordium Node is optimized for scalability and performance.

Areas for Further Refinement:

  • High Block Activity Scenarios: More clarification on handling scenarios with heavy blockchain activity after dumps.
  • Detailed Decentralized Storage Integration: A deeper dive into how decentralized storage solutions could be implemented.
  • Security Enhancements: More emphasis on securing public database dumps and mitigating potential vulnerabilities.

Haven’t personally played with the node in a while; however, I am pretty sure there are parts of this that already exist and just need smaller modifications, like loading the DB when booting up. What I find clever in this proposal is that, with the right solution in place to continuously create historic versions, there won’t be any manual work needed by the core team; hence, the service level will remain high.

After spending additional time analyzing the Haskell code in the Concordium node, I would like to refine the proposed --hotsync feature to better align with the node’s current architecture and logic. The initial proposal, focusing on external scripts and snapshot handling, was conceptual. However, based on a deeper dive into the codebase, particularly the concordium-node binaries and internal state management, I’ve identified several areas where this feature could be more tightly integrated for performance improvements:

  1. Async State Management: Rather than handling database dumps entirely through external scripts, I propose integrating asynchronous handling of state initialization directly in the node’s Haskell components. This would allow smoother syncing with minimal downtime, enabling the node to handle the state load in parallel with the current block sync operations.
  2. Error Handling and Snapshots: By integrating Haskell’s native concurrency and error handling mechanisms, we can optimize the checksum verification and snapshot integrity processes to ensure they work seamlessly with Concordium’s consensus layer, reducing the need for additional bash scripts.
  3. Granularity in Dumps: Given the potential size of the state data, we could further optimize by segmenting the state dump into smaller snapshots. This aligns better with Concordium’s internal data structures and allows for better fault tolerance during the --hotsync process.

These refinements are important as they bring the hotsync process closer to the core node logic, allowing for smoother node restarts and faster syncing with fewer external dependencies.

I welcome any feedback or input on this refined direction, particularly on how to best handle state management asynchronously within the node while ensuring integrity and security.

A Conceptual approach, including local backups and decentralization of recovery

The goal is to avoid any central point of failure and improve overall speed, particularly when dealing with large data (e.g., 50GB files). Below is the approach:

1. Understand the Current Export Functionality

  • Current Implementation: Review the current backup/export functionality, which likely uses external scripts and lacks asynchronous support.
  • Identify Bottlenecks: Focus on the key areas that are causing inefficiencies in the current process. These include:
    • External script usage instead of native Haskell code.
    • Lack of async support, leading to long wait times when handling large files.

2. Design a Native Haskell Backup/Export Function

Goal: Shift from external scripts to a native Haskell-based export function that integrates directly into the node’s codebase.

Steps:

  • Create a New Haskell Module for Export Functionality:
    • Define a module (e.g., Concordium.Export.Backup) to handle the backup/export logic.
    • This module will manage data exports from the node’s internal state to a backup file on the local storage.
  • Integrate Async Handling:
    • Use Haskell’s async library or similar functionality to ensure the export process does not block other node operations.
    • Ensure that each stage of the backup process (data retrieval, serialization, file writing) can run asynchronously.
  • Streaming Large Files:
    • Use streaming I/O instead of loading everything into memory at once to handle large datasets.
    • Implement an optimized file writer that writes chunks of data to the file asynchronously, improving both memory efficiency and speed.

Code Design:


module Concordium.Export.Backup where

import qualified Data.ByteString as BS
import Control.Concurrent.Async (async, wait) -- from the async package
import System.IO (Handle, IOMode (..), withFile)

-- Example Async Function for Exporting Data
exportDataAsync :: FilePath -> IO ()
exportDataAsync filepath = do
    -- Run the export on a separate thread so callers can continue other work
    exportTask <- async (exportData filepath)
    -- Wait for the export to complete before returning
    wait exportTask

-- Function to Stream Data to File
exportData :: FilePath -> IO ()
exportData filepath = withFile filepath WriteMode $ \handle -> do
    -- Simulate streaming data in chunks
    let chunkSize = 4096 -- bytes
    streamData handle chunkSize
  where
    -- Stream the data in chunks
    streamData :: Handle -> Int -> IO ()
    streamData handle chunkSize = do
      let dataChunk = BS.replicate chunkSize 0 -- Dummy data, replace with actual export logic
      BS.hPut handle dataChunk -- Write one chunk to the file
      -- In a real implementation this would loop/recurse until all state is written.
      return ()

3. Optimize for Local Backups

The key optimization for local backups is to:

  • Avoid Network Latency: Ensure that the export process saves directly to the local filesystem, rather than relying on any central server.
  • Configurable Backup Locations: Provide configuration options in the node’s settings to allow users to define local backup paths. This will offer flexibility while ensuring no central point of failure.

Haskell Code Addition:

  • Extend the Concordium.Export.Backup module to accept user-defined local paths and handle backup retention (e.g., automatically remove older backups to conserve disk space).


-- Configurable Backup Path
getBackupPath :: IO FilePath
getBackupPath = do
    -- Fetch from configuration or environment
    return "/path/to/backup/location"

-- Export with User-Defined Path
exportWithConfig :: IO ()
exportWithConfig = do
    backupPath <- getBackupPath
    exportDataAsync backupPath

4. Add Validation and Error Handling

Since we want to ensure that the export function is both reliable and performant:

  • Checksum Validation: After the export is complete, calculate a checksum to ensure the integrity of the backup file.
  • Retry Logic: Add retry mechanisms if an export fails, with configurable limits for the number of retries.


-- Function to Validate Export Integrity (Checksum Example)
validateExport :: FilePath -> IO Bool
validateExport filepath = do
    -- Placeholder for checksum validation logic
    return True

-- Retry Logic Example: re-run the export and validate it, up to a retry limit
retryExport :: FilePath -> Int -> IO ()
retryExport filepath retries
  | retries <= 0 = error "Export failed after maximum retries."
  | otherwise = do
      exportDataAsync filepath
      success <- validateExport filepath
      if success
         then putStrLn "Export successful and validated."
         else retryExport filepath (retries - 1)

5. Expand for Scalability

To ensure scalability and high performance, this export function can:

  • Support Parallelism: If the node is handling multiple chains or data streams, export operations can be parallelized.
  • Decentralization: Backup locally on the node itself to remove reliance on a central server, making the system decentralized and less prone to failures.

6. Test and Benchmark

  • Test for Speed: Run benchmarks on the new export function with varying file sizes (e.g., 50GB and above).
  • Test for Resource Usage: Measure the impact on CPU, memory, and disk I/O to ensure that the async export process does not degrade the node’s overall performance.

Summary of Approach:

  • Shift from Scripts to Haskell: Replace the external script-based export process with a native Haskell-based implementation.
  • Async Support: Introduce asynchronous file handling to improve performance and reduce downtime when exporting large datasets.
  • Local Backup: Allow for user-defined backup paths to store backups locally, avoiding central points of failure.
  • Scalability and Decentralization: Ensure the system can scale with multiple nodes, and decentralize the backup process by making local backups default.

By refining the export function in this manner, we can significantly improve backup times, reduce system load, and increase the Concordium network’s resilience and performance.

The idea proposed is excellent if someone wants to do this locally: storing snapshots is essentially some kind of back-up that can be used in case data gets erased. We are however reticent to do this in a centralized way, because this introduces more trust assumptions. Instead of only trusting that they got the correct genesis block, nodes would now have to trust that the states they download from this central authority are correct, which is the exact opposite direction of where we should be going. We do not want to have a centralized entity that is the source of truth of the state of the chain. It should be the signatures of the (decentralized) validators that certify the state of the chain.

Luckily, there is a decentralized solution to this, which I have explained in the thread Light Node & Fast Catch-up. This explains how we can set up a light node that is much smaller and synchronizes very fast.

The long synchronization times remain for archival nodes. But since these should be run by professionals, they can afford to do local back-ups as suggested in this thread, if they feel the need to recover and synchronize faster.

@chportma replied in the new thread: