Light Node & Fast Catch-up

A Concordium node has grown very large, to well over 100 GB, and it takes several days to synchronize with the chain when starting from scratch. This is partly due to the increase in transactions and partly due to the new consensus protocol, which produces a block every 2 seconds.

The following is a sketch of how we intend to handle this. The project hasn’t kicked off yet, and a lot of the details are still missing. But ideally we start planning in Q4 and can have it done by mid 2025.

The current nodes store the entire history of the chain, but this is not needed for most applications. A validator only needs the current state of the chain and a little history to help peers catch up. Nodes that are used to read the current state and submit transactions do not need the entire history either. So we intend to develop a light node that only stores the current state and minimal history information.

Where it gets technically complicated (and fun) is how a new node can join a network of light nodes: if you can’t download the entire chain from the genesis block, how do you know that the state you are getting from a peer is correct? The answer is fast catch-up. Since the P6 update, blocks in the last hour of a pay day contain information about the finalization committee for the next pay day. So by obtaining one block per day along with its finalization proof, a node can use the committee information in each block to verify the finalization proof for the next day’s block, repeating until it reaches the newest block, where it can check that the hash of the state it was given matches the hash in that block. In a nutshell: one block per day from genesis constitutes a decentralized proof that the state of the chain is correct. Update P7, which should be out in a month or two, changes the block hash structure in a way that will make such proofs easier. The genesis of the P7 chain will be used as the new genesis for light nodes.
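
As a rough illustration, the daily verification loop could look like the following Haskell sketch. All names here (DailyBlock, FinalizationCommittee, verifyProof, and so on) are illustrative placeholders, not actual Concordium APIs:

haskell

-- Illustrative sketch of fast catch-up verification; all types are placeholders.
module FastCatchUpSketch where

type FinalizationProof = () -- stand-in for a real finalization proof
type StateHash = String     -- stand-in for a real state hash

data FinalizationCommittee = FinalizationCommittee

data DailyBlock = DailyBlock
    { blockProof     :: FinalizationProof      -- finalization proof for this block
    , nextCommittee  :: FinalizationCommittee  -- next pay day's committee (included since P6)
    , blockStateHash :: StateHash              -- state hash recorded in the block
    }

-- Check one block's finalization proof against a committee (placeholder).
verifyProof :: FinalizationCommittee -> FinalizationProof -> Bool
verifyProof _ _ = undefined

-- Walk one block per pay day from genesis: each verified block yields the
-- committee needed to check the next day's proof; at the newest block, the
-- state received from a peer must match the hash recorded in that block.
verifyChain :: FinalizationCommittee -> [DailyBlock] -> StateHash -> Bool
verifyChain _ [] _ = False
verifyChain committee [newest] stateHash =
    verifyProof committee (blockProof newest)
        && blockStateHash newest == stateHash
verifyChain committee (block : rest) stateHash =
    verifyProof committee (blockProof block)
        && verifyChain (nextCommittee block) rest stateHash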

With this change, light nodes should be just a few GB in size, and synchronizing with fast catch-up should take minutes. There will then be two types of nodes: the current nodes, which can be thought of as archival nodes, and light nodes. Most people are expected to use light nodes, but Concordium will continue running some archival nodes, as will anyone else with a business or other interest in storing the entire history of the chain locally.

One thing we still need to work out is what kind of services archival nodes should provide. Should they be able to prove something about the chain at any point in the past, e.g., to produce a history of transactions for tax purposes or to prove that a register-data transaction happened at a certain time? Or will an indexer do this? Will the indexer provide a proof that its data is correct, or do we expect people who want to be sure that an answer is correct to run an archival node? Note that for the same reason that P7 will be the new genesis for light nodes, it will be difficult to get compact proofs for queries about the chain before P7. (We consider one block per day to be a compact proof. Using SNARKs to get something even more compact is not currently being considered.)


That makes a lot of sense. I agree that indexers can act as history providers, even though they are off-chain solutions; the data can always be verified through one of the archival nodes. However, it may be necessary to run a certain number of archival nodes to ensure that the history is not lost if some of them go offline.

Integrating Light Node, Fast Catch-up, and --hotsync for a Unified Node Approach

Dear @chportma,

Thank you for your insightful response and for sharing the “Light Node & Fast Catch-up” proposal. I fully agree with the concept of light nodes to address node size and synchronization challenges. Building upon your ideas, I propose merging the light node concept into the --hotsync proposal to develop a unified node model. This would create a versatile node type that adapts to different network needs and operator resources.

Proposing a Unified Node Model with Self-Healing Mechanism

By combining the light node concept with the --hotsync proposal, we can create a unified node type that functions in two modes: light and full archival. Here’s how this model would work:

  1. Initial Synchronization as a Light Node: When a new node starts, it operates as a light node, synchronizing only the current state and minimal history. This allows the node to become operational within minutes, facilitating immediate participation in network activities.
  2. Parallel Full Sync Process: In the background, the node begins synchronizing the full blockchain data, allowing operators to choose whether to remain a light node or transition into a full archival node over time. This process runs without interrupting core functions.
  3. Dynamic Mode Switching with Self-Healing: Nodes can switch seamlessly between light and full modes based on configurable settings or resource availability. This self-healing mechanism makes nodes adaptable to network conditions and operator preferences, enhancing overall robustness. (A startup-flow sketch follows this list.)
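
As a rough sketch of how these three steps could fit together at startup (the module and function names anticipate the implementation plan later in this thread and are placeholders):

haskell

-- Illustrative startup flow for the unified node model.
module Concordium.Node.Startup where

import Control.Concurrent.Async (async, wait)
import Concordium.Node.LightSync (syncLightNode)                  -- sketched below
import Concordium.Node.FullSync (syncFullNodeAsync)               -- sketched below
import Concordium.Node.ModeSwitch (NodeMode (..), switchNodeMode) -- sketched below

startUnifiedNode :: Bool -> IO ()
startUnifiedNode hotsyncEnabled = do
    -- Step 1: come up as a light node within minutes.
    syncLightNode
    if hotsyncEnabled
        then do
            -- Step 2: fetch the full history in the background.
            fullSync <- async syncFullNodeAsync
            wait fullSync
            -- Step 3: switch to full archival mode once the history is in.
            switchNodeMode FullMode
        else
            -- Remain a light node; the mode can still be changed later.
            return ()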

Decentralized Data Assembly with Geographical Optimization

To further decentralize and optimize data retrieval, the node would sync data from multiple peers rather than relying on a single source:

  • Data Packages from Multiple Peers: Nodes can assemble data from multiple sources to balance network load and prevent centralization risks. Employing cryptographic verification methods ensures data integrity and guards against malicious data.
  • Geographical Region Settings: Nodes could prioritize syncing from nearby peers, reducing latency and balancing traffic geographically to improve performance.

Leveraging the Upcoming P7 Update

The P7 update’s changes to the block hash structure provide an excellent opportunity to implement these enhancements:

  • Fast Catch-up with Finalization Proofs: By downloading one block per day from P7 onwards, nodes can verify the chain’s state securely and quickly without downloading the entire history.
  • Simplified Synchronization: Using the P7 genesis block as the starting point for light nodes simplifies synchronization and improves overall network resilience.

Addressing Archival Nodes and Indexers

While the focus is on light nodes, archival nodes remain crucial for preserving the full blockchain history. @NHS raised an important point about the need for sufficient archival nodes to ensure history isn’t lost if several nodes go offline. Here’s how we factor this into the proposal:

  • Indexers as History Providers: Indexers can act as off-chain history providers, offering quick access to historical data. However, the data from indexers can always be verified against archival nodes, maintaining trust and accuracy.
  • Ensuring Sufficient Archival Nodes: We should encourage running multiple independent archival nodes across the network to maintain data redundancy and ensure history preservation. Entities interested in historical data, such as businesses, could be incentivized to operate archival nodes. Implementing a replication strategy where archival nodes mirror each other can further ensure the resilience of historical data.
  • Community Involvement in Archival Nodes: By emphasizing the importance of archival nodes for long-term data access (e.g., tax compliance, proof of historical transactions), we can encourage community members and organizations to contribute to the network by operating archival nodes.

Summary of the Unified Node Model

This unified node model combines the advantages of both light and full nodes, offering a flexible solution that adapts to the needs of the Concordium network:

  • Simplified Node Management: A single node type that can operate in light or full mode reduces complexity for operators.
  • Enhanced Network Participation: Lowering the entry barrier allows more participants to run nodes, strengthening the network’s decentralization and security.
  • Optimized Resource Utilization: Nodes can dynamically adjust based on available resources, ensuring efficient use of network and hardware capacity.
  • Decentralization and Redundancy: By obtaining data from multiple peers and running sufficient archival nodes, we enhance decentralization and ensure network resilience.

Conclusion

By merging the light node and --hotsync concepts, we can create a unified and flexible node model that improves user experience, network scalability, and decentralization. This proposal addresses immediate needs, such as reducing sync times and lowering node size, while maintaining the integrity of the network with archival nodes and finalization proofs. The self-healing mechanism allows nodes to start quickly as light nodes and transition seamlessly into full nodes, ensuring both accessibility and robustness.

We welcome any feedback or suggestions on this approach and are eager to collaborate further to refine these ideas and contribute to Concordium’s continued growth and resilience.

Thank you once again for your guidance and collaboration. We look forward to your response.

Proposal for Implementing the Unified Node Model with the --hotsync Feature in Concordium’s Haskell Codebase


Introduction

To implement the unified node model with the --hotsync feature in Concordium’s Haskell codebase, we outline a conceptual plan that addresses the following core areas:

  1. Initial Synchronization as a Light Node
  2. Parallel Full Synchronization Process
  3. Dynamic Mode Switching between Light and Full Modes
  4. Decentralized Data Retrieval with Geographical Optimization
  5. Data Integrity Verification and Error Handling
  6. Testing and Benchmarking

This plan emphasizes security, resource management, and robustness, ensuring that the node operates efficiently and securely while enhancing user experience and network participation.


1. Initial Synchronization as a Light Node

We begin by creating a light node that synchronizes only the minimal blockchain state and history to allow rapid startup. The light node will retrieve the minimal necessary data, such as the current ledger state, account balances, and chain parameters, and verify this data using finalization proofs starting from the P7 genesis block.

Code Integration:

  • Haskell Module: Concordium.Node.LightSync
  • Functions:
    • syncLightNode :: IO ()
    • downloadMinimalState :: IO MinimalState
    • verifyFinalizationProofs :: MinimalState -> IO Bool

haskell

module Concordium.Node.LightSync where

import Concordium.Blockchain (MinimalState) -- only the type; the functions below are defined locally
import Concordium.Logging (logInfo, logError)
import Control.Exception (catch)
import System.Exit (exitFailure)

-- Function to start light node synchronization
syncLightNode :: IO ()
syncLightNode = do
    logInfo "Starting light node synchronization..."
    minimalState <- downloadMinimalState `catch` handleDownloadError
    isValid <- verifyFinalizationProofs minimalState
    if isValid
        then do
            logInfo "Light node sync complete and state verified."
            -- Node is now operational as a light node
            startLightNodeServices minimalState
        else do
            logError "State verification failed. Exiting."
            exitFailure

-- Function to download minimal state
downloadMinimalState :: IO MinimalState
downloadMinimalState = do
    -- Implement network communication and peer selection
    -- Handle retries and timeouts
    undefined

-- Function to verify finalization proofs
verifyFinalizationProofs :: MinimalState -> IO Bool
verifyFinalizationProofs minimalState = do
    -- Implement cryptographic validation using finalization proofs
    undefined

-- Error handler for download errors
handleDownloadError :: IOError -> IO MinimalState
handleDownloadError e = do
    logError $ "Error downloading minimal state: " ++ show e
    -- Implement retry logic or exit
    undefined

-- Function to start light node services
startLightNodeServices :: MinimalState -> IO ()
startLightNodeServices minimalState = do
    -- Start necessary services for light node operation
    undefined

Details:

  • Modular Design: The synchronization process is divided into smaller functions for better maintainability.
  • Security: Cryptographic validation is implemented to prevent attacks.
  • Error Handling: Network failures and data corruption are gracefully handled with retries and error logging.
  • Logging: Progress indicators assist in monitoring synchronization status.
  • Configuration Options: Operators can specify parameters like maximum retries or preferred peers.

2. Parallel Full Synchronization Process

To allow for eventual transition into a full archival node, we implement an asynchronous process that fetches the full chain history in the background while the light node remains functional. This ensures the node’s performance for current operations isn’t degraded during the full sync.

Code Integration:

  • Haskell Module: Concordium.Node.FullSync
  • Functions:
    • syncFullNodeAsync :: IO ()
    • downloadFullHistory :: IO ()

haskell

module Concordium.Node.FullSync where

import Control.Concurrent.Async (Async, async, wait)
import Concordium.Logging (logInfo)
import System.Posix.Process (nice) -- from the unix package; System.Process does not export nice

-- Asynchronous function to sync full node in the background
syncFullNodeAsync :: IO ()
syncFullNodeAsync = do
    logInfo "Starting full node synchronization in background..."
    setProcessLowPriority
    task <- async downloadFullHistory
    monitorSyncProgress task
    wait task
    logInfo "Full node synchronization complete."

-- Function to download full history
downloadFullHistory :: IO ()
downloadFullHistory = do
    -- Implement data retrieval with validation
    undefined

-- Function to set lower process priority (POSIX-only)
setProcessLowPriority :: IO ()
setProcessLowPriority =
    -- Adjust process priority to prevent resource contention
    nice 10

-- Function to monitor sync progress
monitorSyncProgress :: Async () -> IO ()
monitorSyncProgress task = do
    -- Implement progress monitoring and logging
    undefined

Details:

  • Resource Management: Assigns lower priority to prevent resource contention.
  • Data Integrity: Validates data to prevent corrupted or malicious input.
  • Progress Monitoring: Tracks and reports sync progress.
  • Error Recovery: Handles interruptions gracefully with the ability to resume (a checkpointing sketch follows this list).
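
The ability to resume could be as simple as persisting the height of the last fully validated block and continuing from there on restart. A minimal sketch, assuming the sync loop calls writeCheckpoint after each validated batch (the file name and helpers are illustrative):

haskell

module Concordium.Node.SyncCheckpoint where

import Data.Maybe (fromMaybe)
import System.Directory (doesFileExist)
import Text.Read (readMaybe)

-- File in which the last synced block height is persisted (illustrative).
checkpointFile :: FilePath
checkpointFile = "full-sync.checkpoint"

-- Read the last synced block height, defaulting to 0 (the light-node genesis).
readCheckpoint :: IO Integer
readCheckpoint = do
    exists <- doesFileExist checkpointFile
    if exists
        then do
            contents <- readFile checkpointFile
            return (fromMaybe 0 (readMaybe contents))
        else return 0

-- Persist the height after each validated batch, so an interrupted sync
-- resumes where it left off instead of starting over.
writeCheckpoint :: Integer -> IO ()
writeCheckpoint height = writeFile checkpointFile (show height)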

3. Dynamic Mode Switching

To support seamless switching between light and full node modes, we introduce functions that dynamically manage the node’s mode based on operator preferences and resource availability.

Code Integration:

  • Haskell Module: Concordium.Node.ModeSwitch
  • Functions:
    • switchNodeMode :: NodeMode -> IO ()
    • checkFullSyncStatus :: IO Bool

haskell

module Concordium.Node.ModeSwitch where

import Concordium.Node.LightSync (syncLightNode)
import Concordium.Node.FullSync (syncFullNodeAsync)
import Concordium.Logging (logInfo, logError)

data NodeMode = LightMode | FullMode

-- Function to switch between light and full modes
switchNodeMode :: NodeMode -> IO ()
switchNodeMode mode = do
    case mode of
        LightMode -> do
            logInfo "Switching to light mode..."
            syncLightNode
        FullMode -> do
            logInfo "Switching to full mode..."
            isSynced <- checkFullSyncStatus
            if isSynced
                then do
                    logInfo "Transitioning to full mode."
                    startFullNodeServices
                else do
                    logError "Full sync not complete. Continuing in light mode."
                    syncFullNodeAsync

-- Function to check if full sync is complete
checkFullSyncStatus :: IO Bool
checkFullSyncStatus = do
    -- Verify that the full history is synced and valid
    undefined

-- Function to start full node services
startFullNodeServices :: IO ()
startFullNodeServices = do
    -- Start necessary services for full node operation
    undefined

Details:

  • Validation Before Switching: Ensures full sync completion and data integrity.
  • Fallback Mechanism: Continues in light mode if full sync isn’t complete.
  • User Control: Operators manage mode switching via settings or commands (see the --hotsync wiring sketch after this list).
  • Notifications: Logs mode changes and sync status.
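
For operator control, the --hotsync flag itself could be wired up with a standard command-line parser such as optparse-applicative. A sketch under the assumption that the flag simply selects the initial target mode:

haskell

module Main where

import Options.Applicative
import Concordium.Node.ModeSwitch (NodeMode (..), switchNodeMode)

-- Command-line options (illustrative; only the flag discussed here).
newtype NodeOptions = NodeOptions { optHotsync :: Bool }

nodeOptions :: Parser NodeOptions
nodeOptions = NodeOptions
    <$> switch
        ( long "hotsync"
       <> help "Start as a light node and sync the full history in the background" )

main :: IO ()
main = do
    opts <- execParser (info (nodeOptions <**> helper) fullDesc)
    -- With --hotsync the node aims to become a full archival node over time;
    -- otherwise it stays in light mode.
    switchNodeMode (if optHotsync opts then FullMode else LightMode)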

4. Decentralized Data Retrieval and Geographical Optimization

Nodes retrieve data from multiple peers, prioritizing those based on proximity, reliability, and performance metrics, to distribute network load and reduce latency.

Code Integration:

  • Haskell Module: Concordium.Node.DataRetrieval
  • Functions:
    • fetchDataFromPeers :: IO Data
    • discoverAndPrioritizePeers :: IO [Peer]
    • fetchDataFromMultiplePeers :: [Peer] -> IO [DataChunk]

haskell

module Concordium.Node.DataRetrieval where

import Concordium.Network (Peer, Data, DataChunk, getAvailablePeers) -- hypothetical module, assumed to export these types
import Concordium.Logging (logInfo, logError)
import Control.Concurrent.Async (mapConcurrently)

-- Function to fetch data from multiple peers
fetchDataFromPeers :: IO Data
fetchDataFromPeers = do
    logInfo "Fetching data from peers..."
    peers <- discoverAndPrioritizePeers
    dataChunks <- fetchDataFromMultiplePeers peers
    let combinedData = combineDataChunks dataChunks
    isValid <- verifyDataIntegrity combinedData
    if isValid
        then return combinedData
        else do
            logError "Data integrity verification failed."
            fail "data integrity verification failed"

-- Function to discover and prioritize peers
discoverAndPrioritizePeers :: IO [Peer]
discoverAndPrioritizePeers = do
    allPeers <- getAvailablePeers
    let prioritizedPeers = prioritizePeers allPeers
    return prioritizedPeers

-- Function to fetch data from multiple peers concurrently
fetchDataFromMultiplePeers :: [Peer] -> IO [DataChunk]
fetchDataFromMultiplePeers peers = mapConcurrently fetchDataFromPeer peers

-- Helper functions
fetchDataFromPeer :: Peer -> IO DataChunk
fetchDataFromPeer peer = undefined

combineDataChunks :: [DataChunk] -> Data
combineDataChunks chunks = undefined

verifyDataIntegrity :: Data -> IO Bool
verifyDataIntegrity payload = undefined -- argument renamed: 'data' is a reserved word in Haskell

prioritizePeers :: [Peer] -> [Peer]
prioritizePeers peers = undefined

Details:

  • Peer Discovery and Prioritization: Selects and ranks peers based on key metrics (a ranking sketch follows this list).
  • Data Integrity and Security: Authenticates and verifies data to prevent malicious inputs.
  • Dynamic Adjustments: Monitors and adjusts peer prioritization as needed.
  • Fallback Options: Utilizes distant peers or fallback servers if necessary.
  • Load Balancing: Distributes requests to avoid overloading peers.
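
The prioritizePeers stub above could start out as a simple sort on measured round-trip time, demoting unreliable peers. A sketch using a hypothetical PeerInfo record (the real Peer type would carry similar metrics):

haskell

module Concordium.Node.PeerRanking where

import Data.List (sortOn)

-- Hypothetical peer record with the metrics needed for prioritization.
data PeerInfo = PeerInfo
    { peerAddress   :: String
    , peerLatencyMs :: Int -- recent round-trip time, e.g. from pings
    , peerFailures  :: Int -- recent failed or invalid responses
    }

-- Prefer low-latency (typically nearby) peers, but penalize recent failures
-- so that close-but-unreliable peers do not dominate the selection.
prioritizePeerInfos :: [PeerInfo] -> [PeerInfo]
prioritizePeerInfos = sortOn score
  where
    score p = peerLatencyMs p + 1000 * peerFailures p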

5. Data Integrity Verification and Error Handling

We integrate checksum validation, cryptographic verification, and robust error handling to ensure data integrity and reliability during synchronization.

Code Integration:

  • Haskell Module: Concordium.Node.DataValidation
  • Functions:
    • validateDataWithChecksum :: Data -> IO Bool
    • validateAndRetrySync :: IO Data
    • handleValidationFailure :: Int -> IO Data

haskell

module Concordium.Node.DataValidation where

import Concordium.Node.DataRetrieval (fetchDataFromPeers)
import Concordium.Network (Data) -- hypothetical module, as above
import Concordium.Logging (logInfo, logError)
import Crypto.Hash.SHA256 (hashlazy) -- from the cryptohash-sha256 package
import qualified Data.ByteString as BS
import Control.Concurrent (threadDelay)
import System.Exit (exitFailure)

-- Function to validate data with a checksum (assumes Data is a lazy ByteString)
validateDataWithChecksum :: Data -> IO Bool
validateDataWithChecksum payload = do
    let computedHash = hashlazy payload
    return (computedHash == expectedHash)

-- Expected hash of the data, e.g. obtained from a verified finalization proof
expectedHash :: BS.ByteString
expectedHash = undefined

-- Retry parameters, at the top level so all functions below can use them
maxRetries :: Int
maxRetries = 5

baseDelay :: Int
baseDelay = 5 -- seconds

-- Function to validate data and retry on failure
validateAndRetrySync :: IO Data
validateAndRetrySync = handleValidationFailure maxRetries

-- Helper function to handle validation failures with retries
handleValidationFailure :: Int -> IO Data
handleValidationFailure retries
    | retries <= 0 = do
        logError "Failed to validate data after maximum retries."
        exitFailure
    | otherwise = do
        logInfo $ "Retrying data synchronization. Retries left: " ++ show retries
        payload <- fetchDataFromPeers
        isValid <- validateDataWithChecksum payload
        if isValid
            then do
                logInfo "Data validated successfully."
                return payload
            else do
                logError "Data validation failed. Retrying..."
                -- Exponential backoff: 5s, 10s, 20s, ... between attempts
                let delay = baseDelay * 2 ^ (maxRetries - retries)
                threadDelay (delay * 1000000)
                handleValidationFailure (retries - 1)

Details:

  • Cryptographic Validation: Uses SHA-256 for strong validation.
  • Retry Strategy: Implements exponential backoff to prevent resource exhaustion.
  • Maximum Retries: Sets sensible limits and alerts operators on failure.
  • Operator Notifications: Logs persistent validation failures.
  • Error Logging: Maintains detailed logs for debugging.

6. Testing and Benchmarking

Thorough testing and benchmarking ensure that the optimizations work efficiently without degrading performance. We implement comprehensive tests and collect performance metrics for ongoing optimization.

Code Integration:

  • Haskell Module: Concordium.Node.Benchmark
  • Functions:
    • benchmarkSyncProcess :: IO ()
    • collectPerformanceMetrics :: IO Metrics

haskell

module Concordium.Node.Benchmark where

import Concordium.Node.FullSync (syncFullNodeAsync)
import Concordium.Logging (logInfo)
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import Concordium.Metrics (Metrics, recordMetric) -- hypothetical module; exportMetrics is defined locally below

-- Function to benchmark synchronization process
benchmarkSyncProcess :: IO ()
benchmarkSyncProcess = do
    logInfo "Starting synchronization benchmark..."
    startTime <- getCurrentTime
    syncFullNodeAsync
    endTime <- getCurrentTime
    let duration = diffUTCTime endTime startTime
    logInfo $ "Full synchronization completed in " ++ show duration
    recordMetric "full_sync_duration" duration
    _metrics <- collectPerformanceMetrics
    exportMetrics

-- Function to collect performance metrics
collectPerformanceMetrics :: IO Metrics
collectPerformanceMetrics = do
    -- Collect metrics such as CPU usage, memory consumption, etc.
    undefined

-- Function to export metrics for visualization
exportMetrics :: IO ()
exportMetrics = do
    -- Export metrics for analysis
    undefined

Details:

  • Comprehensive Testing: Includes various scenarios and stress testing.
  • Metrics Collection: Gathers data on performance and resource utilization.
  • Integration into Development Pipeline: Benchmarks are part of ongoing development.
  • Visualization Tools: Uses graphs and charts for analysis.
  • Community Feedback: Shares results for collaborative improvement.
  • Optimization: Identifies bottlenecks for code enhancement.

Summary

The proposed implementation of the unified node model with the --hotsync feature addresses key challenges related to node size, synchronization time, and network decentralization. By integrating light node functionality, asynchronous full synchronization, dynamic mode switching, decentralized data retrieval, data integrity verification, and comprehensive testing, we enhance node performance, security, and user experience.

This implementation plan carefully considers resource management, error handling, and security. It ensures that nodes can rapidly become operational as light nodes, with the option to transition to full archival nodes over time, adapting to operator preferences and resource availability.

We propose proceeding by developing prototypes for each module, starting with the light node synchronization, and incrementally integrating and testing each component. Collaboration with the Concordium development team and community will be invaluable in refining the implementation and ensuring alignment with the project’s goals and standards.

It’s sensible to avoid having two different node types; it should be configurable. On a broader note, I believe it’s time for Concordium to invest in and involve the community as a development resource. Frankly, implementation of the changes needed to get things moving, on specific proposals and on the roadmap in general, is progressing too slowly.
