Sybil-Resilient Identifier
Proof of Uniqueness
There are many situations where it is valuable to be able to identify a unique individual— elections, token distributions, event tickets—as a Sybil attack would allow a malicious party to gain an unfair advantage. In this write-up, we are going to assess different solutions for Sybil resistant identifiers.
First, let us talk about the model. We assume there are trusted identity providers (IDP) that issue identity credentials to users. The users want to access services for which they have to produce an identifier. The services want that these identifiers are Sybil-resistant, i.e., it should be hard for a user to produce two different identifiers for the same service. On the other hand, users would like privacy. This could mean that the identifier does not reveal the user’s identity or that identifiers are not linkable over different services. As we will see below, it is easy to achieve one of those properties, but getting Sybil-resistance and strong privacy is hard.
Reveal Attributes
We start with the simplest solution. When a user wants to sign-up with a service, they need to reveal an identity_string
, e.g., consisting of[1] first_name
, last_name
and birthdate
, from their identity credential. To prevent a malicious user from cheating, the identity string is accompanied by a zero-knowledge proof showing correctness of the revealed information. If this is the first time the service sees identity_string
, they provide the user with a random identifier uid
(signed by the service), and store uid,identity_string
in their database.
Zero-knowledge proof. Technically, the above zero-knowledge proof shows the statement “I know a signature from the identity provider on the revealed attributes”. Using a suitable signature scheme this can be done with simple Sigma protocols.
Sybil resistance. We first note that the user cannot reveal an incorrect identity string. The zero-knowledge proof ensures that it corresponds to the information in the user’s ID credential and the model assumes that this information is correct (trusted IDPs). The Sybil-resistance now depends on the required identity string. If it is too specific, e.g., contains the passport number, a user might be able to sign up twice (e.g., as they have two passports) or not be able to sign up at all (as they might not have a passport). If it is too loose, e.g., it consists of last_name
only, some users might not be able to sign up as there will be a collision with an existing user’s identity string.
Privacy. The service learns all information contained in the identity string. The user will have to trust the service provider to process and store the information in a diligent manner.
Identity string. As seen above, selecting a good identity string is crucial for Sybil resistance. Here, we discuss the advantages and disadvantages of different options.
- First name, last name, birth date: The main advantage of this option is that it does not depend on nationality (a user might be dual-citizen) or the identity document used to register at the identity provider (a user might have a passport and a driver’s license).
On the flip side, this can increase the likelihood of collisions. For example, with high probability there will be several James Smiths born on September 9th 1980. Transliteration can also cause issues. For example, someone’s name might be spelled Чебышёв or Chebyshev depending on the country issuing an identity document. Even if one insists on using a romanized name this issue persists (e.g. Chebyshev, Tchebychev, Chebyshov, etc.).
Finally, with the components used in this option one easily identifies the real-world person behind the user. - Nationality, ID-document number: The combination of nationality and identity document number should uniquely identify the document the user showed at the identity provider. In that sense, the option provides a good Sybil-resistance.
On the flip side, a user with different identity documents, potentially even from different countries, can easily bypass the Sybil protection. In reality, this might not be as bad, as most people will not have more than 5 such documents (e.g. two passports, one ID card, a driver’s license, and a residence permit).
Privacy-wise, it is harder to identify a user given the information in this option (the ID number is normally not easy to find on the web). However, if the service already has more information on the user (e.g. their name), this option now carries the risk that an (inadvertent) leak might connect the ID number with other attributes of the user.
Reveal Hash of Attributes
We can do better. Let H
be your favorite hash function (e.g. SHA-256). Instead of revealing the identity string to the service, users instead reveal h = H(identity string)
and prove in zero-knowledge that h
has been computed correctly from their identity credential. Depending on the used hash function H
this will require heavier zkSNARK (zero-knowledge succinct argument of knowledge) technology. Ideally, H
is chosen to be a SNARK friendly hash (e.g., Pedersen hash). The rest of the system is kept unchanged.
Zero-knowledge proof. Technically, the zero-knowledge proof shows the statement “I know a signature from the identity provider on attributes of identity string
and h = H(identity string)
”.
Sybil resistance. Exactly the same as in the first solution, i.e., depending on the chosen string, there will either be false positives or false negatives (or both).
Privacy. The service only learns the hash h
of the identity string. While this is better than the first solution, it still has multiple vulnerabilities. If the chosen string is short enough, a brute force attack that tries all combinations can reveal all user data—as in the example in the previous section, where the identity string is composed of the nationality and ID document number. If the string is too long for that—e.g., first name, last name, date of birth—knowing partial user data still allows the brute force attack to recover the rest. It also allows tracking users. For example, two collaborating services can check if the same user signed-up with both of them as the user would reveal the same hash to them. Another issue is the low entropy of the hashed values. If a service knows the identity string attributes of a specific person (which is not too hard to find out), they can check if that person has signed-up with them. The hash h
must therefore not be made public, which would prevent third-parties from carrying out these attacks, but still allows the service providers to learn the users private data.
We remark that due to the hash revealing so much information about the user data, this hash should never be stored on-chain, e.g., this solution cannot be used in a smart contract.
Context Dependent Identifier
The privacy shortcomings of the above solutions is primarily due to the fact that the revealed identifier is globally unique and can be deterministically computed from the identity attributes of the user. We do better by using more cryptography.
Pseudo-random function family (PRF). A PRF takes as input a key k
and a message m
and outputs value r.
Roughly, the guarantee is that for any message m
the output r
appears random if the key k
was chosen randomly. In particular, if the random key k
is unknown, it is hard (for any message m
) to distinguish the output r = PRF(k,m)
from a uniform random value. There exist zero-knowledge friendly PRFs that allow for efficient proofs of correctness (of the PRF computation).
The overall protocol is as follows. The identity provider adds a special uniqueness key to the identity credential. Users can then use this key to generate a context dependent identifier whenever they sign up for a service.
ID credential and uniqueness key. We assume that each identity provider has a PRF key key_IDP
. Whenever they issue an identity credential to a user, they also add the uniqueness key key_user = PRF(key_IDP, identity_string)
to the credential where the identity_string
is the same as in the above. The user needs to keep this key secret.
Context dependent identifier. When a user signs-up to a service, they reveal uid = PRF(key_user, context)
, where context
is a string describing the service context (e.g., “MyFancyAirdrop204”). In addition the user provides a zero-knowledge proof that uid
was computed correctly.
Zero-knowledge proof. Technically, the zero-knowledge proof shows the statement “I know a signature from the identity provider on key_user
and uid = PRF(key_user, context)
. Note that both the signature and key_user
are part of the witness and thus secret.
For a suitable signature scheme and PRF this can be done using Sigma protocols.
Sybil resistance. The solution has the same limitations regarding the identity_string
as the above solutions. In addition, it only provides guarantees for credentials from the same identity provider.
Observe that for a fixed key_user
the user’s identifier for a given context
is fixed. Furthermore, the zero-knowledge proof ensures that the presented identifier has been computed from the key_user
signed by the identity provider. This reduces the Sybil resistance discussion to the uniqueness of key_user
.
For a given key_IDP
and fixed identity_string
the uniqueness key key_user
is fixed. Thus, for a given identity provider the Sybil resistance reduces to the uniqueness of the identity_string
. So as in the above solutions, depending on the chosen string, there will either be false positives or false negatives (or both). For example, if the identity string contains the passport number, a user with two passports will be able to get two different key_user
which allows them to break Sybil resistance. On the other hand, if the identity string is just the first name, then two users called Alice will both get the same uniqueness key.
Finally, for two different identity providers (which have different key_IDP
) a user will get (with overwhelming probability) different uniqueness keys even if they sign-up with the same identity string.
Privacy. In a nutshell, the use of a PRF ensures that the revealed identifier uid
appears random, preventing user tracking. Given two identifiers from different contexts, it is infeasible to test whether they are from the same user or not. Furthermore, even if one knows the identity_string
of a user, one cannot compute uid
without knowing key_user
. This prevents the brute-force attacks described above.
Note that it is safe to use these identifiers on-chain. For example, a user can publish their uid
in a smart contract to participate in an airdrop.
Legacy credentials. This solution requires a new type of attribute in identity credentials. To facilitate the case where a large user base already has “normal” identity credentials one can instead issue an addon credential that provides the pseudonymity key. If this issuer for this addon credential is unique, using only this issuer instead of the IDPs it also solves the problem of different IDPs providing different uid
s to the same user.
One needs to take care when crafting the identity string from variable length strings. Simple concatenation can cause issues as e.g. AA||B = A||AB. So instead one could concatenate the hashes of the different parts, i.e.,
identity_string = H(str_1)||..||H(str_n)
. ↩︎