Computing the probability of generating conflicting random strings

At work today, I was faced with the problem of generating unique, random identifiers for wire transfers that would fit inside an ACH descriptor field. An ACH descriptor may contain at most 10 upper-case characters; for the random identifiers, we wished to restrict ourselves to letters only. However, with so few characters available in such a short string, the risk of generating conflicting IDs seemed significant, so I did some napkin math and quickly found that the chance of a collision reaches about 50% after generating on the order of ten million identifiers. Not really a long-term scalable solution. I was particularly curious about whether adding digits to the IDs would yield a quantifiable improvement in the collision probability.
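The napkin math is just the standard birthday-problem estimate: after \(k\) draws from \(n\) equally likely possibilities, the probability of at least one collision is approximately \(1 - e^{-k^2/(2n)}\). A minimal sketch of that back-of-the-envelope calculation:

```python
# Birthday-problem estimate: P(collision among k uniform draws from n)
# is approximately 1 - exp(-k^2 / (2n)).
import math

n = 26**10        # 10-character, upper-case-letters-only IDs
k = 10_000_000    # ten million identifiers

print(1 - math.exp(-k**2 / (2 * n)))   # ≈ 0.30
print(math.sqrt(2 * math.log(2) * n))  # k at which p hits 50%: ≈ 1.4e7
```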

After getting home, I did the math more rigorously on a notepad. First, preliminaries: there are \(n = 26^{10}\) possible strings we can generate. We consider a trial of length \(k\) to be a sequence of events where we select \(k - 1\) distinct IDs, followed by one duplicate ID, at which point the trial ends. Here is our more precisely formulated question: what is the expected length \(E[k]\) of a randomly generated trial? This will tell us the order of magnitude of the number of IDs we can generate before duplicates become a reality.
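Before doing anything rigorous, it is easy to sanity-check this setup with a small Monte Carlo simulation. Here is a minimal sketch on a scaled-down ID space (3 characters rather than 10, since simulating \(n = 26^{10}\) directly would take far too long):

```python
# Monte Carlo estimate of the mean trial length: draw uniformly random
# IDs until the first duplicate and record how many draws it took.
import random
import string

def trial_length(chars=3, alphabet=string.ascii_uppercase):
    seen = set()
    while True:
        s = "".join(random.choices(alphabet, k=chars))
        if s in seen:
            return len(seen) + 1  # the duplicate draw ends the trial
        seen.add(s)

lengths = [trial_length() for _ in range(10_000)]
print(sum(lengths) / len(lengths))  # ≈ 167 for n = 26**3
```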

The expected value is given by the formula \begin{equation} E[k] = \sum_{k=1}^\infty p_{n,k} k, \label{expectdef} \end{equation} where \(p_{n,k}\) is the probability of a trial of length \(k\) occurring. We can figure out \(p_{n,k}\) with some easy combinatorics. To generate a trial of length \(k\), we first have to select \(k - 1\) strings without duplicates. There are \(n\) ways to choose the first string, \(n - 1\) ways to choose the second string, and so on. The total number of ways to choose the first \(k - 1\) strings is then \[n (n - 1) (n - 2) \cdots (n - k + 2).\] (Notice that there are \(k - 1\) terms in this product.) The last string can be any of the first \(k - 1\) strings, so the overall number of outcomes for a trial of length \(k\) is \[n (n - 1) \cdots (n - k + 2) (k - 1).\] The total number of ways to select \(k\) strings, with or without duplicates, is \(n^k\), so the probability of a trial of length \(k\) occurring is \begin{equation} p_{n,k} = \frac{n (n - 1) \cdots (n - k + 2) (k - 1)}{n^k}. \label{prob} \end{equation} I'm glossing over some nuance in precisely defining the state space and probability function, but the formula derived by this procedure is correct.
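As a quick sanity check on \eqref{prob}: by pigeonhole, a trial must end by draw \(n + 1\) at the latest, so the probabilities should sum to exactly 1 over \(k = 2, \dots, n + 1\). Exact rational arithmetic confirms this for a small \(n\):

```python
# Check that p_{n,k} = n (n-1) ... (n-k+2) (k-1) / n^k is a probability
# distribution over k = 2, ..., n+1, using exact rational arithmetic.
from fractions import Fraction

def p(n, k):
    num = Fraction(k - 1)
    for i in range(k - 1):  # k - 1 factors: n, n-1, ..., n-k+2
        num *= n - i
    return num / Fraction(n)**k

n = 20
print(sum(p(n, k) for k in range(2, n + 2)))  # prints 1
```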

Inserting \eqref{prob} into \eqref{expectdef} yields \begin{equation} E[k] = \sum_{k=1}^\infty k (k - 1) \frac{n^{(k - 1)}}{n^k}, \label{expectbase} \end{equation} where the falling Pochhammer symbol \(n^{(k)}\) is defined by \[n^{(k)} \equiv n (n - 1) \cdots (n - k + 1).\] (The convention used here is the reverse of Wikipedia's.) Defining the rising Pochhammer symbol \[(n)_k \equiv n (n + 1) \cdots (n + k - 1)\] and using the identity \[n^{(k)} = (-1)^k (-n)_k,\] we rewrite \eqref{expectbase} as \begin{equation} E[k] = \sum_{k=1}^\infty -k (k - 1) (-n)_{k-1} \left(-\frac{1}{n}\right)^{k}. \label{expecthalfrising} \end{equation} We can further make use of the calculation \[k (k - 1) = \frac{k!}{(k - 2)!} = \frac{(1)_k}{(1)_{k-2}}\] to rewrite \eqref{expecthalfrising} as \begin{equation} E[k] = \sum_{k=1}^\infty -\frac{(1)_k (-n)_{k-1}}{(1)_{k-2}} \left(-\frac{1}{n}\right)^{k}. \label{expectrising} \end{equation} Finally, noting that the \(k = 1\) term vanishes, we pull the leading factors out of the Pochhammer symbols and reindex the sum to get the final form of our solution: \begin{align} \begin{split} E[k] &= \sum_{k=1}^\infty -\frac{(1)_k (-n)_{k-1}}{(1)_{k-2}} \left(-\frac{1}{n}\right)^{k} \\ &= \frac{2}{n} \sum_{k=2}^\infty \frac{(3)_{k-2} (-n + 1)_{k-2}}{(1)_{k-2}} \left(-\frac{1}{n}\right)^{k-2} \\ &= \frac{2}{n} \sum_{k=0}^\infty \frac{(3)_k (-n + 1)_k}{k!} \left(-\frac{1}{n}\right)^k. \end{split} \label{expectseries} \end{align}
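These manipulations are easy to get wrong, so here is an exact check of the final series in \eqref{expectseries} against the defining sum \eqref{expectdef} for a small \(n\). Note that the series actually terminates: \((-n + 1)_k = 0\) as soon as \(k \geq n\).

```python
# Exact check that (2/n) * sum_k (3)_k (-n+1)_k / (1)_k * (-1/n)^k
# reproduces E[k] = sum_k k * p_{n,k}, using rational arithmetic.
from fractions import Fraction

def rising(a, k):
    # rising Pochhammer symbol (a)_k = a (a+1) ... (a+k-1)
    out = Fraction(1)
    for i in range(k):
        out *= a + i
    return out

def p(n, k):
    # p_{n,k} as in the previous check
    num = Fraction(k - 1)
    for i in range(k - 1):
        num *= n - i
    return num / Fraction(n)**k

n = 20
direct = sum(k * p(n, k) for k in range(2, n + 2))
series = Fraction(2, n) * sum(
    rising(3, k) * rising(-n + 1, k) / rising(1, k) * Fraction(-1, n)**k
    for k in range(n)  # terms with k >= n vanish
)
print(direct == series)  # True
```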

The solution takes the form of a hypergeometric series and can be written more compactly as \[ E[k] = \frac{2}{n}\thinspace{}_2 F_0\left(3, -n + 1;; -\frac{1}{n}\right). \label{expecthypergeom} \] It can also be written in terms of the confluent hypergeometric function \(U(a, b, z)\) using a simple identity: \[ E[k] = 2 n^2\thinspace U(3, n + 3, n). \label{expectconfluent} \] A very similar calculation gives an expression for \(E[k^2]\): \begin{equation} E[k^2] = \frac{4}{n}\thinspace{}_3 F_1\left(3, 3, -n + 1; 2; -\frac{1}{n}\right). \label{expectk2} \end{equation} Combining \eqref{expectseries} with \eqref{expectk2} gives an expression for the standard deviation: \[ \sigma = \sqrt{E[k^2] - E[k]^2} = \frac{2}{n} \sqrt{n\thinspace{}_3 F_1\left(3, 3, -n + 1; 2; -\frac{1}{n}\right) - \thinspace{}_2 F_0\left(3, -n + 1;; -\frac{1}{n}\right)^2}. \] The type of calculation done in this blog post is extremely routine, but I wanted to work through it in detail as an example of how to recognize a hypergeometric series and write it in standard form. Many people are not familiar with hypergeometric functions despite their ubiquity in probability and differential equations.
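For small \(n\), both closed forms are easy to evaluate with `mpmath`, and checking that they agree is a nice test of the \({}_2 F_0\)-to-\(U\) identity. (The \({}_2 F_0\) series terminates because \(-n + 1\) is a negative integer, so it can be summed directly.)

```python
# Check E[k] = (2/n) 2F0(3, -n+1;; -1/n) = 2 n^2 U(3, n+3, n) at n = 20.
import mpmath

n = 20
lhs = 2 / mpmath.mpf(n) * mpmath.hyper([3, -n + 1], [], mpmath.mpf(-1) / n)
rhs = 2 * mpmath.mpf(n)**2 * mpmath.hyperu(3, n + 3, n)
print(lhs, rhs)  # both ≈ 6.29
```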

To close out, I want to share the exact numerical answer to my problem. I used the `mpmath` Python package to calculate the mean and standard deviation for \(n = 26^{10}\) (letters only) and \(n = 36^{10}\) (letters and numbers).
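The mean is the easiest quantity to reproduce. One numerically tame route, even for enormous \(n\), is the standard integral representation \(U(a, b, z) = \frac{1}{\Gamma(a)} \int_0^\infty e^{-zt}\, t^{a - 1} (1 + t)^{b - a - 1}\, dt\); here is a minimal sketch of one way to do it:

```python
# E[k] = 2 n^2 U(3, n+3, n) = n^2 * int_0^inf e^(-n t) t^2 (1+t)^(n-1) dt.
import mpmath

mpmath.mp.dps = 30

def mean_trial_length(n):
    n = mpmath.mpf(n)
    # Write the integrand as a single exp(...) to avoid overflow; it is
    # sharply peaked near t ~ sqrt(2/n), so split the interval there.
    f = lambda t: t**2 * mpmath.exp(-n * t + (n - 1) * mpmath.log(1 + t))
    peak = mpmath.sqrt(2 / n)
    return n**2 * mpmath.quad(f, [0, peak, 50 * peak, mpmath.inf])

print(mean_trial_length(26**10))  # ≈ 1.489e7
```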

| \(n\) | \(E[k]\) | \(\sqrt{E[k^2] - E[k]^2}\) |
| --- | --- | --- |
| \(26^{10}\) | \(14,891,097\) | \(7,783,921\) |
| \(36^{10}\) | \(42,871,006\) | \(32,625,868\) |

Evidently, using digits as well as letters does extend our mileage somewhat, but it doesn't change the picture fundamentally: the expected trial length still grows only like \(\sqrt{n}\).
