UTF-8 is an encoding for Unicode text with several nice properties:
- Any Unicode string can be encoded
- For pure ASCII text (characters under 128), the UTF-8 encoded form is the same as the ASCII form
- It’s pretty well universally recognised as a text-interchange format on the Internet
- Random byte streams are unlikely to be misinterpreted as UTF-8
It’s this last property I want to talk about here.
Let p(n) be the probability that a random n-byte stream can be validly interpreted as UTF-8. For instance, p(1) = 1/2, because the only valid one-byte UTF-8 streams are those whose most significant bit is not set (128 of the 256 possible bytes). I want to compute p(16), the probability that a random 16-byte stream can be interpreted as UTF-8.
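For small n we can get exact values by brute force instead of sampling, using the same CharsetDecoder machinery introduced below. This is a minimal sketch, assuming a strict decoder: it enumerates every 1- and 2-byte stream and should report 128/256 = 1/2 for p(1), and, on a decoder that rejects overlong forms such as C0 80, 18304/65536 ≈ 0.2793 for p(2).

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UTF8BruteForce
{
    public static void main(String[] args)
    {
        // Count how many 1-byte and 2-byte streams decode as valid UTF-8
        int valid1 = 0, valid2 = 0;
        for (int a = 0; a < 256; a++) {
            if (decodes(new byte[] { (byte) a })) {
                valid1++;
            }
            for (int b = 0; b < 256; b++) {
                if (decodes(new byte[] { (byte) a, (byte) b })) {
                    valid2++;
                }
            }
        }
        System.out.println("p(1) = " + valid1 + "/256");
        System.out.println("p(2) = " + valid2 + "/65536");
    }

    private static boolean decodes(byte[] bytes)
    {
        // REPORT makes the decoder signal errors instead of substituting replacement chars
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on invalid input
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}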
Method 1: Monte Carlo
Java's java.nio.charset package (introduced in Java 1.4) provides classes (Charset and CharsetDecoder) that let us check whether a given byte stream is valid for a given character set. The technique I’m using is:
- Generate a number of byte streams at random
- Use a CharsetDecoder to see whether each one decodes validly
- Estimate the probability as the number of valid streams divided by the sample size
When I ran the code below, I got an estimate of 1.025e-04 for p(16). The code is here:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Random;

public class UTF8MonteCarlo
{
    // Size of sample
    private static final int MAX = 10000000;

    public static void main(String[] args)
    {
        double probEstimate = monteCarloCharSetProbability(16, MAX, "UTF-8");
        System.out.printf("%e%n", probEstimate);
    }

    private static double monteCarloCharSetProbability(int byteStreamSize, int sampleSize, String charSetName)
    {
        int validStreams = 0;
        Charset charset = Charset.forName(charSetName);
        CharsetDecoder decoder = charset.newDecoder();
        // Report errors rather than silently substituting replacement characters
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        byte[] b = new byte[byteStreamSize];
        // UTF-8 decodes to at most one char per byte, so this buffer cannot overflow
        CharBuffer charBuffer = CharBuffer.allocate(byteStreamSize);
        Random random = new Random();
        for (int k = 0; k < sampleSize; k++) {
            random.nextBytes(b);
            // Reset so the decoder can be cleanly reused for the next stream
            decoder.reset();
            CoderResult result = decoder.decode(ByteBuffer.wrap(b), charBuffer, true);
            charBuffer.clear();
            if (!result.isMalformed() && !result.isUnmappable()) {
                validStreams++;
            }
        }
        return (double) validStreams / sampleSize;
    }
}
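As a quick sanity check on the precision: with N = 10^7 samples and p ≈ 1e-04, the standard error of the estimate is sqrt(p(1-p)/N) ≈ 3.2e-06, i.e. about 3% relative error, so only the first couple of significant digits of 1.025e-04 mean much.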
Method 2: Recurrence relation
Using Wikipedia’s description of the encoding, we can derive a recurrence relation for p(n) by conditioning on the length of the first character: a valid stream must begin with a complete 1-, 2-, 3- or 4-byte sequence, followed by a valid remainder. Counting bit patterns, a 1-byte character matches 128 of the 256 possible first bytes, a 2-byte character (110xxxxx 10xxxxxx) matches 32·64 of the 256^2 possible two-byte prefixes, and so on, giving:
p(n) = p(n-1)/2 + p(n-2)/32 + p(n-3)/256 + p(n-4)/2048
where we define p(0) = 1 and p(-1) = p(-2) = p(-3) = 0.
Using this formula, we can compute a value of 1.063853e-04 for p(16), a little above the Monte Carlo estimate. Part of the gap is sampling noise, and part is that the raw bit-pattern count admits sequences a decoder may reject, such as overlong forms (e.g. C0 80), encoded UTF-16 surrogates, and code points above U+10FFFF, so the recurrence slightly overestimates. The code is here:
public class UTF8Recurrence
{
    private final static int MAX_SIZE = 16;

    public static void main(String[] args)
    {
        // p[i + 3] holds p(i), so p[0..2] are p(-3)..p(-1) = 0 and p[3] is p(0) = 1
        double p[] = new double[MAX_SIZE + 4];
        p[0] = 0d;
        p[1] = 0d;
        p[2] = 0d;
        p[3] = 1d;
        for (int i = 1; i <= MAX_SIZE; i++) {
            p[i + 3] = p[i + 2] / 2 + p[i + 1] / 32 + p[i] / 256 + p[i - 1] / 2048;
            System.out.printf("%e%n", p[i + 3]);
        }
    }
}
The probabilities for larger values of n are shown here:

| n | p(n) |
|---|---|
| 16 | 1.063853e-04 |
| 32 | 1.304427e-08 |
| 64 | 1.961083e-16 |
| 128 | 4.432493e-32 |
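The recurrence can also be tightened to count only the sequences a strict decoder accepts. Per RFC 3629 there are exactly 128 one-byte, 1920 two-byte, 61440 three-byte and 1048576 four-byte characters once overlong forms, surrogates and code points above U+10FFFF are excluded. Here's a sketch of the same recurrence with those exact weights; treat it as illustrative rather than verified against any particular decoder:

public class UTF8StrictRecurrence
{
    private final static int MAX_SIZE = 16;

    public static void main(String[] args)
    {
        // Exact counts of valid k-byte UTF-8 characters (RFC 3629),
        // each divided by 256^k to get the probability of a random k-byte prefix
        double c1 = 128d / 256d;
        double c2 = 1920d / (256d * 256d);
        double c3 = 61440d / (256d * 256d * 256d);
        double c4 = 1048576d / (256d * 256d * 256d * 256d);
        // p[i + 3] holds p(i), as in UTF8Recurrence above
        double[] p = new double[MAX_SIZE + 4];
        p[3] = 1d; // p(0) = 1; p(-1), p(-2), p(-3) default to 0
        for (int i = 1; i <= MAX_SIZE; i++) {
            p[i + 3] = c1 * p[i + 2] + c2 * p[i + 1] + c3 * p[i] + c4 * p[i - 1];
            System.out.printf("p(%d) = %e%n", i, p[i + 3]);
        }
    }
}

Whether this or the looser recurrence better matches the Monte Carlo figure depends on how strict the JDK's UTF-8 decoder actually is; older JDKs were lenient about some of these sequences.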
Andrew
/ September 30, 2011
Thanks for this! I’ve been curious what this probability was (and how to work it out), and I find it interesting that it is as low as 0.01% for a random sequence of just 16 bytes. I recently had to write some code to process multilingual CSV files, which involved figuring out the encoding from the first ‘few’ bytes of the file – hence my curiosity.
You should add this information to the Wikipedia page on UTF-8 (http://en.wikipedia.org/wiki/UTF-8). It currently only contains an estimate of the probability of a random set of NON-ASCII bytes being valid UTF-8, which isn’t very informative.
Random
/ April 2, 2015
Someone gave a mathematical answer:
http://math.stackexchange.com/questions/750551/probability-that-random-byte-array-is-a-valid-utf-8-string