Is a random set of bytes a UTF-8 string?

UTF-8 is an encoding for Unicode text with several nice properties:

  • Any Unicode string can be encoded
  • For pure ASCII text (characters under 128), the UTF-8 encoded form is the same as the ASCII form
  • It’s pretty well universally recognised as a text-interchange format on the Internet
  • Random byte streams are unlikely to be misinterpreted as UTF-8

It’s this last property I want to talk about here.

Let p(n) be the probability that a given byte stream can be validly interpreted as UTF-8. For instance, p(1) = 1/2, because the only valid UTF-8 byte streams with only one byte are the ones with the most significant bit not set. I want to compute p(16), the probability that a random 16 byte stream can be interpreted as UTF-8.

Method 1: Monte-Carlo

Java 5 introduces new classes (Charset and CharsetDecoder) that allow us to see whether a given byte stream is valid for a given character set. The technique I’m using is:

  1. Compute a number of byte streams at random
  2. Use CharsetDecoder to see whether they validate
  3. Estimate the probability by dividing the number of valid strings by the size of the sample

When I ran below, I got an estimate of 1.025e-04 for p(16). The code is here:


import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Random;

public class UTF8MonteCarlo
{
 // Size of sample
 private static final int MAX = 10000000;

 public static void main(String[] args)
 {
  double probEstimate = monteCarloCharSetProbability(16, MAX, "UTF-8");
  System.out.printf("%e%n", probEstimate);
 }

 private static double monteCarloCharSetProbability(int byteStreamSize, int sampleSize, String charSetName)
 {
  int validStrings = 0;
  Charset charset = Charset.forName(charSetName);
  CharsetDecoder decoder = charset.newDecoder();
  decoder.onMalformedInput(CodingErrorAction.REPORT);
  byte[] b = new byte[byteStreamSize];
  CharBuffer charBuffer = CharBuffer.allocate(byteStreamSize);
  Random random = new Random();
  for (int k = 0; k < MAX; k++) {
   random.nextBytes(b);
   CoderResult result = decoder.decode(ByteBuffer.wrap(b), charBuffer, true);
   charBuffer.clear();
   if (!result.isMalformed() && !result.isUnmappable()) {
    validStrings++;
   }
  }
  return (double)validStrings / sampleSize;

 }
}

Method 2: Recurrence relation

Using Wikipedia’s description of the encoding, we can derive a recurrence relation for p(n):

p(n) = p(n-1) / 2 + p(n-2) / 32 + p(n-3)/256 + p(n-4) / 2048

where we can define p(0) = 1, p(-1) = 0, p(-2) = 0, p(-3) = 0.

Using this formula, we can compute a value of 1.063853e-04 for p(16). The code is here:


public class UTF8Recurrence
{
 private final static int MAX_SIZE = 16;
 public static void main(String[] args)
 {
  double p[] = new double[MAX_SIZE + 4];
  p[0] = 0d;
  p[1] = 0d;
  p[2] = 0d;
  p[3] = 1d;
  for (int i = 1; i <= MAX_SIZE; i++) {
   p[i + 3] = p[i + 2] / 2 + p[i + 1] / 32 + p[i] / 256 + p[i - 1] / 2048;
   System.out.printf("%e%n", p[i + 3]);
  }
 }
}

The probabilities for larger values are shown here:

n p
16 1.063853e-04
32 1.304427e-08
64 1.961083e-16
128 4.432493e-32
Leave a comment

2 Comments

  1. Thanks for this! I’ve been curious what this probability was (and how to work it out) and find it interesting that it is as low as 0.01% for a random sequence of just 16 bytes. I recently had to write some code to process multi-lingual csv files which involved figuring out the encoding from the first ‘few’ bytes of the file – hence my curiosity.

    You should add this information to the wikipedia page on utf-8 (http://en.wikipedia.org/wiki/UTF-8). It currently only contains an estimate of the probability of a random set of NON-ASCII bytes being valid UTF-8, which isn’t very informative.

Leave a Reply

Your email address will not be published. Required fields are marked *