UTF-8 is an encoding for Unicode text with several nice properties:
- Any Unicode string can be encoded
- For pure ASCII text (characters under 128), the UTF-8 encoded form is the same as the ASCII form
- It’s pretty well universally recognised as a text-interchange format on the Internet
- Random byte streams are unlikely to be misinterpreted as UTF-8
It’s this last property I want to talk about here.
Let p(n) be the probability that a random n-byte stream can be validly interpreted as UTF-8. For instance, p(1) = 1/2, because the only valid one-byte UTF-8 streams are those whose most significant bit is not set (128 of the 256 possible bytes). I want to compute p(16), the probability that a random 16-byte stream can be interpreted as UTF-8.
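For small n we can get exact values by brute force instead of sampling, using the same CharsetDecoder machinery introduced below. This is a minimal sketch, assuming a strict decoder: it enumerates every 1- and 2-byte stream and should report 128/256 = 1/2 for p(1), and, on a decoder that rejects overlong forms such as C0 80, 18304/65536 ≈ 0.2793 for p(2).

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UTF8BruteForce
{
    public static void main(String[] args)
    {
        // Count how many 1-byte and 2-byte streams decode as valid UTF-8
        int valid1 = 0, valid2 = 0;
        for (int a = 0; a < 256; a++) {
            if (decodes(new byte[] { (byte) a })) {
                valid1++;
            }
            for (int b = 0; b < 256; b++) {
                if (decodes(new byte[] { (byte) a, (byte) b })) {
                    valid2++;
                }
            }
        }
        System.out.println("p(1) = " + valid1 + "/256");
        System.out.println("p(2) = " + valid2 + "/65536");
    }

    private static boolean decodes(byte[] bytes)
    {
        // REPORT makes the decoder signal errors instead of substituting replacement chars
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on invalid input
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}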
Method 1: Monte Carlo
Java's java.nio.charset package (introduced in Java 1.4) provides classes (Charset and CharsetDecoder) that let us check whether a given byte stream is valid for a given character set. The technique I’m using is:
- Generate a number of byte streams at random
- Use a CharsetDecoder to see whether each one decodes validly
- Estimate the probability as the number of valid streams divided by the sample size
When I ran the code below, I got an estimate of 1.025e-04 for p(16). The code is here:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Random;

public class UTF8MonteCarlo
{
    // Size of sample
    private static final int MAX = 10000000;

    public static void main(String[] args)
    {
        double probEstimate = monteCarloCharSetProbability(16, MAX, "UTF-8");
        System.out.printf("%e%n", probEstimate);
    }

    private static double monteCarloCharSetProbability(int byteStreamSize, int sampleSize, String charSetName)
    {
        int validStreams = 0;
        Charset charset = Charset.forName(charSetName);
        CharsetDecoder decoder = charset.newDecoder();
        // Report errors rather than silently substituting replacement characters
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        byte[] b = new byte[byteStreamSize];
        // UTF-8 decodes to at most one char per byte, so this buffer cannot overflow
        CharBuffer charBuffer = CharBuffer.allocate(byteStreamSize);
        Random random = new Random();
        for (int k = 0; k < sampleSize; k++) {
            random.nextBytes(b);
            // Reset so the decoder can be cleanly reused for the next stream
            decoder.reset();
            CoderResult result = decoder.decode(ByteBuffer.wrap(b), charBuffer, true);
            charBuffer.clear();
            if (!result.isMalformed() && !result.isUnmappable()) {
                validStreams++;
            }
        }
        return (double) validStreams / sampleSize;
    }
}
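As a quick sanity check on the precision: with N = 10^7 samples and p ≈ 1e-04, the standard error of the estimate is sqrt(p(1-p)/N) ≈ 3.2e-06, i.e. about 3% relative error, so only the first couple of significant digits of 1.025e-04 mean much.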
Method 2: Recurrence relation
Using Wikipedia’s description of the encoding, we can derive a recurrence relation for p(n) by conditioning on the length of the first character: a valid stream must begin with a complete 1-, 2-, 3- or 4-byte sequence, followed by a valid remainder. Counting bit patterns, a 1-byte character matches 128 of the 256 possible first bytes, a 2-byte character (110xxxxx 10xxxxxx) matches 32·64 of the 256^2 possible two-byte prefixes, and so on, giving:
p(n) = p(n-1)/2 + p(n-2)/32 + p(n-3)/256 + p(n-4)/2048
where we define p(0) = 1 and p(-1) = p(-2) = p(-3) = 0.
Using this formula, we can compute a value of 1.063853e-04 for p(16), a little above the Monte Carlo estimate. Part of the gap is sampling noise, and part is that the raw bit-pattern count admits sequences a decoder may reject, such as overlong forms (e.g. C0 80), encoded UTF-16 surrogates, and code points above U+10FFFF, so the recurrence slightly overestimates. The code is here:
public class UTF8Recurrence
{
    private final static int MAX_SIZE = 16;

    public static void main(String[] args)
    {
        // p[i + 3] holds p(i), so p[0..2] are p(-3)..p(-1) = 0 and p[3] is p(0) = 1
        double p[] = new double[MAX_SIZE + 4];
        p[0] = 0d;
        p[1] = 0d;
        p[2] = 0d;
        p[3] = 1d;
        for (int i = 1; i <= MAX_SIZE; i++) {
            p[i + 3] = p[i + 2] / 2 + p[i + 1] / 32 + p[i] / 256 + p[i - 1] / 2048;
            System.out.printf("%e%n", p[i + 3]);
        }
    }
}
The probabilities for larger values of n are shown here:

| n | p(n) |
|---|---|
| 16 | 1.063853e-04 |
| 32 | 1.304427e-08 |
| 64 | 1.961083e-16 |
| 128 | 4.432493e-32 |
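The recurrence can also be tightened to count only the sequences a strict decoder accepts. Per RFC 3629 there are exactly 128 one-byte, 1920 two-byte, 61440 three-byte and 1048576 four-byte characters once overlong forms, surrogates and code points above U+10FFFF are excluded. Here's a sketch of the same recurrence with those exact weights; treat it as illustrative rather than verified against any particular decoder:

public class UTF8StrictRecurrence
{
    private final static int MAX_SIZE = 16;

    public static void main(String[] args)
    {
        // Exact counts of valid k-byte UTF-8 characters (RFC 3629),
        // each divided by 256^k to get the probability of a random k-byte prefix
        double c1 = 128d / 256d;
        double c2 = 1920d / (256d * 256d);
        double c3 = 61440d / (256d * 256d * 256d);
        double c4 = 1048576d / (256d * 256d * 256d * 256d);
        // p[i + 3] holds p(i), as in UTF8Recurrence above
        double[] p = new double[MAX_SIZE + 4];
        p[3] = 1d; // p(0) = 1; p(-1), p(-2), p(-3) default to 0
        for (int i = 1; i <= MAX_SIZE; i++) {
            p[i + 3] = c1 * p[i + 2] + c2 * p[i + 1] + c3 * p[i] + c4 * p[i - 1];
            System.out.printf("p(%d) = %e%n", i, p[i + 3]);
        }
    }
}

Whether this or the looser recurrence better matches the Monte Carlo figure depends on how strict the JDK's UTF-8 decoder actually is; older JDKs were lenient about some of these sequences.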
Andrew
/ September 30, 2011
Thanks for this! I’ve been curious what this probability was (and how to work it out), and I find it interesting that it is as low as 0.01% for a random sequence of just 16 bytes. I recently had to write some code to process multilingual CSV files, which involved figuring out the encoding from the first ‘few’ bytes of the file – hence my curiosity.
You should add this information to the Wikipedia page on UTF-8 (http://en.wikipedia.org/wiki/UTF-8). It currently only contains an estimate of the probability of a random set of NON-ASCII bytes being valid UTF-8, which isn’t very informative.
Random
/ April 2, 2015
Someone gave a mathematical answer:
http://math.stackexchange.com/questions/750551/probability-that-random-byte-array-is-a-valid-utf-8-string