com.google.json.JsonSanitizer Maven / Gradle / Ivy

Go to download
Show more of this group Show more artifacts with this name
Show all versions of json-sanitizer Show documentation
Given JSON-like content, converts it to valid JSON. This can be attached at either end of a data-pipeline to help satisfy Postel's principle: be conservative in what you do, be liberal in what you accept from others Applied to JSON-like content from others, it will produce well-formed JSON that should satisfy any parser you use. Applied to your output before you send, it will coerce minor mistakes in encoding and make it easier to embed your JSON in HTML and XML.
There is a newer version: 1.2.3
Show newest version
// Copyright (C) 2012 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package com.google.json;

import java.math.BigInteger;

/**
 * Given JSON-like content, converts it to valid JSON.
 * This can be attached at either end of a data-pipeline to help satisfy
 * Postel's principle:
 * 
 * be conservative in what you do, be liberal in what you accept from others
 * 
 * 
 * Applied to JSON-like content from others, it will produce well-formed JSON
 * that should satisfy any parser you use.
 * 

 * Applied to your output before you send, it will coerce minor mistakes in
 * encoding and make it easier to embed your JSON in HTML and XML.
 *
 * 
Input
 * The sanitizer takes JSON like content, and interprets it as JS eval would.
 * Specifically, it deals with these non-standard constructs.
 * 
 * {@code '...'} Single quoted strings are converted to JSON strings.
 * 
{@code \xAB} Hex escapes are converted to JSON unicode escapes.
 * 
{@code \012} Octal escapes are converted to JSON unicode escapes.
 * 
{@code 0xAB} Hex integer literals are converted to JSON decimal numbers.
 * 
{@code 012} Octal integer literals are converted to JSON decimal numbers.
 * 
{@code +.5} Decimal numbers are coerced to JSON's stricter format.
 * 
{@code [0,,2]} Elisions in arrays are filled with {@code null}.
 * 
{@code [1,2,3,]} Trailing commas are removed.
 * 
{foo:"bar"} Unquoted property names are quoted.
 * 
//comments JS style line and block comments are removed.
 * 
(...) Grouping parentheses are removed.
 * 
 *
 * The sanitizer fixes missing punctuation, end quotes, and mismatched or
 * missing close brackets. If an input contains only white-space then the valid
 * JSON string {@code null} is substituted.
 *
 * Output
 * The output is well-formed JSON as defined by
 * RFC 4627.
 * The output satisfies three additional properties:
 * 
 * The output will not contain the substring (case-insensitively)
 *   {@code "The output will not contain the substring {@code "]]>"} so can be
 *   embedded inside an XML CDATA section without further encoding.
 * The output is a valid Javascript expression, so can be parsed by
 *   Javascript's eval builtin (after being wrapped in parentheses)
 *   or by JSON.parse.
 *   Specifically, the output will not contain any string literals with embedded
 *   JS newlines (U+2028 Paragraph separator or U+2029 Line separator).
 * 
The output contains only valid Unicode scalar values
 *   (no isolated UTF-16 surrogates) that are
 *   allowed in XML unescaped.
 * 
 *
 * Security
 * Since the output is well-formed JSON, passing it to eval will
 * have no side-effects and no free variables, so is neither a code-injection
 * vector, nor a vector for exfiltration of secrets.
 *
 * This library only ensures that the JSON string → Javascript object
 * phase has no side effects and resolves no free variables, and cannot control
 * how other client side code later interprets the resulting Javascript object.
 * So if client-side code takes a part of the parsed data that is controlled by
 * an attacker and passes it back through a powerful interpreter like
 * {@code eval} or {@code innerHTML} then that client-side code might suffer
 * unintended side-effects.
 *
 * 
Efficiency
 * The sanitize method will return the input string without allocating a new
 * buffer when the input is already valid JSON that satisfies the properties
 * above.  Thus, if used on input that is usually well formed, it has minimal
 * memory overhead.
 * The sanitize method takes O(n) time where n is the length in UTF-16
 * code-units.
 */
public final class JsonSanitizer {

  /** The default for the maximumNestingDepth constructor parameter. */
  public static final int DEFAULT_NESTING_DEPTH = 64;

  /** The maximum value for the maximumNestingDepth constructor parameter. */
  public static final int MAXIMUM_NESTING_DEPTH = 4096;

  /**
   * Given JSON-like content, produces a string of JSON that is safe to embed,
   * safe to pass to JavaScript's {@code eval} operator.
   *
   * @param jsonish JSON-like content.
   * @return embeddable JSON
   */
  public static String sanitize(String jsonish) {
    return sanitize(jsonish, DEFAULT_NESTING_DEPTH);
  }

  /**
   * Same as {@link JsonSanitizer#sanitize(String)}, but allows to set a custom
   * maximum nesting depth.
   *
   * @param jsonish JSON-like content.
   * @param maximumNestingDepth maximum nesting depth.
   * @return embeddable JSON
   */
  public static String sanitize(String jsonish, int maximumNestingDepth) {
    JsonSanitizer s = new JsonSanitizer(jsonish, maximumNestingDepth);
    s.sanitize();
    return s.toString();
  }

  /**
   * Describes where we are in a state machine that consists of transitions on
   * complete values, colons, commas, and brackets.
   */
  private enum State {
    /**
     * Immediately after '[' and
     * {@link #BEFORE_ELEMENT before the first element}.
     */
    START_ARRAY,
    /** Before a JSON value in an array or at the top level. */
    BEFORE_ELEMENT,
    /**
     * After a JSON value in an array or at the top level, and before any
     * following comma or close bracket.
     */
    AFTER_ELEMENT,
    /** Immediately after '{' and {@link #BEFORE_KEY before the first key}. */
    START_MAP,
    /** Before a key in a key-value map. */
    BEFORE_KEY,
    /** After a key in a key-value map but before the required colon. */
    AFTER_KEY,
    /** Before a value in a key-value map. */
    BEFORE_VALUE,
    /**
     * After a value in a key-value map but before any following comma or
     * close bracket.
     */
    AFTER_VALUE,
    ;
  }

  /**
   * The maximum nesting depth. According to RFC4627 it is implementation-specific.
   */
  private final int maximumNestingDepth;

  private final String jsonish;

  /**
   * The number of brackets that have been entered and not subsequently exited.
   * Also, the length of the used prefix of {@link #isMap}.
   */
  private int bracketDepth;
  /**
   * {@code isMap[i]} when {@code 0 <= i && i < bracketDepth} is true iff
   * the i-th open bracket was a '{', not a '['.
   */
  private boolean[] isMap;
  /**
   * If non-null, then contains the sanitized form of
   * {@code jsonish.substring(0, cleaned)}.
   * If {@code null}, then no unclean constructs have been found in
   * {@code jsonish} yet.
   */
  private StringBuilder sanitizedJson;
  /**
   * The length of the prefix of {@link #jsonish} that has been written onto
   * {@link #sanitizedJson}.
   */
  private int cleaned;

  private static final boolean SUPER_VERBOSE_AND_SLOW_LOGGING = false;

  JsonSanitizer(String jsonish) {
    this(jsonish, DEFAULT_NESTING_DEPTH);
  }

  JsonSanitizer(String jsonish, int maximumNestingDepth) {
    this.maximumNestingDepth = Math.min(Math.max(1, maximumNestingDepth),MAXIMUM_NESTING_DEPTH);
    if (SUPER_VERBOSE_AND_SLOW_LOGGING) {
      System.err.println("\n" + jsonish + "\n========");
    }
    this.jsonish = jsonish != null ? jsonish : "null";
  }

  int getMaximumNestingDepth() {
    return this.maximumNestingDepth;
  }

  void sanitize() {
    // Return to consistent state.
    bracketDepth = cleaned = 0;
    sanitizedJson = null;

    State state = State.START_ARRAY;
    int n = jsonish.length();

    // Walk over each token and either validate it, by just advancing i and
    // computing the next state, or manipulate cleaned&sanitizedJson so that
    // sanitizedJson contains the sanitized equivalent of
    // jsonish.substring(0, cleaned).
    token_loop:
    for (int i = 0; i < n; ++i) {
      try {
        char ch = jsonish.charAt(i);
        if (SUPER_VERBOSE_AND_SLOW_LOGGING) {
          String sanitizedJsonStr =
            (sanitizedJson == null ? "" : sanitizedJson)
            + jsonish.substring(cleaned, i);
          System.err.println("i=" + i + ", ch=" + ch + ", state=" + state
                             + ", sanitized=" + sanitizedJsonStr);
        }
        switch (ch) {
          case '\t': case '\n': case '\r': case ' ':
            break;

          case '"': case '\'':
            state = requireValueState(i, state, true);
            int strEnd = endOfQuotedString(jsonish, i);
            sanitizeString(i, strEnd);
            i = strEnd - 1;
            break;

          case '(': case ')':
            // Often JSON-like content which is meant for use by eval is
            // wrapped in parentheses so that the JS parser treats contained
            // curly brackets as part of an object constructor instead of a
            // block statement.
            // We elide these grouping parentheses to ensure valid JSON.
            elide(i, i + 1);
            break;

          case '{': case '[':
            state = requireValueState(i, state, false);
            if (isMap == null) {
              isMap = new boolean[maximumNestingDepth];
            }
            boolean map = ch == '{';
            isMap[bracketDepth] = map;
            ++bracketDepth;
            state = map ? State.START_MAP : State.START_ARRAY;
            break;

          case '}': case ']':
            if (bracketDepth == 0) {
              elide(i, jsonish.length());
              break token_loop;
            }

            // Strip trailing comma to convert {"a":0,} -> {"a":0}
            // and [1,2,3,] -> [1,2,3,]
            switch (state) {
              case BEFORE_VALUE:
                insert(i, "null");
                break;
              case BEFORE_ELEMENT: case BEFORE_KEY:
                elideTrailingComma(i);
                break;
              case AFTER_KEY:
                insert(i, ":null");
                break;
              case START_MAP: case START_ARRAY:
              case AFTER_ELEMENT: case AFTER_VALUE: break;
            }

            --bracketDepth;
            char closeBracket = isMap[bracketDepth] ? '}' : ']';
            if (ch != closeBracket) {
              replace(i, i + 1, closeBracket);
            }
            state = bracketDepth == 0 || !isMap[bracketDepth - 1]
                ? State.AFTER_ELEMENT : State.AFTER_VALUE;
            break;
          case ',':
            if (bracketDepth == 0) { throw UNBRACKETED_COMMA; }
            // Convert comma elisions like [1,,3] to [1,null,3].
            // [1,,3] in JS is an array that has no element at index 1
            // according to the "in" operator so accessing index 1 will
            // yield the special value "undefined" which is equivalent to
            // JS's "null" value according to "==".
            switch (state) {
              // Normal
              case AFTER_ELEMENT:
                state = State.BEFORE_ELEMENT;
                break;
              case AFTER_VALUE:
                state = State.BEFORE_KEY;
                break;
              // Array elision.
              case START_ARRAY: case BEFORE_ELEMENT:
                insert(i, "null");
                state = State.BEFORE_ELEMENT;
                break;
              // Ignore
              case START_MAP: case BEFORE_KEY:
              case AFTER_KEY:
                elide(i, i + 1);
                break;
              // Supply missing value.
              case BEFORE_VALUE:
                insert(i, "null");
                state = State.BEFORE_KEY;
                break;
            }
            break;

          case ':':
            if (state == State.AFTER_KEY) {
              state = State.BEFORE_VALUE;
            } else {
              elide(i, i + 1);
            }
            break;

          case '/':
            // Skip over JS-style comments since people like inserting them into
            // data files and getting huffy with Crockford when he says no to
            // versioning JSON to allow ignorable tokens.
            int end = i + 1;
            if (i + 1 < n) {
              switch (jsonish.charAt(i + 1)) {
                case '/':
                  end = n;  // Worst case.
                  for (int j = i + 2; j < n; ++j) {
                    char cch = jsonish.charAt(j);
                    if (cch == '\n' || cch == '\r'
                        || cch == '\u2028' || cch == '\u2029') {
                      end = j + 1;
                      break;
                    }
                  }
                  break;
                case '*':
                  end = n;
                  if (i + 3 < n) {
                    for (int j = i + 2;
                         (j = jsonish.indexOf('/', j + 1)) >= 0;) {
                      if (jsonish.charAt(j - 1) == '*') {
                        end = j + 1;
                        break;
                      }
                    }
                  }
                  break;
              }
            }
            elide(i, end);
            i = end - 1;
            break;

          default:
            // Three kinds of other values can occur.
            // 1. Numbers
            // 2. Keyword values ("false", "null", "true")
            // 3. Unquoted JS property names as in the JS expression
            //      ({ foo: "bar"})
            //    which is equivalent to the JSON
            //      { "foo": "bar" }
            // 4. Cruft tokens like BOMs.

            // Look for a run of '.', [0-9], [a-zA-Z_$], [+-] which subsumes
            // all the above without including any JSON special characters
            // outside keyword and number.
            int runEnd;
            for (runEnd = i; runEnd < n; ++runEnd) {
              char tch = jsonish.charAt(runEnd);
              if (('a' <= tch && tch <= 'z') || ('0' <= tch && tch <= '9')
                  || tch == '+' || tch == '-' || tch == '.'
                  || ('A' <= tch && tch <= 'Z') || tch == '_' || tch == '$') {
                continue;
              }
              break;
            }

            if (runEnd == i) {
              elide(i, i + 1);
              break;
            }

            state = requireValueState(i, state, true);

            boolean isNumber = ('0' <= ch && ch <= '9')
               || ch == '.' || ch == '+' || ch == '-';
            boolean isKeyword = !isNumber && isKeyword(i, runEnd);

            if (!(isNumber || isKeyword)) {
              // We're going to have to quote the output.  Further expand to
              // include more of an unquoted token in a string.
              for (; runEnd < n; ++runEnd) {
                if (isJsonSpecialChar(runEnd)) {
                  break;
                }
              }
              if (runEnd < n && jsonish.charAt(runEnd) == '"') {
                ++runEnd;
              }
            }

            if (state == State.AFTER_KEY) {
              // We need to quote whatever we have since it is used as a
              // property name in a map and only quoted strings can be used that
              // way in JSON.
              insert(i, '"');
              if (isNumber) {
                // By JS rules,
                //   { .5e-1: "bar" }
                // is the same as
                //   { "0.05": "bar" }
                // because a number literal is converted to its string form
                // before being used as a property name.
                canonicalizeNumber(i, runEnd);
                // We intentionally ignore the return value of canonicalize.
                // Uncanonicalizable numbers just get put straight through as
                // string values.
                insert(runEnd, '"');
              } else {
                sanitizeString(i, runEnd);
              }
            } else {
              if (isNumber) {
                // Convert hex and octal constants to decimal and ensure that
                // integer and fraction portions are not empty.
                normalizeNumber(i, runEnd);
              } else if (!isKeyword) {
                // Treat as an unquoted string literal.
                insert(i, '"');
                sanitizeString(i, runEnd);
              }
            }
            i = runEnd - 1;
        }
      } catch (@SuppressWarnings("unused") UnbracketedComma e) {
        elide(i, jsonish.length());
        break;
      }
    }

    if (state == State.START_ARRAY && bracketDepth == 0) {
      // No tokens.  Only whitespace
      insert(n, "null");
      state = State.AFTER_ELEMENT;
    }

    if (SUPER_VERBOSE_AND_SLOW_LOGGING) {
      System.err.println(
          "state=" + state + ", sanitizedJson=" + sanitizedJson
          + ", cleaned=" + cleaned + ", bracketDepth=" + bracketDepth);
    }

    if ((sanitizedJson != null && sanitizedJson.length() != 0)
        || cleaned != 0 || bracketDepth != 0) {
      if (sanitizedJson == null) {
        sanitizedJson = new StringBuilder(n + bracketDepth);
      }
      sanitizedJson.append(jsonish, cleaned, n);
      cleaned = n;

      switch (state) {
        case BEFORE_ELEMENT: case BEFORE_KEY:
          elideTrailingComma(n);
          break;
        case AFTER_KEY:
          sanitizedJson.append(":null");
          break;
        case BEFORE_VALUE:
          sanitizedJson.append("null");
          break;
        default: break;
      }

      // Insert brackets to close unclosed content.
      while (bracketDepth != 0) {
        sanitizedJson.append(isMap[--bracketDepth] ? '}' : ']');
      }
    }
  }

  /**
   * Ensures that the output corresponding to {@code jsonish[start:end]} is a
   * valid JSON string that has the same meaning when parsed by Javascript
   * {@code eval}.
   * 

   *   Making sure that it is fully quoted with double-quotes.
   *   
Escaping any Javascript newlines : CR, LF, U+2028, U+2029
   *   
Escaping HTML special characters to allow it to be safely embedded
   *       in HTML {@code