Wie entferne ich ein Java-String-Literal in Java?

75

Ich verarbeite Java-Quellcode mit Java. Ich extrahiere die String-Literale und füge sie einer Funktion zu, die einen String nimmt. Das Problem ist, dass ich die nicht entkappte Version des Strings an die Funktion übergeben muss (dh dies bedeutet, dass \nin eine neue Zeile und \\in eine einzelne Zeile \usw. konvertiert wird ).

Gibt es eine Funktion in der Java-API, die dies tut? Wenn nicht, kann ich solche Funktionen aus einer Bibliothek erhalten? Offensichtlich muss der Java-Compiler diese Konvertierung durchführen.

Falls jemand es wissen möchte, versuche ich, String-Literale in dekompilierten, verschleierten Java-Dateien zu verschleiern.

Zikzystar
quelle

Antworten:

103

Das Problem

Die org.apache.commons.lang.StringEscapeUtils.unescapeJava()hier als weitere Antwort gegebene ist wirklich sehr wenig Hilfe.

  • Es vergisst \0für null.
  • Es behandelt nicht Oktal überhaupt .
  • Es kann nicht die Art von Fluchten Griff durch die zugelassen java.util.regex.Pattern.compile()und alles, was verwendet es, auch \a, \eund vor allem \cX.
  • Logische Unicode-Codepunkte nach Nummer werden nicht unterstützt, nur für UTF-16.
  • Dies sieht aus wie UCS-2-Code, nicht wie UTF-16-Code: Sie verwenden die veraltete charAtSchnittstelle anstelle der codePointSchnittstelle und verbreiten so die Täuschung, dass Java chargarantiert ein Unicode-Zeichen enthält. Es ist nicht. Sie kommen nur damit durch, weil kein UTF-16-Ersatz nach etwas sucht, das sie suchen.

Die Lösung

Ich habe einen String Unescaper geschrieben, der die Frage des OP ohne alle Irritationen des Apache-Codes löst.

/*
 *
 * unescape_perl_string()
 *
 *      Tom Christiansen <[email protected]>
 *      Sun Nov 28 12:55:24 MST 2010
 *
 * It's completely ridiculous that there's no standard
 * unescape_java_string function.  Since I have to do the
 * damn thing myself, I might as well make it halfway useful
 * by supporting things Java was too stupid to consider in
 * strings:
 * 
 *   => "?" items  are additions to Java string escapes
 *                 but normal in Java regexes
 *
 *   => "!" items  are also additions to Java regex escapes
 *   
 * Standard singletons: ?\a ?\e \f \n \r \t
 * 
 *      NB: \b is unsupported as backspace so it can pass-through
 *          to the regex translator untouched; I refuse to make anyone
 *          doublebackslash it as doublebackslashing is a Java idiocy
 *          I desperately wish would die out.  There are plenty of
 *          other ways to write it:
 *
 *              \cH, \12, \012, \x08 \x{8}, \u0008, \U00000008
 *
 * Octal escapes: \0 \0N \0NN \N \NN \NNN
 *    Can range up to !\777 not \377
 *    
 *      TODO: add !\o{NNNNN}
 *          last Unicode is 4177777
 *          maxint is 37777777777
 *
 * Control chars: ?\cX
 *      Means: ord(X) ^ ord('@')
 *
 * Old hex escapes: \xXX
 *      unbraced must be 2 xdigits
 *
 * Perl hex escapes: !\x{XXX} braced may be 1-8 xdigits
 *       NB: proper Unicode never needs more than 6, as highest
 *           valid codepoint is 0x10FFFF, not maxint 0xFFFFFFFF
 *
 * Lame Java escape: \[IDIOT JAVA PREPROCESSOR]uXXXX must be
 *                   exactly 4 xdigits;
 *
 *       I can't write XXXX in this comment where it belongs
 *       because the damned Java Preprocessor can't mind its
 *       own business.  Idiots!
 *
 * Lame Python escape: !\UXXXXXXXX must be exactly 8 xdigits
 * 
 * TODO: Perl translation escapes: \Q \U \L \E \[IDIOT JAVA PREPROCESSOR]u \l
 *       These are not so important to cover if you're passing the
 *       result to Pattern.compile(), since it handles them for you
 *       further downstream.  Hm, what about \[IDIOT JAVA PREPROCESSOR]u?
 *
 */

public final static
String unescape_perl_string(String oldstr) {

    /*
     * In contrast to fixing Java's broken regex charclasses,
     * this one need be no bigger, as unescaping shrinks the string
     * here, where in the other one, it grows it.
     */

    StringBuffer newstr = new StringBuffer(oldstr.length());

    boolean saw_backslash = false;

    for (int i = 0; i < oldstr.length(); i++) {
        int cp = oldstr.codePointAt(i);
        if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
            i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
        }

        if (!saw_backslash) {
            if (cp == '\\') {
                saw_backslash = true;
            } else {
                newstr.append(Character.toChars(cp));
            }
            continue; /* switch */
        }

        if (cp == '\\') {
            saw_backslash = false;
            newstr.append('\\');
            newstr.append('\\');
            continue; /* switch */
        }

        switch (cp) {

            case 'r':  newstr.append('\r');
                       break; /* switch */

            case 'n':  newstr.append('\n');
                       break; /* switch */

            case 'f':  newstr.append('\f');
                       break; /* switch */

            /* PASS a \b THROUGH!! */
            case 'b':  newstr.append("\\b");
                       break; /* switch */

            case 't':  newstr.append('\t');
                       break; /* switch */

            case 'a':  newstr.append('\007');
                       break; /* switch */

            case 'e':  newstr.append('\033');
                       break; /* switch */

            /*
             * A "control" character is what you get when you xor its
             * codepoint with '@'==64.  This only makes sense for ASCII,
             * and may not yield a "control" character after all.
             *
             * Strange but true: "\c{" is ";", "\c}" is "=", etc.
             */
            case 'c':   {
                if (++i == oldstr.length()) { die("trailing \\c"); }
                cp = oldstr.codePointAt(i);
                /*
                 * don't need to grok surrogates, as next line blows them up
                 */
                if (cp > 0x7f) { die("expected ASCII after \\c"); }
                newstr.append(Character.toChars(cp ^ 64));
                break; /* switch */
            }

            case '8':
            case '9': die("illegal octal digit");
                      /* NOTREACHED */

    /*
     * may be 0 to 2 octal digits following this one
     * so back up one for fallthrough to next case;
     * unread this digit and fall through to next case.
     */
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7': --i;
                      /* FALLTHROUGH */

            /*
             * Can have 0, 1, or 2 octal digits following a 0
             * this permits larger values than octal 377, up to
             * octal 777.
             */
            case '0': {
                if (i+1 == oldstr.length()) {
                    /* found \0 at end of string */
                    newstr.append(Character.toChars(0));
                    break; /* switch */
                }
                i++;
                int digits = 0;
                int j;
                for (j = 0; j <= 2; j++) {
                    if (i+j == oldstr.length()) {
                        break; /* for */
                    }
                    /* safe because will unread surrogate */
                    int ch = oldstr.charAt(i+j);
                    if (ch < '0' || ch > '7') {
                        break; /* for */
                    }
                    digits++;
                }
                if (digits == 0) {
                    --i;
                    newstr.append('\0');
                    break; /* switch */
                }
                int value = 0;
                try {
                    value = Integer.parseInt(
                                oldstr.substring(i, i+digits), 8);
                } catch (NumberFormatException nfe) {
                    die("invalid octal value for \\0 escape");
                }
                newstr.append(Character.toChars(value));
                i += digits-1;
                break; /* switch */
            } /* end case '0' */

            case 'x':  {
                if (i+2 > oldstr.length()) {
                    die("string too short for \\x escape");
                }
                i++;
                boolean saw_brace = false;
                if (oldstr.charAt(i) == '{') {
                        /* ^^^^^^ ok to ignore surrogates here */
                    i++;
                    saw_brace = true;
                }
                int j;
                for (j = 0; j < 8; j++) {

                    if (!saw_brace && j == 2) {
                        break;  /* for */
                    }

                    /*
                     * ASCII test also catches surrogates
                     */
                    int ch = oldstr.charAt(i+j);
                    if (ch > 127) {
                        die("illegal non-ASCII hex digit in \\x escape");
                    }

                    if (saw_brace && ch == '}') { break; /* for */ }

                    if (! ( (ch >= '0' && ch <= '9')
                                ||
                            (ch >= 'a' && ch <= 'f')
                                ||
                            (ch >= 'A' && ch <= 'F')
                          )
                       )
                    {
                        die(String.format(
                            "illegal hex digit #%d '%c' in \\x", ch, ch));
                    }

                }
                if (j == 0) { die("empty braces in \\x{} escape"); }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\x escape");
                }
                newstr.append(Character.toChars(value));
                if (saw_brace) { j++; }
                i += j-1;
                break; /* switch */
            }

            case 'u': {
                if (i+4 > oldstr.length()) {
                    die("string too short for \\u escape");
                }
                i++;
                int j;
                for (j = 0; j < 4; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\u escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt( oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\u escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            case 'U': {
                if (i+8 > oldstr.length()) {
                    die("string too short for \\U escape");
                }
                i++;
                int j;
                for (j = 0; j < 8; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\U escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\U escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            default:   newstr.append('\\');
                       newstr.append(Character.toChars(cp));
           /*
            * say(String.format(
            *       "DEFAULT unrecognized escape %c passed through",
            *       cp));
            */
                       break; /* switch */

        }
        saw_backslash = false;
    }

    /* weird to leave one at the end */
    if (saw_backslash) {
        newstr.append('\\');
    }

    return newstr.toString();
}

/*
 * Return a string "U+XX.XXX.XXXX" etc, where each XX set is the
 * xdigits of the logical Unicode code point. No bloody brain-damaged
 * UTF-16 surrogate crap, just true logical characters.
 */
 public final static
 String uniplus(String s) {
     if (s.length() == 0) {
         return "";
     }
     /* This is just the minimum; sb will grow as needed. */
     StringBuffer sb = new StringBuffer(2 + 3 * s.length());
     sb.append("U+");
     for (int i = 0; i < s.length(); i++) {
         sb.append(String.format("%X", s.codePointAt(i)));
         if (s.codePointAt(i) > Character.MAX_VALUE) {
             i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
         }
         if (i+1 < s.length()) {
             sb.append(".");
         }
     }
     return sb.toString();
 }

private static final
void die(String foa) {
    throw new IllegalArgumentException(foa);
}

private static final
void say(String what) {
    System.out.println(what);
}

If it helps others, you’re welcome to it — no strings attached. If you improve it, I’d love for you to mail me your enhancements, but you certainly don’t have to.

tchrist
quelle
why is your routine called unescape_perl_string? Also, isn't doing all the extra unescaping for things not defined by the spec a bug since java itself wouldn't interpret a literal that way? Just making sure I'm not missing anything here - the code is complex enough that I'm a bit worried about all the extra bits.
bright
@tchrist Do you have an idea if the issues you described with Apache's approach are still valid or if they fixed it?
sjngm
1
@tchrist Thanks. I just tried your neat method with a text file containing foo\\bar and it returned foo\\bar. I'd have expected it to be foo\bar. Is this a bug or am I misunderstanding the idea behind the method?
sjngm
@sjngm As written it is indeed intended to do a pass-through on unrecognized backslash escapes instead of stripping off one level, because I find the endlessly increasingly double, triple, quadruple, etc backslashing trend to be hard to read and easy to screw up.
tchrist
3
Apache Commons Lang3 fixed a number of these things (at least the octal bit, that was getting me) issues.apache.org/jira/browse/LANG-646
Drew Stephens
48

You can use String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.

Here's an example snippet:

    String in = "a\\tb\\n\\\"c\\\"";

    System.out.println(in);
    // a\tb\n\"c\"

    String out = StringEscapeUtils.unescapeJava(in);

    System.out.println(out);
    // a    b
    // "c"

The utility class has methods to escapes and unescape strings for Java, Java Script, HTML, XML, and SQL. It also has overloads that writes directly to a java.io.Writer.


Caveats

It looks like StringEscapeUtils handles Unicode escapes with one u, but not octal escapes, or Unicode escapes with extraneous us.

    /* Unicode escape test #1: PASS */
    
    System.out.println(
        "\u0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\u0030")
    ); // 0
    System.out.println(
        "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
    ); // true
    
    /* Octal escape test: FAIL */
    
    System.out.println(
        "\45"
    ); // %
    System.out.println(
        StringEscapeUtils.unescapeJava("\\45")
    ); // 45
    System.out.println(
        "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
    ); // false

    /* Unicode escape test #2: FAIL */
    
    System.out.println(
        "\uu0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\uu0030")
    ); // throws NestableRuntimeException:
       //   Unable to parse unicode value: u003

A quote from the JLS:

Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.

The extraneous u is also documented as follows:

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u-for example, \uxxxx becomes \uuxxxx-while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a compiler for the Java programming language and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

If your string can contain Unicode escapes with extraneous u, then you may also need to preprocess this before using StringEscapeUtils.

Alternatively you can try to write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.

References

polygenelubricants
quelle
I've found this library, too (after posting). I'm experiencing problems with the octal values. I'm currently trying to convert them by hand.
ziggystar
Uh, converting the octals to unicode isn't trivial. Only 0-127 can be easily mapped. Is this right?
ziggystar
@ziggystar: Looks like JLS says 0-255 instead (see quote). The largest octal escape is \377.
polygenelubricants
Ah, you're right. The spec says the octals are unicode values. I suspected them to be ASCII values.
ziggystar
I have this \xf3 that is a accent in spanish. How do I unscape it?
Tiger developer
17

Came across a similar problem, wasn't also satisfied with the presented solutions and implemented this one myself.

Also available as a Gist on Github:

/**
 * Unescapes a string that contains standard Java escape sequences.
 * <ul>
 * <li><strong>&#92;b &#92;f &#92;n &#92;r &#92;t &#92;" &#92;'</strong> :
 * BS, FF, NL, CR, TAB, double and single quote.</li>
 * <li><strong>&#92;X &#92;XX &#92;XXX</strong> : Octal character
 * specification (0 - 377, 0x00 - 0xFF).</li>
 * <li><strong>&#92;uXXXX</strong> : Hexadecimal based Unicode character.</li>
 * </ul>
 * 
 * @param st
 *            A string optionally containing standard java escape sequences.
 * @return The translated string.
 */
public String unescapeJavaString(String st) {

    StringBuilder sb = new StringBuilder(st.length());

    for (int i = 0; i < st.length(); i++) {
        char ch = st.charAt(i);
        if (ch == '\\') {
            char nextChar = (i == st.length() - 1) ? '\\' : st
                    .charAt(i + 1);
            // Octal escape?
            if (nextChar >= '0' && nextChar <= '7') {
                String code = "" + nextChar;
                i++;
                if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                        && st.charAt(i + 1) <= '7') {
                    code += st.charAt(i + 1);
                    i++;
                    if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                            && st.charAt(i + 1) <= '7') {
                        code += st.charAt(i + 1);
                        i++;
                    }
                }
                sb.append((char) Integer.parseInt(code, 8));
                continue;
            }
            switch (nextChar) {
            case '\\':
                ch = '\\';
                break;
            case 'b':
                ch = '\b';
                break;
            case 'f':
                ch = '\f';
                break;
            case 'n':
                ch = '\n';
                break;
            case 'r':
                ch = '\r';
                break;
            case 't':
                ch = '\t';
                break;
            case '\"':
                ch = '\"';
                break;
            case '\'':
                ch = '\'';
                break;
            // Hex Unicode: u????
            case 'u':
                if (i >= st.length() - 5) {
                    ch = 'u';
                    break;
                }
                int code = Integer.parseInt(
                        "" + st.charAt(i + 2) + st.charAt(i + 3)
                                + st.charAt(i + 4) + st.charAt(i + 5), 16);
                sb.append(Character.toChars(code));
                i += 5;
                continue;
            }
            i++;
        }
        sb.append(ch);
    }
    return sb.toString();
}
Udo Klimaschewski
quelle
3
For extra caution, make a null check at first, and return null in that case. But thanks!
Andreas Rudolph
8

I know this question was old, but I wanted a solution that doesn't involve libraries outside those included JRE6 (i.e. Apache Commons is not acceptable), and I came up with a simple solution using the built-in java.io.StreamTokenizer:

import java.io.*;

// ...

String literal = "\"Has \\\"\\\\\\\t\\\" & isn\\\'t \\\r\\\n on 1 line.\"";
StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
String result;
try {
  parser.nextToken();
  if (parser.ttype == '"') {
    result = parser.sval;
  }
  else {
    result = "ERROR!";
  }
}
catch (IOException e) {
  result = e.toString();
}
System.out.println(result);

Output:

Has "\  " & isn't
 on 1 line.
DaoWen
quelle
1
@UdoKlimaschewski - You're right. You can look at the source to see which escapes it actually supports.
DaoWen
6

I'm a little late on this, but I thought I'd provide my solution since I needed the same functionality. I decided to use the Java Compiler API which makes it slower, but makes the results accurate. Basically I live create a class then return the results. Here is the method:

public static String[] unescapeJavaStrings(String... escaped) {
    //class name
    final String className = "Temp" + System.currentTimeMillis();
    //build the source
    final StringBuilder source = new StringBuilder(100 + escaped.length * 20).
            append("public class ").append(className).append("{\n").
            append("\tpublic static String[] getStrings() {\n").
            append("\t\treturn new String[] {\n");
    for (String string : escaped) {
        source.append("\t\t\t\"");
        //we escape non-escaped quotes here to be safe 
        //  (but something like \\" will fail, oh well for now)
        for (int i = 0; i < string.length(); i++) {
            char chr = string.charAt(i);
            if (chr == '"' && i > 0 && string.charAt(i - 1) != '\\') {
                source.append('\\');
            }
            source.append(chr);
        }
        source.append("\",\n");
    }
    source.append("\t\t};\n\t}\n}\n");
    //obtain compiler
    final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    //local stream for output
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    //local stream for error
    ByteArrayOutputStream err = new ByteArrayOutputStream();
    //source file
    JavaFileObject sourceFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE) {
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException {
            return source;
        }
    };
    //target file
    final JavaFileObject targetFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.CLASS.extension), Kind.CLASS) {
        @Override
        public OutputStream openOutputStream() throws IOException {
            return out;
        }
    };
    //file manager proxy, with most parts delegated to the standard one 
    JavaFileManager fileManagerProxy = (JavaFileManager) Proxy.newProxyInstance(
            StringUtils.class.getClassLoader(), new Class[] { JavaFileManager.class },
            new InvocationHandler() {
                //standard file manager to delegate to
                private final JavaFileManager standard = 
                    compiler.getStandardFileManager(null, null, null); 
                @Override
                public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                    if ("getJavaFileForOutput".equals(method.getName())) {
                        //return the target file when it's asking for output
                        return targetFile;
                    } else {
                        return method.invoke(standard, args);
                    }
                }
            });
    //create the task
    CompilationTask task = compiler.getTask(new OutputStreamWriter(err), 
            fileManagerProxy, null, null, null, Collections.singleton(sourceFile));
    //call it
    if (!task.call()) {
        throw new RuntimeException("Compilation failed, output:\n" + 
                new String(err.toByteArray()));
    }
    //get the result
    final byte[] bytes = out.toByteArray();
    //load class
    Class<?> clazz;
    try {
        //custom class loader for garbage collection
        clazz = new ClassLoader() { 
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                if (name.equals(className)) {
                    return defineClass(className, bytes, 0, bytes.length);
                } else {
                    return super.findClass(name);
                }
            }
        }.loadClass(className);
    } catch (ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
    //reflectively call method
    try {
        return (String[]) clazz.getDeclaredMethod("getStrings").invoke(null);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

It takes an array so you can unescape in batches. So the following simple test succeeds:

public static void main(String[] meh) {
    if ("1\02\03\n".equals(unescapeJavaStrings("1\\02\\03\\n")[0])) {
        System.out.println("Success");
    } else {
        System.out.println("Failure");
    }
}
Chad Retz
quelle
4

For the record, if you use Scala, you can do:

StringContext.treatEscapes(escaped)
Tvaroh
quelle
It deals only with ` [\b, \t, \n, \f, \r, \\, \", \']`, it cannot cope with unicode escapes.
Andrey Tyukin
3

I came across the same problem, but I wasn't enamoured by any of the solutions I found here. So, I wrote one that iterates over the characters of the string using a matcher to find and replace the escape sequences. This solution assumes properly formatted input. That is, it happily skips over nonsensical escapes, and it decodes Unicode escapes for line feed and carriage return (which otherwise cannot appear in a character literal or a string literal, due to the definition of such literals and the order of translation phases for Java source). Apologies, the code is a bit packed for brevity.

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Decoder {

    // The encoded character of each character escape.
    // This array functions as the keys of a sorted map, from encoded characters to decoded characters.
    static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\',  'b',  'f',  'n',  'r',  't' };

    // The decoded character of each character escape.
    // This array functions as the values of a sorted map, from encoded characters to decoded characters.
    static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };

    // A pattern that matches an escape.
    // What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
    static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");

    public static CharSequence decodeString(CharSequence encodedString) {
        Matcher matcher = PATTERN.matcher(encodedString);
        StringBuffer decodedString = new StringBuffer();
        // Find each escape of the encoded string in succession.
        while (matcher.find()) {
            char ch;
            if (matcher.start(1) >= 0) {
                // Decode a character escape.
                ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
            } else if (matcher.start(2) >= 0) {
                // Decode an octal escape.
                ch = (char)(Integer.parseInt(matcher.group(2), 8));
            } else /* if (matcher.start(3) >= 0) */ {
                // Decode a Unicode escape.
                ch = (char)(Integer.parseInt(matcher.group(3), 16));
            }
            // Replace the escape with the decoded character.
            matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        // Append the remainder of the encoded string to the decoded string.
        // The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
        matcher.appendTail(decodedString);
        return decodedString;
    }

    public static void main(String... args) {
        System.out.println(decodeString(args[0]));
    }
}

I should note that Apache Commons Lang3 doesn't seem to suffer the weaknesses indicated in the accepted solution. That is, StringEscapeUtils seems to handle octal escapes and multiple u characters of Unicode escapes. That means unless you have some burning reason to avoid Apache Commons, you should probably use it rather than my solution (or any other solution here).

Nathan Ryan
quelle
2

org.apache.commons.lang3.StringEscapeUtils from commons-lang3 is marked deprecated now. You can use org.apache.commons.text.StringEscapeUtils#unescapeJava(String) instead. It requires an additional Maven dependency:

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.4</version>
        </dependency>

and seems to handle some more special cases, it e.g. unescapes:

  • escaped backslashes, single and double quotes
  • escaped octal and unicode values
  • \\b, \\n, \\t, \\f, \\r
Jens Piegsa
quelle
2

Java 13 added a method which does this: String#translateEscapes.

It was a preview feature in Java 13 and 14, but was promoted to a full feature in Java 15.

Alex - GlassEditor.com
quelle
0

If you are reading unicode escaped chars from a file, then you will have a tough time doing that because the string will be read literally along with an escape for the back slash:

my_file.txt

Blah blah...
Column delimiter=;
Word delimiter=\u0020 #This is just unicode for whitespace

.. more stuff

Here, when you read line 3 from the file the string/line will have:

"Word delimiter=\u0020 #This is just unicode for whitespace"

and the char[] in the string will show:

{...., '=', '\\', 'u', '0', '0', '2', '0', ' ', '#', 't', 'h', ...}

Commons StringUnescape will not unescape this for you (I tried unescapeXml()). You'll have to do it manually as described here.

So, the sub-string "\u0020" should become 1 single char '\u0020'

But if you are using this "\u0020" to do String.split("... ..... ..", columnDelimiterReadFromFile) which is really using regex internally, it will work directly because the string read from file was escaped and is perfect to use in the regex pattern!! (Confused?)

Ashwin Jayaprakash
quelle