The Problem
The org.apache.commons.lang.StringEscapeUtils.unescapeJava() given here as another answer is really very little help at all.
- It forgets about \0 for null.
- It doesn't handle octal at all.
- It can't handle the sorts of escapes admitted by java.util.regex.Pattern.compile() and everything that uses it, including \a, \e, and especially \cX.
- It has no support for logical Unicode code points by number, only for UTF-16.
- This looks like UCS-2 code, not UTF-16 code: they use the outdated char-based charAt interface instead of the codePoint interface, thus promulgating the delusion that a Java char is guaranteed to hold a Unicode character. It's not. They only get away with this because no UTF-16 surrogate will wind up looking like anything they're looking for.
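To make two of these points concrete, here is a minimal sketch (the class and helper names are made up for illustration). It shows that a character outside the Basic Multilingual Plane takes two Java chars but is only one code point, and that java.util.regex.Pattern understands \a, \e and \cX on its own.

```java
import java.util.regex.Pattern;

final class EscapeFacts {
    // Number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the regex, given as raw pattern text, matches the whole input.
    static boolean regexMatches(String rawPattern, String input) {
        return Pattern.matches(rawPattern, input);
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));                        // 2
        System.out.println(codePoints(clef));                       // 1
        System.out.println(Character.isSurrogate(clef.charAt(0)));  // true

        // The regex compiler itself resolves these escapes.
        System.out.println(regexMatches("\\a", "\u0007"));   // true (BEL)
        System.out.println(regexMatches("\\e", "\u001B"));   // true (ESC)
        System.out.println(regexMatches("\\cH", "\b"));      // true ('H' ^ 64 == 8)
    }
}
```

Code that walks a string with charAt sees the clef as two separate surrogate chars, which is why the unescaper below iterates by code point instead.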
The Solution
I wrote a string unescaper which solves the OP's question without all the irritations of the Apache code.
    public final static
    String unescape_perl_string(String oldstr) {

        // Unescaping never lengthens the string, so this capacity suffices.
        StringBuffer newstr = new StringBuffer(oldstr.length());

        boolean saw_backslash = false;

        for (int i = 0; i < oldstr.length(); i++) {
            int cp = oldstr.codePointAt(i);
            if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
                i++; // supplementary code point: skip its second surrogate
            }

            if (!saw_backslash) {
                if (cp == '\\') {
                    saw_backslash = true;
                } else {
                    newstr.append(Character.toChars(cp));
                }
                continue;
            }

            // An escaped backslash is passed through still escaped.
            if (cp == '\\') {
                saw_backslash = false;
                newstr.append('\\');
                newstr.append('\\');
                continue;
            }

            switch (cp) {

                case 'r':  newstr.append('\r');
                           break;

                case 'n':  newstr.append('\n');
                           break;

                case 'f':  newstr.append('\f');
                           break;

                // \b is passed through as backslash + b, not as a backspace.
                case 'b':  newstr.append("\\b");
                           break;

                case 't':  newstr.append('\t');
                           break;

                case 'a':  newstr.append('\007');
                           break;

                case 'e':  newstr.append('\033');
                           break;

                // \cX is the control character X ^ 64.
                case 'c':   {
                    if (++i == oldstr.length()) { die("trailing \\c"); }
                    cp = oldstr.codePointAt(i);
                    if (cp > 0x7f) { die("expected ASCII after \\c"); }
                    newstr.append(Character.toChars(cp ^ 64));
                    break;
                }
                case '8':
                case '9': die("illegal octal digit");
                          /* NOTREACHED */

                // A leading 1-7 is itself the first octal digit: back up
                // one so the shared code below rereads it.
                case '1':
                case '2':
                case '3':
                case '4':
                case '5':
                case '6':
                case '7': --i;
                          /* FALLTHROUGH */

                // Up to three octal digits are read here, so values up to
                // octal 777 are accepted, not just 377.
                case '0': {
                    if (i+1 == oldstr.length()) {
                        // \0 at the very end of the string
                        newstr.append(Character.toChars(0));
                        break;
                    }
                    i++;
                    int digits = 0;
                    int j;
                    for (j = 0; j <= 2; j++) {
                        if (i+j == oldstr.length()) {
                            break;
                        }
                        int ch = oldstr.charAt(i+j);
                        if (ch < '0' || ch > '7') {
                            break;
                        }
                        digits++;
                    }
                    if (digits == 0) {
                        // a bare \0 followed by a non-octal character
                        --i;
                        newstr.append('\0');
                        break;
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt(
                                    oldstr.substring(i, i+digits), 8);
                    } catch (NumberFormatException nfe) {
                        die("invalid octal value for \\0 escape");
                    }
                    newstr.append(Character.toChars(value));
                    i += digits-1;
                    break;
                }
                case 'x':  {
                    // At least two more characters must follow the x.
                    if (i+2 >= oldstr.length()) {
                        die("string too short for \\x escape");
                    }
                    i++;
                    boolean saw_brace = false;
                    if (oldstr.charAt(i) == '{') {
                        // braced form: 1 to 8 hex digits
                        i++;
                        saw_brace = true;
                    }
                    int j;
                    for (j = 0; j < 8; j++) {

                        // the unbraced form takes exactly two hex digits
                        if (!saw_brace && j == 2) {
                            break;
                        }

                        // never read past the end of the input
                        if (i+j >= oldstr.length()) {
                            die("string too short for \\x escape");
                        }

                        int ch = oldstr.charAt(i+j);
                        if (ch > 127) {
                            die("illegal non-ASCII hex digit in \\x escape");
                        }

                        if (saw_brace && ch == '}') { break; }

                        if (! ( (ch >= '0' && ch <= '9')
                                    ||
                                (ch >= 'a' && ch <= 'f')
                                    ||
                                (ch >= 'A' && ch <= 'F')
                              )
                           )
                        {
                            die(String.format(
                                "illegal hex digit #%d '%c' in \\x", ch, ch));
                        }
                    }
                    if (j == 0) { die("empty braces in \\x{} escape"); }
                    if (saw_brace && (i+j >= oldstr.length()
                                        || oldstr.charAt(i+j) != '}')) {
                        die("missing closing brace in \\x{} escape");
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                    } catch (NumberFormatException nfe) {
                        die("invalid hex value for \\x escape");
                    }
                    newstr.append(Character.toChars(value));
                    if (saw_brace) { j++; } // also skip the closing brace
                    i += j-1;
                    break;
                }
                // lowercase u: exactly four hex digits must follow
                case 'u': {
                    if (i+4 >= oldstr.length()) {
                        die("string too short for \\u escape");
                    }
                    i++;
                    int j;
                    for (j = 0; j < 4; j++) {
                        if (oldstr.charAt(i+j) > 127) {
                            die("illegal non-ASCII hex digit in \\u escape");
                        }
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt( oldstr.substring(i, i+j), 16);
                    } catch (NumberFormatException nfe) {
                        die("invalid hex value for \\u escape");
                    }
                    newstr.append(Character.toChars(value));
                    i += j-1;
                    break;
                }

                // uppercase U: exactly eight hex digits must follow
                case 'U': {
                    if (i+8 >= oldstr.length()) {
                        die("string too short for \\U escape");
                    }
                    i++;
                    int j;
                    for (j = 0; j < 8; j++) {
                        if (oldstr.charAt(i+j) > 127) {
                            die("illegal non-ASCII hex digit in \\U escape");
                        }
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                    } catch (NumberFormatException nfe) {
                        die("invalid hex value for \\U escape");
                    }
                    newstr.append(Character.toChars(value));
                    i += j-1;
                    break;
                }

                // unknown escape: keep the backslash and the character
                default:   newstr.append('\\');
                           newstr.append(Character.toChars(cp));
                           break;

            }
            saw_backslash = false;
        }

        // a trailing lone backslash is kept
        if (saw_backslash) {
            newstr.append('\\');
        }

        return newstr.toString();
    }

    // Returns the code points of s in dotted U+ notation, e.g. U+41.42
    public final static
    String uniplus(String s) {
        if (s.length() == 0) {
            return "";
        }
        StringBuffer sb = new StringBuffer(2 + 3 * s.length());
        sb.append("U+");
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%X", s.codePointAt(i)));
            if (s.codePointAt(i) > Character.MAX_VALUE) {
                i++; // skip the second surrogate
            }
            if (i+1 < s.length()) {
                sb.append(".");
            }
        }
        return sb.toString();
    }

    private static final
    void die(String foa) {
        throw new IllegalArgumentException(foa);
    }

    private static final
    void say(String what) {
        System.out.println(what);
    }
If it helps others, you’re welcome to it — no strings attached. If you improve it, I’d love for you to mail me your enhancements, but you certainly don’t have to.
foo\\bar and it returned foo\\bar. I'd have expected it to be foo\bar. Is this a bug or am I misunderstanding the idea behind the method?
You can use the String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.
Here's an example snippet:
    String in = "a\\tb\\n\\\"c\\\"";

    System.out.println(in);
    // a\tb\n\"c\"

    String out = StringEscapeUtils.unescapeJava(in);

    System.out.println(out);
    // a    b
    // "c"
The utility class has methods to escape and unescape strings for Java, JavaScript, HTML, XML, and SQL. It also has overloads that write directly to a java.io.Writer.
Caveats
It looks like StringEscapeUtils handles Unicode escapes with one u, but not octal escapes, or Unicode escapes with extraneous us.
    /* Unicode escape test #1: PASS */
    System.out.println(
        "\u0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\u0030")
    ); // 0
    System.out.println(
        "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
    ); // true

    /* Octal escape test: FAIL */
    System.out.println(
        "\45"
    ); // %
    System.out.println(
        StringEscapeUtils.unescapeJava("\\45")
    ); // 45
    System.out.println(
        "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
    ); // false

    /* Unicode escape test #2: FAIL */
    System.out.println(
        "\uu0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\uu0030")
    ); // throws NestableRuntimeException:
       //   Unable to parse unicode value: u003
A quote from the JLS:
If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.
The extraneous u is also documented as follows:
If your string can contain Unicode escapes with extraneous u, then you may also need to preprocess this before using StringEscapeUtils.
Alternatively, you can try to write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.
0-255 instead (see quote). The largest octal escape is \377.
Came across a similar problem, wasn't satisfied with the presented solutions either, and implemented this one myself.
Also available as a Gist on GitHub:
    /**
     * Unescapes a string that contains standard Java escape sequences.
     * <ul>
     * <li><strong>&#92;b &#92;f &#92;n &#92;r &#92;t &#92;" &#92;'</strong> :
     * BS, FF, NL, CR, TAB, double and single quote.</li>
     * <li><strong>&#92;X &#92;XX &#92;XXX</strong> : Octal character
     * specification (0 - 377, 0x00 - 0xFF).</li>
     * <li><strong>&#92;uXXXX</strong> : Hexadecimal based Unicode character.</li>
     * </ul>
     *
     * @param st
     *            A string optionally containing standard java escape sequences.
     * @return The translated string.
     */
    public String unescapeJavaString(String st) {

        StringBuilder sb = new StringBuilder(st.length());

        for (int i = 0; i < st.length(); i++) {
            char ch = st.charAt(i);
            if (ch == '\\') {
                char nextChar = (i == st.length() - 1) ? '\\' : st
                        .charAt(i + 1);
                // Octal escape?
                if (nextChar >= '0' && nextChar <= '7') {
                    String code = "" + nextChar;
                    i++;
                    if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                            && st.charAt(i + 1) <= '7') {
                        code += st.charAt(i + 1);
                        i++;
                        if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                                && st.charAt(i + 1) <= '7') {
                            code += st.charAt(i + 1);
                            i++;
                        }
                    }
                    sb.append((char) Integer.parseInt(code, 8));
                    continue;
                }
                switch (nextChar) {
                case '\\':
                    ch = '\\';
                    break;
                case 'b':
                    ch = '\b';
                    break;
                case 'f':
                    ch = '\f';
                    break;
                case 'n':
                    ch = '\n';
                    break;
                case 'r':
                    ch = '\r';
                    break;
                case 't':
                    ch = '\t';
                    break;
                case '\"':
                    ch = '\"';
                    break;
                case '\'':
                    ch = '\'';
                    break;
                // Hex Unicode: u????
                case 'u':
                    if (i >= st.length() - 5) {
                        ch = 'u';
                        break;
                    }
                    int code = Integer.parseInt(
                            "" + st.charAt(i + 2) + st.charAt(i + 3)
                                    + st.charAt(i + 4) + st.charAt(i + 5), 16);
                    sb.append(Character.toChars(code));
                    i += 5;
                    continue;
                }
                i++;
            }
            sb.append(ch);
        }
        return sb.toString();
    }
See this from http://commons.apache.org/lang/:
StringEscapeUtils
StringEscapeUtils.unescapeJava(String str)
I know this question is old, but I wanted a solution that doesn't involve libraries outside those included in JRE 6 (i.e. Apache Commons is not acceptable), and I came up with a simple solution using the built-in java.io.StreamTokenizer:
    import java.io.*;

    // ...

    String literal = "\"Has \\\"\\\\\\\t\\\" & isn\\\'t \\\r\\\n on 1 line.\"";
    StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
    String result;
    try {
        parser.nextToken();
        if (parser.ttype == '"') {
            result = parser.sval;
        } else {
            result = "ERROR!";
        }
    } catch (IOException e) {
        result = e.toString();
    }
    System.out.println(result);
Output:
Has "\ " & isn't on 1 line.
I'm a little late on this, but I thought I'd provide my solution since I needed the same functionality. I decided to use the Java Compiler API, which makes it slower but makes the results accurate. Basically, I create a class on the fly, compile it, and then return the results. Here is the method:
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;
    import java.net.URI;
    import java.util.Collections;
    import javax.tools.JavaCompiler;
    import javax.tools.JavaCompiler.CompilationTask;
    import javax.tools.JavaFileManager;
    import javax.tools.JavaFileObject;
    import javax.tools.JavaFileObject.Kind;
    import javax.tools.SimpleJavaFileObject;
    import javax.tools.ToolProvider;

    // ... inside the enclosing class (referred to as StringUtils below) ...

    public static String[] unescapeJavaStrings(String... escaped) {
        //class name
        final String className = "Temp" + System.currentTimeMillis();
        //build the source
        final StringBuilder source = new StringBuilder(100 + escaped.length * 20).
                append("public class ").append(className).append("{\n").
                append("\tpublic static String[] getStrings() {\n").
                append("\t\treturn new String[] {\n");
        for (String string : escaped) {
            source.append("\t\t\t\"");
            //we escape non-escaped quotes here to be safe
            //  (but something like \\" will fail, oh well for now)
            for (int i = 0; i < string.length(); i++) {
                char chr = string.charAt(i);
                if (chr == '"' && i > 0 && string.charAt(i - 1) != '\\') {
                    source.append('\\');
                }
                source.append(chr);
            }
            source.append("\",\n");
        }
        source.append("\t\t};\n\t}\n}\n");
        //obtain compiler
        final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        //local stream for output
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        //local stream for error
        ByteArrayOutputStream err = new ByteArrayOutputStream();
        //source file
        JavaFileObject sourceFile = new SimpleJavaFileObject(
                URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors)
                    throws IOException {
                return source;
            }
        };
        //target file
        final JavaFileObject targetFile = new SimpleJavaFileObject(
                URI.create("string:///" + className + Kind.CLASS.extension), Kind.CLASS) {
            @Override
            public OutputStream openOutputStream() throws IOException {
                return out;
            }
        };
        //file manager proxy, with most parts delegated to the standard one
        JavaFileManager fileManagerProxy = (JavaFileManager) Proxy.newProxyInstance(
                StringUtils.class.getClassLoader(), new Class[] { JavaFileManager.class },
                new InvocationHandler() {
                    //standard file manager to delegate to
                    private final JavaFileManager standard =
                            compiler.getStandardFileManager(null, null, null);
                    @Override
                    public Object invoke(Object proxy, Method method, Object[] args)
                            throws Throwable {
                        if ("getJavaFileForOutput".equals(method.getName())) {
                            //return the target file when it's asking for output
                            return targetFile;
                        } else {
                            return method.invoke(standard, args);
                        }
                    }
                });
        //create the task
        CompilationTask task = compiler.getTask(new OutputStreamWriter(err),
                fileManagerProxy, null, null, null, Collections.singleton(sourceFile));
        //call it
        if (!task.call()) {
            throw new RuntimeException("Compilation failed, output:\n" +
                    new String(err.toByteArray()));
        }
        //get the result
        final byte[] bytes = out.toByteArray();
        //load class
        Class<?> clazz;
        try {
            //custom class loader for garbage collection
            clazz = new ClassLoader() {
                protected Class<?> findClass(String name) throws ClassNotFoundException {
                    if (name.equals(className)) {
                        return defineClass(className, bytes, 0, bytes.length);
                    } else {
                        return super.findClass(name);
                    }
                }
            }.loadClass(className);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        //reflectively call method
        try {
            return (String[]) clazz.getDeclaredMethod("getStrings").invoke(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
It takes an array so you can unescape in batches. So the following simple test succeeds:
    public static void main(String[] meh) {
        if ("1\02\03\n".equals(unescapeJavaStrings("1\\02\\03\\n")[0])) {
            System.out.println("Success");
        } else {
            System.out.println("Failure");
        }
    }
For the record, if you use Scala, you can do:
I came across the same problem, but I wasn't enamoured by any of the solutions I found here. So, I wrote one that iterates over the characters of the string using a matcher to find and replace the escape sequences. This solution assumes properly formatted input. That is, it happily skips over nonsensical escapes, and it decodes Unicode escapes for line feed and carriage return (which otherwise cannot appear in a character literal or a string literal, due to the definition of such literals and the order of translation phases for Java source). Apologies, the code is a bit packed for brevity.
    import java.util.Arrays;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Decoder {

        // The encoded character of each character escape.
        // This array functions as the keys of a sorted map, from encoded characters to decoded characters.
        static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\',  'b',  'f',  'n',  'r',  't' };

        // The decoded character of each character escape.
        // This array functions as the values of a sorted map, from encoded characters to decoded characters.
        static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };

        // A pattern that matches an escape.
        // What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
        static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");

        public static CharSequence decodeString(CharSequence encodedString) {
            Matcher matcher = PATTERN.matcher(encodedString);
            StringBuffer decodedString = new StringBuffer();
            // Find each escape of the encoded string in succession.
            while (matcher.find()) {
                char ch;
                if (matcher.start(1) >= 0) {
                    // Decode a character escape.
                    ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
                } else if (matcher.start(2) >= 0) {
                    // Decode an octal escape.
                    ch = (char)(Integer.parseInt(matcher.group(2), 8));
                } else /* if (matcher.start(3) >= 0) */ {
                    // Decode a Unicode escape.
                    ch = (char)(Integer.parseInt(matcher.group(3), 16));
                }
                // Replace the escape with the decoded character.
                matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
            }
            // Append the remainder of the encoded string to the decoded string.
            // The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
            matcher.appendTail(decodedString);
            return decodedString;
        }

        public static void main(String... args) {
            System.out.println(decodeString(args[0]));
        }
    }
I should note that Apache Commons Lang3 doesn't seem to suffer the weaknesses indicated in the accepted solution. That is, StringEscapeUtils seems to handle octal escapes and multiple u characters of Unicode escapes. That means unless you have some burning reason to avoid Apache Commons, you should probably use it rather than my solution (or any other solution here).
org.apache.commons.lang3.StringEscapeUtils from commons-lang3 is marked deprecated now. You can use org.apache.commons.text.StringEscapeUtils#unescapeJava(String) instead. It requires an additional Maven dependency:
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>1.4</version>
    </dependency>
and seems to handle some more special cases; for example, it unescapes \\b, \\n, \\t, \\f, \\r.
Java 13 added a method which does this: String#translateEscapes.
It was a preview feature in Java 13 and 14, but was promoted to a full feature in Java 15.
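A minimal usage sketch, assuming Java 15 or later (the wrapper class TranslateEscapesDemo and its unescape method are only there for illustration). As far as I can tell from its Javadoc, the method handles the string-literal escapes, including octal ones, but does not translate Unicode escapes, so those would still need separate handling.

```java
final class TranslateEscapesDemo {
    // Thin wrapper around the JDK method, for illustration only.
    static String unescape(String raw) {
        return raw.translateEscapes();
    }

    public static void main(String[] args) {
        // The raw text contains backslash sequences, not control characters.
        String raw = "a\\tb\\n\\\"c\\\" \\45";
        System.out.println(raw);            // the escaped form, on one line
        System.out.println(unescape(raw));  // tab, line break, quotes and % decoded
    }
}
```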
If you are reading Unicode-escaped chars from a file, then you will have a tough time doing that, because the string will be read literally, with the backslash kept as an ordinary character:
my_file.txt
    Blah blah...
    Column delimiter=;
    Word delimiter=\u0020 #This is just unicode for whitespace
    .. more stuff
Here, when you read line 3 from the file the string/line will have:
    "Word delimiter=\u0020 #This is just unicode for whitespace"
and the char[] in the string will show:
    {...., '=', '\\', 'u', '0', '0', '2', '0', ' ', '#', 't', 'h', ...}
Commons StringUnescape will not unescape this for you (I tried unescapeXml()). You'll have to do it manually as described here.
So, the sub-string "\u0020" should become 1 single char '\u0020'.
But if you are using this "\u0020" to do
    String.split("... ..... ..", columnDelimiterReadFromFile)
which really uses a regex internally, it will work directly, because the string read from the file is still escaped and is perfect to use in the regex pattern!! (Confused?)
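A small sketch of both halves of that point; the class DelimiterFromFile and the helper decodeUnicodeEscape are hypothetical names, and the delimiter is hard-coded here instead of being read from my_file.txt. The six-character text works unchanged as a regex for split, while turning it into one char takes an explicit decoding step.

```java
import java.util.Arrays;

final class DelimiterFromFile {
    // Hypothetical helper: turns the six characters backslash, u and four
    // hex digits into the single char they denote.
    static char decodeUnicodeEscape(String escape) {
        if (escape.length() != 6 || !escape.startsWith("\\u")) {
            throw new IllegalArgumentException("not a Unicode escape: " + escape);
        }
        return (char) Integer.parseInt(escape.substring(2), 16);
    }

    public static void main(String[] args) {
        // What ends up in memory after reading the delimiter from the file.
        String delimiter = "\\u0020";
        System.out.println(delimiter.length());                         // 6

        // The regex engine resolves the escape itself, so split just works.
        System.out.println(Arrays.toString("a b c".split(delimiter)));  // [a, b, c]

        // Manual decoding yields the single space character.
        System.out.println(decodeUnicodeEscape(delimiter) == ' ');      // true
    }
}
```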