Remove non-ASCII non-printable characters from a String
I get user input that includes non-ASCII and non-printable characters, such as
\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
For example:
email : abc@gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0
Desired output:
email : abc@gmail.com
street : 123 Main St.
What is the best way to remove them using Java?
I tried the following, but it doesn't seem to work:
public static void main(String[] args) throws UnsupportedEncodingException {
    String s = "abc@gmail\\xe9.com";
    String email = "abc@gmail.com\\xa0\\xa0";
    System.out.println(s.replaceAll("\\P{Print}", ""));
    System.out.println(email.replaceAll("\\P{Print}", ""));
}
Output:
abc@gmail\xe9.com
abc@gmail.com\xa0\xa0
java non-ascii-characters
edited Nov 26 '18 at 11:06
Raedwald
asked Jun 13 '12 at 18:14
daydreamer
why do you want to remove them?
– jtahlborn
Jun 13 '12 at 18:17
1
@jtahlborn, Mongo fails to serialize these values
– daydreamer
Jun 13 '12 at 18:26
@daydreamer [citation needed] \xc2d is a valid Unicode character. If MongoDB uses UTF-8 it should be able to serialize them. Perhaps you have an XY Problem here? How are you serializing your text?
– Raedwald
Nov 26 '18 at 11:11
6 Answers
You can use java.text.Normalizer.
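The answer doesn't say how the class would be used. One common approach (my assumption, not something stated above) is to decompose accented letters with NFD, drop the combining marks, and then strip whatever is still not printable ASCII. A rough sketch, with a made-up class and method name:

```java
import java.text.Normalizer;

final class NormalizerDemo {
    // Hypothetical helper: folds accented letters to their base letter,
    // then removes anything that is still outside printable ASCII.
    static String toPlainAscii(String input) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks; \P{Print} matches everything
        // that is not printable ASCII (for example a no-break space).
        return decomposed.replaceAll("\\p{M}", "").replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        System.out.println(toPlainAscii("caf\u00e9"));
        System.out.println(toPlainAscii("abc@gmail.com\u00a0\u00a0"));
    }
}
```

Unlike a plain regex strip, this keeps the base letter of an accented character instead of deleting it, which may or may not be what you want for email addresses.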
Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"
If you are trying to remove literal Unicode escape sequences from a string like the one above, this code will work:
// Matches a literal backslash, the letter u, and four hex digits.
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
// replaceAll returns the input unchanged when nothing matches, so no find() check is needed.
String cleanData = unicodeMatcher.replaceAll("");
Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.
String clean = str.replaceAll("\\P{Print}", "");
Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
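To see the expression above do its job, the test strings need to hold the actual characters rather than the text of an escape. Here is a minimal sketch, assuming the input really contains U+00E9 and U+00A0; the class and method names are made up for illustration.

```java
final class PrintableAsciiDemo {
    // Removes every character that is not printable ASCII (0x20 to 0x7E).
    static String stripNonPrintable(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // These literals contain a real e-acute and two real no-break spaces.
        System.out.println(stripNonPrintable("abc@gmail\u00e9.com"));
        System.out.println(stripNonPrintable("abc@gmail.com\u00a0\u00a0"));
    }
}
```

Both lines print abc@gmail.com, because the accented letter and the no-break spaces fall outside the printable ASCII range.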
Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.
This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.
String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
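For the other case, where the data really contains the four-character text of an escape (a backslash, an x, and two hex digits), a sketch along these lines would strip it. The names are hypothetical and the sample value is the street line from the question.

```java
final class LiteralHexEscapeDemo {
    // Removes the literal text of escapes such as \xc2 or \xa0.
    // In the regex, \\\\ is one literal backslash and \\p{XDigit} is a hex digit.
    static String stripLiteralHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // The Java literal "\\xc2" is the four characters backslash, x, c, 2.
        String street = "123 Main St.\\xc2\\xa0";
        System.out.println(stripLiteralHexEscapes(street));
    }
}
```

This prints 123 Main St. and leaves ordinary text alone, since only a backslash followed by x and two hex digits matches.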
The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.
edited Jun 7 '16 at 16:23
answered Jun 13 '12 at 18:39
erickson
this doesn't work. may be I am doing something incorrect, but not working
– daydreamer
Jun 18 '12 at 18:15
1
@daydreamer Can you provide an SSCCE that shows what is not working?
– erickson
Jun 18 '12 at 18:19
public static void main(String[] args) throws UnsupportedEncodingException { String s = "abc@gmail\\xe9.com"; String email = "abc@gmail.com\\xa0\\xa0"; System.out.println(s.replaceAll("\\P{Print}", "")); System.out.println(email.replaceAll("\\P{Print}", "")); } output: abc@gmail\xe9.com abc@gmail.com\xa0\xa0
– daydreamer
Jun 18 '12 at 18:21
@daydreamer \x doesn't mean anything special in Java source code. \\ in a String or char literal is an escape sequence that is replaced with \. If you want a Unicode escape, use \uXXXX, where XXXX is the Unicode code point, in hexadecimal.
– erickson
Jun 18 '12 at 18:25
@daydreamer E.g. String s = "abc@gmail\u00e9.com";
– erickson
Jun 18 '12 at 18:27
With Google Guava's CharMatcher, you can remove any non-printable characters and then retain only ASCII characters (which drops accented characters entirely) like this:
String printable = CharMatcher.INVISIBLE.removeFrom(input);
String clean = CharMatcher.ASCII.retainFrom(printable);
Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.
edited Jun 13 '12 at 19:03
answered Jun 13 '12 at 18:47
Philipp Reichart
4
note, INVISIBLE removed whitespace which I find odd since it is indeed "printable"
– Andrew White
Aug 26 '14 at 13:47
I know it's maybe late, but for future reference:
String clean = str.replaceAll("\\P{Print}", "");
removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.
For that problem, use inverted logic:
String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
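A small sketch of the inverted expression in use, to show which control characters survive. The class and method names and the sample string are made up for illustration.

```java
final class KeepLineBreaksDemo {
    // Removes everything except printable ASCII, line feed, carriage return and tab.
    static String stripButKeepLineBreaks(String input) {
        return input.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        // Contains a tab, a BEL control character, a line feed and a no-break space.
        String raw = "a\tb" + (char) 7 + "\nc\u00a0";
        String clean = stripButKeepLineBreaks(raw);
        // The tab and line feed are kept; BEL and the no-break space are removed.
        System.out.println(clean.equals("a\tb\nc"));
    }
}
```

The same input passed through the plain \P{Print} version would lose the tab and the line feed as well.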
answered Jul 15 '15 at 7:33
Ivan Pavić
Upvoted for its particular usefulness in mongo-land, to keep the shell from spewing ridiculous amounts of encoded non-ASCII stuff (mongo really really prefers UTF-8 if you want things to be easy)
– Mark Mullin
Feb 1 '16 at 2:15
2
Got error: illegal escape character String clean = str.replaceAll("[^\n\r\t\p{Print}]", ""); . p should be P
– Well Smith
Apr 17 '18 at 15:35
Really helped me a lot Thanks @Ivan
– Prinkal Kumar
Jun 12 '18 at 5:01
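You can try this code:
public String cleanInvalidCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return "";
    }
    StringBuilder out = new StringBuilder();
    // Walk the string by code point rather than by char, so that characters
    // above U+FFFF (stored as surrogate pairs) can reach the last range check.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        // These ranges match the XML 1.0 "Char" production.
        if ((current == 0x9)        // tab
                || (current == 0xA) // line feed
                || (current == 0xD) // carriage return
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
    }
    // Turn the remaining tabs and line breaks into plain spaces.
    return out.toString().replaceAll("\\s", " ");
}
It works for me to remove invalid characters from a String. Note that it keeps printable non-ASCII characters such as é and the no-break space (U+00A0); it only drops control characters, unpaired surrogates and the code points U+FFFE and U+FFFF.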
answered Jun 13 '12 at 18:17
Paulius Matulionis
3
That's a lot of magic numbers. How about extracting these clauses (especially the ranges) into aptly named local variables?
– Philipp Reichart
Jun 13 '12 at 18:48
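One way to act on this comment, sketched under assumptions: give each hex range from the answer above a named predicate. The class and method names below (`CharRangeFilter`, `isAllowedControl`, and so on) are invented for illustration, and the sketch leaves out the answer's final whitespace replacement.

```java
class CharRangeFilter {
    // Tab, line feed and carriage return are the only control characters kept.
    static boolean isAllowedControl(int cp) {
        return cp == '\t' || cp == '\n' || cp == '\r';
    }

    // U+0020 up to the last code point before the surrogate block.
    static boolean isBelowSurrogates(int cp) {
        return cp >= 0x20 && cp <= 0xD7FF;
    }

    // Private-use area and the rest of the BMP, minus U+FFFE and U+FFFF.
    static boolean isAboveSurrogates(int cp) {
        return cp >= 0xE000 && cp <= 0xFFFD;
    }

    // Everything beyond the Basic Multilingual Plane.
    static boolean isSupplementary(int cp) {
        return cp >= 0x10000 && cp <= 0x10FFFF;
    }

    static boolean isAllowed(int cp) {
        return isAllowedControl(cp) || isBelowSurrogates(cp)
                || isAboveSurrogates(cp) || isSupplementary(cp);
    }

    static String clean(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        // codePoints() pairs up surrogates; a lone surrogate is reported as itself
        // and then rejected by isAllowed.
        in.codePoints().filter(CharRangeFilter::isAllowed).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("abc\u0000def\u0007")); // prints abcdef
    }
}
```

Using `String.codePoints()` requires Java 8 or later.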
You can use java.text.Normalizer.
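One possible reading of this suggestion, as a minimal sketch rather than a definitive implementation: decompose the input with NFKD, which turns the no-break space U+00A0 into an ordinary space and splits `é` into `e` plus a combining accent, then strip whatever is still outside printable ASCII. The class name `AsciiCleaner` and the method `toPrintableAscii` are hypothetical, and the final `trim()` is an assumption based on the desired output in the question.

```java
import java.text.Normalizer;

class AsciiCleaner {
    // Hypothetical helper, not part of the JDK.
    static String toPrintableAscii(String input) {
        if (input == null) {
            return "";
        }
        // NFKD: compatibility decomposition (U+00A0 becomes U+0020, accents are split off).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // By default \p{Print} is printable ASCII only, so \P{Print} matches
        // the combining marks and any other leftover non-ASCII characters.
        return decomposed.replaceAll("\\P{Print}", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPrintableAscii("abc@gmail.com\u00a0\u00a0")); // prints abc@gmail.com
        System.out.println(toPrintableAscii("123 Main St.\u00a0"));        // prints 123 Main St.
    }
}
```

Unlike a plain `replaceAll("\\P{Print}", "")`, this keeps the base letter of accented characters, so `abc@gmailé.com` becomes `abc@gmaile.com` rather than `abc@gmail.com`. Whether that is desirable depends on the data.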
answered Jun 13 '12 at 18:17
exception
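Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"
If you are trying to remove literal Unicode escape sequences (a backslash, the letter u, and four hex digits) from a string like the one above, this code will work:
// Needs java.util.regex.Pattern and java.util.regex.Matcher.
// The regex matches a literal backslash, 'u', and four hex digits.
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
// replaceAll returns the input unchanged when nothing matches,
// so there is no need to call find() first (and no risk of a null result).
String cleanData = unicodeMatcher.replaceAll("");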
answered May 10 '17 at 15:04
Sivaram Kandappan