Remove non-ASCII non-printable characters from a String

I get user input including non-ASCII characters and non-printable characters, such as

\xc2d

\xa0

\xe7

\xc3\ufffdd

\xc3\ufffdd

\xc2\xa0

\xc3\xa7

\xa0\xa0

for example:

email : abc@gmail.com\xa0\xa0

street : 123 Main St.\xc2\xa0

desired output:

  email : abc@gmail.com

  street : 123 Main St.

What is the best way to remove them using Java?

I tried the following, but it doesn't seem to work:

public static void main(String[] args) throws UnsupportedEncodingException {
    String s = "abc@gmail\\xe9.com";
    String email = "abc@gmail.com\\xa0\\xa0";

    System.out.println(s.replaceAll("\\P{Print}", ""));
    System.out.println(email.replaceAll("\\P{Print}", ""));
}

Output

abc@gmail\xe9.com

abc@gmail.com\xa0\xa0

edited Nov 26 '18 at 11:06

Raedwald


asked Jun 13 '12 at 18:14

daydreamer


why do you want to remove them?

– jtahlborn
Jun 13 '12 at 18:17

1

@jtahlborn, Mongo fails to serialize these values

– daydreamer
Jun 13 '12 at 18:26

@daydreamer [citation needed] \xc2d is a valid Unicode character. If MongoDB uses UTF-8, it should be able to serialize them. Perhaps you have an XY problem here? How are you serializing your text?

– Raedwald
Nov 26 '18 at 11:11

java non-ascii-characters


6 Answers

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is there because \ starts an escape sequence in string literals.)
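As a quick check of the expression above, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the input holds real non-ASCII characters (written here with Unicode escapes) rather than literal backslash sequences.

```java
public class PrintableAsciiDemo {

    // Hypothetical helper: drops every character outside the POSIX Print class,
    // i.e. everything that is not printable ASCII (the plain space is kept).
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // U+00E9 is e-acute and U+00A0 is a no-break space; both are real characters here.
        System.out.println(stripNonPrintableAscii("abc@gmail\u00e9.com"));       // abc@gmail.com
        System.out.println(stripNonPrintableAscii("abc@gmail.com\u00a0\u00a0")); // abc@gmail.com
    }
}
```

Note that the accented letter is dropped outright rather than replaced with a plain "e".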

Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
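One way to find out which of the two situations you are in is to dump the code points of the string you actually receive. This is only a sketch; the class name and the describe helper are hypothetical and not part of any library.

```java
public class CodePointDump {

    // Hypothetical helper: lists each code point as U+XXXX, so a real
    // no-break space (U+00A0) can be told apart from the four ASCII
    // characters backslash, x, a, 0.
    static String describe(String input) {
        StringBuilder sb = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // A real no-break space after the letter a.
        System.out.println(describe("a\u00a0"));  // U+0061 U+00A0
        // The letter a followed by the literal text backslash, x, a, 0.
        System.out.println(describe("a\\xa0"));   // U+0061 U+005C U+0078 U+0061 U+0030
    }
}
```

If the dump shows U+005C followed by U+0078, the escapes are literal text and \P{Print} will leave them alone, because every one of those characters is printable ASCII.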

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
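For the literal-escape case, a small sketch run against the sample values from the question might look like this. The class and method names are illustrative only, and it assumes each escape is exactly a backslash, the letter x and two hex digits.

```java
public class LiteralEscapeDemo {

    // Hypothetical helper: removes the text sequence backslash, x, two hex digits.
    static String stripLiteralHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // "\\x" in Java source is a backslash followed by x, not a special escape.
        System.out.println(stripLiteralHexEscapes("abc@gmail.com\\xa0\\xa0")); // abc@gmail.com
        System.out.println(stripLiteralHexEscapes("123 Main St.\\xc2\\xa0"));  // 123 Main St.
    }
}
```

Sequences in other forms, such as a literal \ufffd, would need their own pattern.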

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

edited Jun 7 '16 at 16:23

answered Jun 13 '12 at 18:39

erickson


This doesn't work. Maybe I am doing something incorrect, but it's not working.

– daydreamer
Jun 18 '12 at 18:15

1

@daydreamer Can you provide an SSCCE that shows what is not working?

– erickson
Jun 18 '12 at 18:19

public static void main(String[] args) throws UnsupportedEncodingException { String s = "abc@gmail\\xe9.com"; String email = "abc@gmail.com\\xa0\\xa0"; System.out.println(s.replaceAll("\\P{Print}", "")); System.out.println(email.replaceAll("\\P{Print}", "")); } Output: abc@gmail\xe9.com abc@gmail.com\xa0\xa0

– daydreamer
Jun 18 '12 at 18:21

@daydreamer \\x doesn't mean anything special in Java source code. \\ in a String or char literal is an escape sequence that is replaced with \. If you want a Unicode escape, use \uXXXX, where XXXX is the Unicode code point, in hexadecimal.

– erickson
Jun 18 '12 at 18:25

@daydreamer E.g. String s = "abc@gmail\u00e9.com";

– erickson
Jun 18 '12 at 18:27


With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:

String printable = CharMatcher.INVISIBLE.removeFrom(input);

String clean = CharMatcher.ASCII.retainFrom(printable);

Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.

edited Jun 13 '12 at 19:03

answered Jun 13 '12 at 18:47

Philipp Reichart


4

Note: INVISIBLE removes whitespace, which I find odd since whitespace is indeed "printable".

– Andrew White
Aug 26 '14 at 13:47


I know it's maybe late but for future reference:

String clean = str.replaceAll("\\P{Print}", "");

This removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.

For that problem use inverted logic:

String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
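To see what the inverted class keeps and what it drops, here is a minimal sketch. The names are hypothetical, and the sample string is made up: it mixes a tab, a bell character (U+0007), a line feed and a no-break space (U+00A0).

```java
public class KeepLineBreaksDemo {

    // Hypothetical helper: keeps printable ASCII plus line feed, carriage return and tab.
    static String stripButKeepLineBreaks(String input) {
        return input.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        // The bell and the no-break space are dropped; the tab and the line feed survive.
        String cleaned = stripButKeepLineBreaks("a\tb\u0007c\n\u00a0");
        System.out.println(cleaned.equals("a\tbc\n")); // true
    }
}
```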

answered Jul 15 '15 at 7:33

Ivan Pavić


Upvoted for its particular usefulness in mongo-land, to keep the shell from spewing ridiculous amounts of encoded non-ASCII stuff (mongo really, really prefers UTF-8 if you want things to be easy).

– Mark Mullin
Feb 1 '16 at 2:15

2

Got error: illegal escape character String clean = str.replaceAll("[^\n\r\t\p{Print}]", ""); . \p should be \P

– Well Smith
Apr 17 '18 at 15:35

Really helped me a lot Thanks @Ivan

– Prinkal Kumar
Jun 12 '18 at 5:01


You can try this code:

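public String cleanInvalidCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return "";
    }
    StringBuilder out = new StringBuilder();
    // Walk the string by code point rather than by char: a char never exceeds
    // 0xFFFF, so the supplementary-range test below could not match otherwise.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9)                                  // tab
                || (current == 0xA)                           // line feed
                || (current == 0xD)                           // carriage return
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
    }
    // Replace each remaining whitespace character (tab, line break) with a plain space.
    return out.toString().replaceAll("\\s", " ");
}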

It works for me to remove invalid characters from String.

answered Jun 13 '12 at 18:17

Paulius Matulionis


3

That's a lot of magic numbers. How about extracting these clauses (especially the ranges) into aptly named local variables?

– Philipp Reichart
Jun 13 '12 at 18:48


You can use java.text.Normalizer.
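The answer gives no code, so the following is only one possible reading of it: decompose accented letters with Normalizer and then drop whatever is still outside ASCII. The class and method names are assumptions for illustration.

```java
import java.text.Normalizer;

public class NormalizerDemo {

    // Hypothetical helper: NFD splits an accented letter into its base letter
    // plus a combining mark; the replaceAll then removes every non-ASCII
    // character, which takes the detached marks with it.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e-acute) becomes a plain e; U+00A0 (no-break space) is removed.
        System.out.println(toAscii("abc@gmail\u00e9.com\u00a0")); // abc@gmaile.com
    }
}
```

Unlike the plain \P{Print} replacement, this keeps the base letter of an accented character. It does not remove ASCII control characters, so it may need to be combined with one of the expressions from the other answers.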

answered Jun 13 '12 at 18:17

exception


Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"

If you are trying to remove literal \uXXXX escape sequences from a string like the one above, this code will work:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");

Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);

// Fall back to the original text when there is nothing to remove.
String cleanData = data;

if (unicodeMatcher.find()) {

    cleanData = unicodeMatcher.replaceAll("");

}

answered May 10 '17 at 15:04

Sivaram Kandappan
