Remove non-ASCII non-printable characters from a String












14















I get user input including non-ASCII characters and non-printable characters, such as



\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0


For example:

email : abc@gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0

Desired output:

email : abc@gmail.com
street : 123 Main St.


What is the best way to remove them using Java?

I tried the following, but it doesn't seem to work:

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "abc@gmail\\xe9.com";
        String email = "abc@gmail.com\\xa0\\xa0";

        System.out.println(s.replaceAll("\\P{Print}", ""));
        System.out.println(email.replaceAll("\\P{Print}", ""));
    }


Output



    abc@gmail\xe9.com
    abc@gmail.com\xa0\xa0


































  • why do you want to remove them?

    – jtahlborn
    Jun 13 '12 at 18:17






  • 1





    @jtahlborn, Mongo fails to serialize these values

    – daydreamer
    Jun 13 '12 at 18:26











  • @daydreamer [citation needed] \xc2d is a valid Unicode character. If MongoDB uses UTF-8, it should be able to serialize them. Perhaps you have an XY Problem here? How are you serializing your text?

    – Raedwald
    Nov 26 '18 at 11:11






















java non-ascii-characters














edited Nov 26 '18 at 11:06 by Raedwald
asked Jun 13 '12 at 18:14 by daydreamer

























6 Answers


















34














Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.



    String clean = str.replaceAll("\\P{Print}", "");


Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
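A minimal sketch of the expression in use, assuming the input really contains the characters U+00E9 and U+00A0 rather than the literal text \xe9 and \xa0. The names PrintableAsciiDemo and stripNonPrintableAscii are made up for illustration.

```java
final class PrintableAsciiDemo {

    // Removes every character outside the POSIX "Print" class. With the default
    // java.util.regex flags that class covers printable US-ASCII only.
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // Real non-ASCII characters: e-acute (U+00E9) and no-break space (U+00A0).
        String s = "abc@gmail\u00e9.com";
        String email = "abc@gmail.com\u00a0\u00a0";

        System.out.println(stripNonPrintableAscii(s));      // abc@gmail.com
        System.out.println(stripNonPrintableAscii(email));  // abc@gmail.com
    }
}
```

Note that the accented letter is dropped entirely here, not converted to a plain "e".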





Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.



This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.



That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

    String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
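For the other reading of the question, where the strings hold the visible four-character sequences themselves, here is a small sketch of that second expression. LiteralEscapeDemo and stripLiteralHexEscapes are hypothetical names, and the inputs are the samples from the question.

```java
final class LiteralEscapeDemo {

    // The regex matches a literal backslash, the letter x, then exactly two hex digits.
    static String stripLiteralHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // These strings contain the visible characters backslash, x, a, 0 (not U+00A0).
        String email = "abc@gmail.com\\xa0\\xa0";
        String street = "123 Main St.\\xc2\\xa0";

        System.out.println(stripLiteralHexEscapes(email));   // abc@gmail.com
        System.out.println(stripLiteralHexEscapes(street));  // 123 Main St.
    }
}
```

This only handles the two-digit \xHH form; anything else in the input is left as it is.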




The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.
































  • This doesn't work. Maybe I am doing something incorrectly, but it is not working.

    – daydreamer
    Jun 18 '12 at 18:15






  • 1





    @daydreamer Can you provide an SSCCE that shows what is not working?

    – erickson
    Jun 18 '12 at 18:19











  • public static void main(String[] args) throws UnsupportedEncodingException { String s = "abc@gmail\\xe9.com"; String email = "abc@gmail.com\\xa0\\xa0"; System.out.println(s.replaceAll("\\P{Print}", "")); System.out.println(email.replaceAll("\\P{Print}", "")); } Output: abc@gmail\xe9.com abc@gmail.com\xa0\xa0

    – daydreamer
    Jun 18 '12 at 18:21











  • @daydreamer \x doesn't mean anything special in Java source code. \\ in a String or char literal is an escape sequence that is replaced with \. If you want a Unicode escape, use \uXXXX, where XXXX is the Unicode code point, in hexadecimal.

    – erickson
    Jun 18 '12 at 18:25













  • @daydreamer E.g. String s = "abc@gmail\u00e9.com";

    – erickson
    Jun 18 '12 at 18:27



















14














With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:



String printable = CharMatcher.INVISIBLE.removeFrom(input);
String clean = CharMatcher.ASCII.retainFrom(printable);


Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.



























  • 4





    note, INVISIBLE removed whitespace which I find odd since it is indeed "printable"

    – Andrew White
    Aug 26 '14 at 13:47



















10














I know it's maybe late but for future reference:



    String clean = str.replaceAll("\\P{Print}", "");


This removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.



For that problem, use inverted logic:



    String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
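A quick sketch of the inverted expression on a multi-line input, to show which characters survive. KeepLineBreaksDemo and clean are illustrative names, and the sample input is invented.

```java
final class KeepLineBreaksDemo {

    // Negated character class: drop anything that is not LF, CR, TAB or printable ASCII.
    static String clean(String input) {
        return input.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        // Contains a BEL control character (U+0007) and a no-break space (U+00A0).
        String input = "line one\tcol\u0007\r\nline\u00a0two\n";

        // The tab, CR and LF are kept; BEL and the no-break space are removed.
        System.out.println(clean(input).equals("line one\tcol\r\nlinetwo\n")); // true
    }
}
```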





























  • Upvoted for its particular usefulness in mongo-land, to keep the shell from spewing ridiculous amounts of encoded non-ASCII stuff (mongo really, really prefers UTF-8 if you want things to be easy)

    – Mark Mullin
    Feb 1 '16 at 2:15






  • 2





    Got error: illegal escape character String clean = str.replaceAll("[^\n\r\t\p{Print}]", ""); . p should be P

    – Well Smith
    Apr 17 '18 at 15:35













  • Really helped me a lot Thanks @Ivan

    – Prinkal Kumar
    Jun 12 '18 at 5:01



















3














You can try this code:



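    public String cleanInvalidCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char current = in.charAt(i);
            // Keep tab, line feed, carriage return and the ranges below; drop everything else.
            if ((current == 0x9)
                    || (current == 0xA)
                    || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    // Never true for a 16-bit char: supplementary characters arrive as
                    // surrogate pairs (0xD800-0xDFFF), which the ranges above reject.
                    || ((current >= 0x10000) && (current <= 0x10FFFF))) {
                out.append(current);
            }
        }
        // Replace each remaining whitespace character (tab, newline, ...) with a plain space.
        return out.toString().replaceAll("\\s", " ");
    }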


It works for me for removing invalid characters from a String.
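One caveat with the loop above: a Java char is 16 bits, so the 0x10000 to 0x10FFFF test can never match, and characters outside the BMP (stored as surrogate pairs) end up being dropped. Below is a possible variant that walks code points instead. XmlCharFilter, isAllowed and clean are names I made up, the whitespace replacement step is left out, and it needs Java 8 or later. As far as I can tell, the ranges are the ones the XML 1.0 specification allows for Char.

```java
final class XmlCharFilter {

    // Same ranges as in the loop above, tested against a full code point.
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String clean(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        // codePoints() joins valid surrogate pairs into one int; a lone surrogate
        // is reported as itself and falls outside every allowed range.
        in.codePoints()
          .filter(XmlCharFilter::isAllowed)
          .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL, a valid surrogate pair (U+1F600) and a lone high surrogate.
        String input = "ok\u0000 \uD83D\uDE00 \uD800!";
        // NUL and the lone surrogate are removed; the surrogate pair survives.
        System.out.println(clean(input).equals("ok \uD83D\uDE00 !")); // true
    }
}
```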

























  • 3





    That's a lot of magic numbers. How about extracting these clauses (especially the ranges) into aptly named local variables?

    – Philipp Reichart
    Jun 13 '12 at 18:48





















1














You can use java.text.Normalizer.
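The answer does not say how to apply it, so the following is only a guess at what was meant: decompose the text with Normalizer first, then drop whatever is still not printable ASCII. The choice of NFKD is mine, not the answer's, and NormalizerDemo and toPrintableAscii are hypothetical names.

```java
import java.text.Normalizer;

final class NormalizerDemo {

    static String toPrintableAscii(String input) {
        // NFKD splits accented letters into base letter plus combining mark and
        // maps compatibility characters such as U+00A0 to their plain equivalents.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // Whatever is still outside printable ASCII (e.g. the combining marks) is removed.
        return decomposed.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        System.out.println(toPrintableAscii("abc@gmail\u00e9.com")); // abc@gmaile.com
    }
}
```

With this approach the accented letter becomes a plain "e" instead of disappearing, and a no-break space becomes a regular space, so a trailing one may still need trim().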





































    0














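Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"

If you are trying to remove literal Unicode escapes from a string like the one above, this code will work:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Matches a literal backslash, the letter u, then four hex digits.
    Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
    Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
    // replaceAll returns the input unchanged when nothing matches, so no find() check is needed.
    String cleanData = unicodeMatcher.replaceAll("");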






























































Answer with 34 votes: answered Jun 13 '12 at 18:39 by erickson, edited Jun 7 '16 at 16:23





















Answer with 14 votes: answered Jun 13 '12 at 18:47 by Philipp Reichart, edited Jun 13 '12 at 19:03








      • 4





        note, INVISIBLE removed whitespace which I find odd since it is indeed "printable"

        – Andrew White
        Aug 26 '14 at 13:47














      • 4





        note, INVISIBLE removed whitespace which I find odd since it is indeed "printable"

        – Andrew White
        Aug 26 '14 at 13:47








      4




      4





      note, INVISIBLE removed whitespace which I find odd since it is indeed "printable"

      – Andrew White
      Aug 26 '14 at 13:47





      note, INVISIBLE removed whitespace which I find odd since it is indeed "printable"

      – Andrew White
      Aug 26 '14 at 13:47











      10














      I know it's maybe late, but for future reference:



      String clean = str.replaceAll("\\P{Print}", "");


      This removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.



      For that problem, use inverted logic:



      String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
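
      Here is a small, self-contained sketch of both patterns, with the backslashes doubled as Java string literals require. The class and method names are only illustrative. Note that without the UNICODE_CHARACTER_CLASS flag, Java's \p{Print} covers printable ASCII only, so accented letters and the no-break space are removed as well.

```java
final class PrintableRegex {
    // Removes everything outside printable ASCII, including tab, CR and LF.
    static String stripNonPrintable(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    // Same idea with inverted logic: keep printable ASCII plus LF, CR and tab.
    static String stripNonPrintableKeepLayout(String s) {
        return s.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonPrintable("abc@gmail\u00e9.com"));
        System.out.println(stripNonPrintableKeepLayout("line one\r\n\tline two\u0007"));
    }
}
```

      If the input holds escape text (a backslash followed by "xa0") instead of the actual character, neither pattern will touch it, because every character of that text is printable ASCII.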





























      • Upvoted for its particular usefulness in mongo-land, to keep the shell from spewing ridiculous amounts of encoded non-ASCII stuff (mongo really, really prefers UTF-8 if you want things to be easy)

        – Mark Mullin
        Feb 1 '16 at 2:15






      • 2





        Got error: illegal escape character String clean = str.replaceAll("[^\n\r\t\p{Print}]", ""); . p should be P

        – Well Smith
        Apr 17 '18 at 15:35













      • Really helped me a lot. Thanks, @Ivan

        – Prinkal Kumar
        Jun 12 '18 at 5:01


























      answered Jul 15 '15 at 7:33









      Ivan Pavić
























      3














      You can try this code:



      public String cleanInvalidCharacters(String in) {
          if (in == null || in.isEmpty()) {
              return "";
          }
          StringBuilder out = new StringBuilder();
          // Walk the string by code point: a single char can never exceed 0xFFFF,
          // so the last range below would otherwise never match.
          for (int i = 0; i < in.length(); ) {
              int current = in.codePointAt(i);
              i += Character.charCount(current);
              if ((current == 0x9)
                      || (current == 0xA)
                      || (current == 0xD)
                      || ((current >= 0x20) && (current <= 0xD7FF))
                      || ((current >= 0xE000) && (current <= 0xFFFD))
                      || ((current >= 0x10000) && (current <= 0x10FFFF))) {
                  out.appendCodePoint(current);
              }
          }
          // Replace every remaining whitespace character (tab, CR, LF, ...) with a plain space.
          return out.toString().replaceAll("\\s", " ");
      }


      This works for me to remove invalid characters from a String.
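
      The same filter can be written with the ranges given names, iterating by code point so that the supplementary range can actually match. This is only a sketch: the helper names are hypothetical, and the final whitespace replacement is left out. The ranges are the ones from the code above, which correspond to the characters XML 1.0 allows. Be aware that this filter keeps U+00A0 and accented letters, so on its own it does not strip the non-ASCII characters from the question.

```java
final class XmlCharFilter {
    // Hypothetical helper names; the numeric ranges are taken from the answer above.
    static boolean isAllowedControl(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD; // tab, line feed, carriage return
    }

    static boolean isBelowSurrogates(int cp) {
        return cp >= 0x20 && cp <= 0xD7FF;
    }

    static boolean isAboveSurrogates(int cp) {
        return cp >= 0xE000 && cp <= 0xFFFD;
    }

    static boolean isSupplementary(int cp) {
        return cp >= 0x10000 && cp <= 0x10FFFF;
    }

    static String clean(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        // codePoints() yields valid surrogate pairs as one value and
        // unpaired surrogates as themselves, which the filter then rejects.
        in.codePoints()
                .filter(cp -> isAllowedControl(cp) || isBelowSurrogates(cp)
                        || isAboveSurrogates(cp) || isSupplementary(cp))
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```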

























      • 3





        That's a lot of magic numbers. How about extracting these clauses (especially the ranges) into aptly named local variables?

        – Philipp Reichart
        Jun 13 '12 at 18:48




























      answered Jun 13 '12 at 18:17









      Paulius Matulionis





















      1














      You can use java.text.Normalizer.
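
      One way this might be applied, as a sketch rather than what the answer necessarily had in mind: decompose the text with NFKD so that an accented letter becomes a base letter plus a combining mark and a no-break space becomes an ordinary space, then drop whatever is still outside ASCII. The class and method names are illustrative.

```java
import java.text.Normalizer;

final class AsciiFolder {
    // Folds text towards ASCII: compatibility-decompose first, then remove
    // every character that is still outside the ASCII range.
    static String toAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("abc@gmail\u00e9.com"));
    }
}
```

      This folds characters instead of deleting them, so an accented "e" survives as a plain "e" and U+00A0 survives as a space. Characters without an ASCII decomposition are removed, and ASCII control characters are kept, so a further pass may still be needed.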












































          answered Jun 13 '12 at 18:17









          exception























              0














              Input => "This \u7279text \u7279is what I need"
              Output => "This text is what I need"



              If you are trying to remove literal Unicode escape sequences from a string like the one above, this code will work:



              Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
              Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
              // replaceAll returns the input unchanged when nothing matches
              String cleanData = unicodeMatcher.replaceAll("");
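
              The same idea can be extended to the \xNN sequences in the question's sample data, assuming they are literal text (a backslash, an "x" and two hex digits) and not the characters themselves. That would also explain why \P{Print} had no effect there, since every character of such a sequence is printable ASCII. A minimal sketch with made-up names:

```java
import java.util.regex.Pattern;

final class EscapeTextStripper {
    // Matches a literal backslash followed by x and two hex digits,
    // or a literal backslash followed by u and four hex digits.
    private static final Pattern ESCAPE_TEXT =
            Pattern.compile("\\\\x\\p{XDigit}{2}|\\\\u\\p{XDigit}{4}");

    static String strip(String data) {
        return ESCAPE_TEXT.matcher(data).replaceAll("");
    }

    public static void main(String[] args) {
        // The doubled backslashes put literal backslash characters into the strings.
        System.out.println(strip("abc@gmail.com\\xa0\\xa0"));
        System.out.println(strip("This \\u7279text \\u7279is what I need"));
    }
}
```

              Compiling the pattern once and reusing it avoids recompiling the regex on every call.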











































                  answered May 10 '17 at 15:04









                  Sivaram Kandappan





























