Character encoding in Java
To convert a String to bytes, Java iterates over the characters in the
string, turns each one into one or more bytes, and finally concatenates the
bytes. The rule that maps each Unicode character to a byte sequence is
called a character encoding.
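As a rough illustration of how the encoding decides the bytes, the sketch below encodes one character with three different encodings. The class name EncodingRules and the helper encode are hypothetical, chosen only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original program.
class EncodingRules {
    // Encode the text with an explicitly chosen character encoding.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // U+00E9, LATIN SMALL LETTER E WITH ACUTE
        // The same character becomes different bytes under each encoding.
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8)));      // [-61, -87]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_16BE)));   // [0, -23]
    }
}
```

The character is the same in all three calls; only the mapping rule changes, and with it both the number and the values of the resulting bytes.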
So if the same character encoding is not used for both encoding and
decoding, the retrieved value may not be correct.
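A minimal sketch of such a mismatch, assuming text that is encoded as UTF-8 and then decoded as ISO-8859-1. The class MismatchDemo and the helper roundTrip are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
class MismatchDemo {
    // Encode with one charset, then decode the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // U+00E9, one character

        // Same charset on both sides: the text survives the round trip.
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        // Different charsets: each UTF-8 byte is decoded as a separate character.
        String mixed = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(same));  // true
        System.out.println(original.equals(mixed)); // false
        System.out.println(mixed.length());         // 2
    }
}
```

No exception is thrown in the mismatched case; the decoder simply produces different characters, which is why this kind of bug tends to go unnoticed until the text is displayed.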
When we call str.getBytes() without specifying a character encoding, the
JVM uses the platform's default character encoding to do the job. The default
encoding depends on the operating system and locale. On Linux it is typically
UTF-8, and on Windows with a US locale it is Cp1252. (Since JDK 18, the
default is UTF-8 on all platforms unless it is overridden, for example with
the file.encoding system property.)
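One way to see which encoding the JVM will pick, and to avoid depending on it, is sketched below. The class DefaultCharsetDemo and its two helpers are hypothetical; Charset.defaultCharset() and StandardCharsets.UTF_8 are standard JDK APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original program.
class DefaultCharsetDemo {
    // True when the no-argument getBytes() gives the same bytes as an
    // explicit call with the JVM's default charset.
    static boolean usesDefaultCharset(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    // Encoding with an explicit charset gives the same bytes on every platform.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The printed name varies with the OS, locale, and JDK version.
        System.out.println(Charset.defaultCharset());
        System.out.println(usesDefaultCharset("hello")); // true
        System.out.println(Arrays.toString(encodeUtf8("\u0097"))); // [-62, -105]
    }
}
```

Passing the charset explicitly, as encodeUtf8 does, is a common way to keep the output independent of the machine the program runs on.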
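For example:
import java.util.Arrays;

public class Test {
    public static void main(String[] args) throws Exception {
        // A string holding the single character U+0097
        char[] chars = new char[] {'\u0097'};
        String str = new String(chars);
        // Encode with the platform's default character encoding
        byte[] bytes = str.getBytes();
        System.out.println(Arrays.toString(bytes));
    }
}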
In this program, we first create a String from a character array that holds
the single character '\u0097', then get the byte array from that String and
print it. Since \u0097 fits within the 8 bits of the byte primitive type, it
is reasonable to guess that the str.getBytes() call will return a byte array
containing one element with the value -105 ((byte) 0x97). However, that’s
not what the program prints.
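As a matter of fact, the output of the program depends on the operating
system and locale. On Windows XP with the US locale, the above program prints
[63]: Cp1252 has no mapping for U+0097, so the character is replaced with '?'
(byte value 63). If you run the program on Linux or Solaris, you will get
different values; with a UTF-8 default, for example, it prints [-62, -105].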
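To reproduce these results without switching operating systems, the charset can be named explicitly. The following is only a sketch: the class U0097Demo and the helper encodeToString are hypothetical, and it assumes the running JVM provides the windows-1252 charset, which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class, not part of the original program.
class U0097Demo {
    // Encode the text with the named charset and format the bytes for printing.
    // Charset.forName throws UnsupportedCharsetException if the name is unknown.
    static String encodeToString(String text, String charsetName) {
        return Arrays.toString(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String str = "\u0097";
        // Assumption: windows-1252 (Cp1252) is available in this JVM.
        // It has no mapping for U+0097, so the encoder substitutes '?' (63).
        System.out.println(encodeToString(str, "windows-1252")); // [63]
        // UTF-8 needs two bytes for any character in the range U+0080..U+07FF.
        System.out.println(encodeToString(str, "UTF-8"));        // [-62, -105]
        // ISO-8859-1 maps U+0097 directly to the single byte 0x97.
        System.out.println(encodeToString(str, "ISO-8859-1"));   // [-105]
    }
}
```

Only ISO-8859-1 produces the single byte -105 that one might have guessed at first.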