How Do I Identify and Poor-Man's-Debug Whether Perl Thinks a Specific Variable Is in UTF-8 Format?
From man perluniintro:
How Do I Know Whether My String Is In Unicode?
You shouldn’t care. No, you really shouldn’t. No, really. If you have to care--beyond the cases described above--it means that we didn’t
get the transparency of Unicode quite right.
Okay, if you insist:
    print utf8::is_utf8($string) ? 1 : 0, "\n";
But note that this doesn’t mean that any of the characters in the string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the string has any characters at all. All "is_utf8()" does is
return the value of the internal "utf8ness" flag attached to the $string. If the flag is off, the bytes in the scalar are interpreted as a
single-byte encoding. If the flag is on, the bytes in the scalar are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
points of the characters. Bytes added to a UTF-8 encoded string are automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8
scalars are merged (double-quoted interpolation, explicit concatenation, and printf/sprintf parameter substitution), the result will be
UTF-8 encoded as if copies of the byte strings were upgraded to UTF-8: for example,
    $a = "ab\x80c";
    $b = "\x{100}";
    print "$a = $b\n";
the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but $a will stay byte-encoded.
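For poor-man's debugging, the flag behaviour described above can be made visible with a small dump routine. The following is only a sketch: describe_string() is a hypothetical helper (it is not part of Perl or of perluniintro), and the variable names are chosen for illustration. It reports the internal flag, the character length, and the code points of a string, using only utf8::is_utf8(), length(), and ord().

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a core function):
# summarize how Perl currently stores a string.
sub describe_string {
    my ($label, $string) = @_;
    my $flag  = utf8::is_utf8($string) ? 1 : 0;
    my $chars = join ' ',
        map { sprintf('U+%04X', ord($_)) } split //, $string;
    return sprintf '%s: utf8 flag=%d, length=%d, chars=[%s]',
        $label, $flag, length($string), $chars;
}

my $bytes = "ab\x80c";          # all code points <= 0xFF, flag stays off
my $wide  = "\x{100}";          # code point > 0xFF, flag is on
my $mixed = "$bytes = $wide";   # a copy of $bytes is upgraded for the merge

print describe_string('bytes', $bytes), "\n";
print describe_string('wide',  $wide),  "\n";
print describe_string('mixed', $mixed), "\n";

# The original byte string is untouched by the interpolation.
print describe_string('bytes again', $bytes), "\n";
```

Only ASCII text is printed, so no output layer is needed on STDOUT; printing $wide itself without one would trigger a "Wide character" warning.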
Sometimes you might really need to know the byte length of a string instead of the character length. For that use either the
"Encode::encode_utf8()" function or the "bytes" pragma and its only defined function "length()":
    my $unicode = chr(0x100);
    print length($unicode), "\n"; # will print 1
    require Encode;
    print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
    use bytes;
    print length($unicode), "\n"; # will also print 2
                                  # (the 0xC4 0x80 of the UTF-8)
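Because "use bytes" is lexically scoped, placing it at file level changes length() for everything that follows in that scope. A minimal sketch of one way to contain it, assuming you only need the byte count at a single point (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode ();

my $unicode = chr(0x100);

# Character length: one code point.
my $char_len = length($unicode);

# Byte length of the UTF-8 encoding, computed on an encoded copy.
my $utf8_len = length(Encode::encode_utf8($unicode));

# Byte length of Perl's internal buffer. The pragma is confined to the
# do block, so length() elsewhere in the file still counts characters.
my $internal_len = do { use bytes; length($unicode) };

print "$char_len $utf8_len $internal_len\n";   # prints "1 2 2"

# Outside the block, length() is back to counting characters.
print length($unicode), "\n";                  # prints 1
```

The Encode::encode_utf8() route is usually the safer of the two, since it does not depend on how Perl happens to store the string internally.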
Discussion