How do I Identify and Poor-man-debug if Perl thinks a Specific Variable is in UTF8 Format

From man perluniintro

How Do I Know Whether My String Is In Unicode?

You shouldn’t care. No, you really shouldn’t. No, really. If you have to care--beyond the cases described above--it means that we didn’t get the transparency of Unicode quite right.

Okay, if you insist:

               print utf8::is_utf8($string) ? 1 : 0, "\n";

But note that this doesn’t mean that any of the characters in the string are necessarily UTF-8 encoded, or that any of the characters have code points greater than 0xFF (255) or even 0x80 (128), or that the string has any characters at all. All that "is_utf8()" does is return the value of the internal "utf8ness" flag attached to the $string. If the flag is off, the bytes in the scalar are interpreted as a single-byte encoding. If the flag is on, the bytes in the scalar are interpreted as the (multi-byte, variable-length) UTF-8 encoded code points of the characters. Bytes added to a UTF-8 encoded string are automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted interpolation, explicit concatenation, and printf/sprintf parameter substitution), the result will be UTF-8 encoded as if copies of the byte strings were upgraded to UTF-8: for example,

               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";

the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but $a will stay byte-encoded.
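
As a poor man's debugging aid, the behaviour described above can be checked with a small script. This is only a sketch: "describe()" is a hypothetical helper written for this page, not part of Perl or of any module. It reports the flag, the character count and the code points, and never prints the string itself, so no "Wide character" warning is triggered.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): summarise a string as
# "flag=<0|1> chars=<n> [code points]".
sub describe {
    my ($str) = @_;
    my $flag   = utf8::is_utf8($str) ? 1 : 0;
    my $points = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
    return sprintf 'flag=%d chars=%d [%s]', $flag, length($str), $points;
}

# A plain ASCII string: the flag is off.
my $ascii = "abc";
print describe($ascii), "\n";    # flag=0 chars=3 [U+0061 U+0062 U+0063]

# utf8::upgrade() changes only the internal representation. The flag
# is now on although no character is above 0x7F.
utf8::upgrade($ascii);
print describe($ascii), "\n";    # flag=1 chars=3 [U+0061 U+0062 U+0063]

# Mixing a byte string with a wide-character string.
my $bytes  = "ab\x80c";
my $wide   = "\x{100}";
my $joined = "$bytes = $wide";

print describe($bytes),  "\n";   # flag=0 chars=4, still byte-encoded
print describe($wide),   "\n";   # flag=1 chars=1 [U+0100]
print describe($joined), "\n";   # flag=1 chars=8, \x80 is still one character
```

The flag differs between the two "abc" lines while the code points are identical, which is the point of the warning above: the flag describes the storage format, not the content.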

Sometimes you might really need to know the byte length of a string instead of the character length. For that use either the "Encode::encode_utf8()" function or the "bytes" pragma, under which the built-in "length()" counts bytes instead of characters:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               require Encode;
               print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
               use bytes;
               print length($unicode), "\n"; # will also print 2
                                             # (the 0xC4 0x80 of the UTF-8)
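
Because "use bytes" is lexically scoped, it can be confined to a block so that the rest of the file keeps character semantics. The sketch below assumes nothing beyond core Perl and the Encode module. The variable names are illustrative only. It also shows that the two approaches measure different things: "encode_utf8()" gives the length of the UTF-8 encoding, while "use bytes" gives the length of whatever Perl currently stores internally.

```perl
use strict;
use warnings;
use Encode ();

my $unicode = chr(0x100);    # stored internally as UTF-8 (0xC4 0x80)
my $latin   = "\xE9";        # stored internally as one byte, flag off

# Return the length of the internal buffer; "use bytes" applies
# only inside this block.
sub internal_length {
    my ($str) = @_;
    use bytes;
    return length($str);
}

print length($unicode), "\n";                          # 1 (characters)
print length(Encode::encode_utf8($unicode)), "\n";     # 2 (UTF-8 bytes)
print internal_length($unicode), "\n";                 # 2 (internal bytes)

# For a non-upgraded byte string the two byte counts disagree.
print length($latin), "\n";                            # 1 (characters)
print length(Encode::encode_utf8($latin)), "\n";       # 2 (UTF-8 bytes)
print internal_length($latin), "\n";                   # 1 (internal bytes)

# Outside the sub, length() counts characters again.
print length($unicode), "\n";                          # 1
```

If the question is how many bytes the string will occupy once written out as UTF-8, the "encode_utf8()" figure is the one to use. The "bytes" figure depends on the internal flag and is mainly useful for debugging.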

Topic revision: r3 - 2008-05-31 - CrawfordCurrie
