
How do I Identify and Poor-man-debug Whether Perl Thinks a Specific Variable is in UTF-8 Format

From man perluniintro

How Do I Know Whether My String Is In Unicode?

You shouldn’t care. No, you really shouldn’t. No, really. If you have to care--beyond the cases described above--it means that we didn’t get the transparency of Unicode quite right.

Okay, if you insist:

               print utf8::is_utf8($string) ? 1 : 0, "\n";
But note that this doesn’t mean that any of the characters in the string are necessarily UTF-8 encoded, or that any of the characters have code points greater than 0xFF (255) or even 0x80 (128), or that the string has any characters at all. All the "is_utf8()" does is to return the value of the internal "utf8ness" flag attached to the $string. If the flag is off, the bytes in the scalar are interpreted as a single-byte encoding. If the flag is on, the bytes in the scalar are interpreted as the (multi-byte, variable-length) UTF-8 encoded code points of the characters. Bytes added to a UTF-8 encoded string are automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted interpolation, explicit concatenation, and printf/sprintf parameter substitution), the result will be UTF-8 encoded as if copies of the byte strings were upgraded to UTF-8: for example,
               $a = "ab\x80c";
               $b = "\x{100}";
               print "$a = $b\n";
the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but $a will stay byte-encoded.
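
The flag behaviour described above can be checked directly. The following is a minimal sketch, not part of perluniintro; the variable names and the flag() helper are made up for illustration. It is meant to show that concatenation upgrades only the result, and that utf8::upgrade() can switch the flag on for a plain ASCII string without changing its characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): report the internal flag as 1 or 0.
sub flag { return utf8::is_utf8($_[0]) ? 1 : 0 }

my $bytes  = "ab\x80c";          # every code point <= 0xFF, so the flag is off
my $wide   = "\x{100}";          # a code point above 0xFF forces the flag on
my $joined = "$bytes = $wide\n"; # a copy of $bytes is upgraded for the result

printf "bytes:  flag=%d length=%d\n", flag($bytes),  length($bytes);
printf "wide:   flag=%d length=%d\n", flag($wide),   length($wide);
printf "joined: flag=%d length=%d\n", flag($joined), length($joined);

# The flag alone says nothing about the content: an ASCII-only string
# can carry it too, and still compares equal to its flag-off twin.
my $ascii    = "abc";
my $upgraded = $ascii;
utf8::upgrade($upgraded);
printf "ascii: flag=%d, upgraded: flag=%d, equal=%d\n",
    flag($ascii), flag($upgraded), ($ascii eq $upgraded) ? 1 : 0;
```

Note that $bytes keeps its flag off after the interpolation; only $joined carries it.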

Sometimes you might really need to know the byte length of a string instead of the character length. For that use either the "Encode::encode_utf8()" function or the "bytes" pragma, under which "length()" counts bytes rather than characters:

               my $unicode = chr(0x100);
               print length($unicode), "\n"; # will print 1
               require Encode;
               print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
               use bytes;
               print length($unicode), "\n"; # will also print 2
                                             # (the 0xC4 0x80 of the UTF-8)
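
For poor-man debugging, the pieces above can be combined into one small helper. This is only a sketch under stated assumptions: describe_string() is a hypothetical name, not something provided by Perl or Encode, and it reports the flag, the character length, the length of the UTF-8 encoding, and the code points of a scalar.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (an assumption for illustration): summarise how
# Perl currently holds a scalar, without modifying it.
sub describe_string {
    my ($s) = @_;
    my $flag       = utf8::is_utf8($s) ? 1 : 0;           # internal flag
    my $chars      = length($s);                          # character count
    my $utf8_bytes = length(Encode::encode_utf8($s));     # bytes once encoded
    my $points     = join ' ',
        map { sprintf('U+%04X', ord($_)) } split(//, $s); # one entry per character
    return "flag=$flag chars=$chars utf8_bytes=$utf8_bytes codepoints=$points";
}

print describe_string(chr(0x100)), "\n";
print describe_string("ab\x80c"), "\n";
```

The helper goes through Encode::encode_utf8() rather than "use bytes", so it never exposes the internal representation directly and the result does not depend on whether the flag is on.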


Topic revision: r3 - 2008-05-31 - CrawfordCurrie
 