GaijinPot

Thread: Japanese UTF-8

  1. #1
    Sensei jron
    Join Date
    Sep 2009
    Location
    Washington DC, USA
    Posts
    775

    Japanese UTF-8

    Does anyone know offhand whether Japanese characters are all 3 bytes long in UTF-8? That seems to be the case so far, but I haven't validated it exhaustively.
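One way to settle the 3-byte question is to tally encoded lengths over the actual text instead of guessing. Hiragana, katakana and the common kanji all sit in the range U+0800 to U+FFFF and so take 3 bytes each, but ASCII takes 1 byte, a few symbols such as the yen sign (U+00A5) take 2, and rare kanji outside the Basic Multilingual Plane take 4. The sketch below is only an illustration; the names `utf8_seq_len` and `utf8_length_histogram` are made up here and do not come from any library.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte b,
 * or 0 if b is a continuation byte or not a valid lead byte (RFC 3629). */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                 /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;   /* U+0080 .. U+07FF */
    if (b >= 0xE0 && b <= 0xEF) return 3;   /* U+0800 .. U+FFFF */
    if (b >= 0xF0 && b <= 0xF4) return 4;   /* U+10000 .. U+10FFFF */
    return 0;
}

/* Tally how many characters of each encoded length appear in the
 * NUL-terminated string s. counts[1..4] hold the tallies per length;
 * counts[0] counts bytes that do not start a well-formed sequence. */
static void utf8_length_histogram(const char *s, size_t counts[5])
{
    const unsigned char *p = (const unsigned char *)s;
    for (int i = 0; i < 5; i++) counts[i] = 0;

    while (*p) {
        int n = utf8_seq_len(*p);
        int ok = (n != 0);
        /* Every byte after the lead byte must look like 10xxxxxx.
         * A NUL terminator fails this check, so we never read past it. */
        for (int k = 1; ok && k < n; k++) {
            if ((p[k] & 0xC0) != 0x80) ok = 0;
        }
        if (!ok) { counts[0]++; p++; continue; }
        counts[n]++;
        p += n;
    }
}
```

Running this over the converted EDICT file (read into a buffer) and checking whether anything lands outside `counts[3]` would answer the question for that particular data set. Note that it only checks lead and continuation byte patterns; it does not reject overlong or surrogate encodings.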

    Also, I'll go ahead and ask: does anyone have a reasonable algorithm for looking up Japanese text in the EDICT dictionary for word recognition? I converted the dictionary to UTF-8 first. I have started by building a byte-level character tree and navigating it one byte at a time to find the words. Each word search starts at the top of the tree, and characters that aren't found are simply pushed straight to the output. Handling of verb forms, etc., is built into the tree structure.
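The byte-tree approach described above can be sketched roughly as follows. This is a guess at the general shape, not the actual code: `TrieNode`, `trie_insert`, `trie_longest_match` and `segment` are hypothetical names, the longest-match rule is an assumption, and verb-form handling is left out entirely. Each node has one child slot per byte value, which is simple but memory-hungry (256 pointers per node).

```c
#include <stdlib.h>
#include <string.h>

typedef struct TrieNode {
    struct TrieNode *child[256]; /* one slot per possible byte value */
    const char *entry;           /* non-NULL if a dictionary word ends here */
} TrieNode;

static TrieNode *trie_new(void) { return calloc(1, sizeof(TrieNode)); }

static void trie_free(TrieNode *n)
{
    if (!n) return;
    for (int i = 0; i < 256; i++) trie_free(n->child[i]);
    free(n);
}

/* Add a UTF-8 word, one byte per tree level. Returns 0 on success. */
static int trie_insert(TrieNode *root, const char *word, const char *entry)
{
    TrieNode *n = root;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        if (!n->child[*p] && !(n->child[*p] = trie_new())) return -1;
        n = n->child[*p];
    }
    n->entry = entry;
    return 0;
}

/* Walk from the top of the tree and remember the longest word seen.
 * Returns its length in bytes (0 if nothing matched) and sets *entry. */
static size_t trie_longest_match(const TrieNode *root, const char *text,
                                 const char **entry)
{
    const unsigned char *p = (const unsigned char *)text;
    const TrieNode *n = root;
    size_t best = 0, i = 0;
    *entry = NULL;
    while (p[i] && (n = n->child[p[i]]) != NULL) {
        i++;
        if (n->entry) { best = i; *entry = n->entry; }
    }
    return best;
}

/* Byte length of the character at s: lead byte plus continuation bytes. */
static size_t utf8_next(const char *s)
{
    size_t n = 1;
    while (((unsigned char)s[n] & 0xC0) == 0x80) n++;
    return n;
}

/* Copy text to out, wrapping each dictionary hit in [ ] and passing
 * unfound characters through unchanged. Returns the number of hits.
 * Stops early if out would overflow. */
static size_t segment(const TrieNode *root, const char *text,
                      char *out, size_t outsz)
{
    size_t hits = 0, o = 0;
    if (outsz == 0) return 0;
    while (*text) {
        const char *entry;
        size_t len = trie_longest_match(root, text, &entry);
        int is_hit = (len > 0);
        if (!is_hit) len = utf8_next(text);  /* skip one whole character */
        if (o + len + 3 > outsz) break;      /* room for [ ] and NUL */
        if (is_hit) out[o++] = '[';
        memcpy(out + o, text, len);
        o += len;
        if (is_hit) { out[o++] = ']'; hits++; }
        text += len;
    }
    out[o] = '\0';
    return hits;
}
```

With 日本 and 日本語 both in the tree, `segment` on 日本語です would produce [日本語]です: the longer word wins and the two trailing kana pass through. On a miss, skipping a whole character rather than a single byte keeps the next search aligned on a character boundary.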

    Thanks,
    Last edited by jron; 2009-11-27 at 12:41 PM.

  2. #2
    GrandMasterPot
    Join Date
    Feb 2009
    Posts
    1,116

    UTF-8 characters are not all 3 bytes; there are 2-byte UTF-8 characters as well. But I don't know whether there are any 2-byte Japanese characters. I think I read once that there are, but I'm not fully confident about that.

    Are you using PHP? If so, and you want to be doing proper searches on Japanese characters, you should set the internal encoding to EUC-JP, as it is the most reliable charset for Japanese in PHP. However, if you do this you will want to convert to Shift_JIS when outputting to the screen, as many devices don't display EUC-JP very well. And emails should always be in ISO-2022-JP.

  3. #3
    Sensei jron
    Join Date
    Sep 2009
    Location
    Washington DC, USA
    Posts
    775

    Quote: Originally posted by Effected After
    UTF-8 characters are not all 3 bytes; there are 2-byte UTF-8 characters as well. But I don't know whether there are any 2-byte Japanese characters. I think I read once that there are, but I'm not fully confident about that.

    Are you using PHP? If so, and you want to be doing proper searches on Japanese characters, you should set the internal encoding to EUC-JP, as it is the most reliable charset for Japanese in PHP. However, if you do this you will want to convert to Shift_JIS when outputting to the screen, as many devices don't display EUC-JP very well. And emails should always be in ISO-2022-JP.

    Yeah, UTF-8 as originally specified could be up to 6 bytes long, though the current standard (RFC 3629) limits it to 4. But I think all the Japanese characters are encoded as 3 bytes. I was going to validate this, but I thought I would ask here first because I'm lazy. All the parsing and checking is being done in C, because it was easier to build the tree in memory in a form that could then be dumped directly to disk based on byte offset counts.
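The "dump directly to disk" idea usually works by having nodes refer to each other by array index or byte offset instead of by pointer, so the in-memory image and the on-disk image are the same bytes. The sketch below is one possible layout under that assumption, not the layout actually used here; `FlatNode`, `FlatTrie` and the `flat_*` functions are hypothetical names, and `value` stands in for whatever payload is wanted (for example, a byte offset into the EDICT file).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* All fields are uint32_t, so the struct has no padding.
 * Index 0 is the root, so 0 doubles as "none" in the link fields. */
typedef struct {
    uint32_t first_child;   /* index of first child, 0 = none */
    uint32_t next_sibling;  /* index of next sibling, 0 = none */
    uint32_t value;         /* payload, 0 = no word ends here */
    uint32_t byte;          /* the byte this node matches */
} FlatNode;

typedef struct { FlatNode *nodes; uint32_t count, cap; } FlatTrie;

static int flat_init(FlatTrie *t)
{
    t->cap = 64;
    t->count = 1;                                   /* node 0 = root */
    t->nodes = calloc(t->cap, sizeof(FlatNode));
    return t->nodes ? 0 : -1;
}

static uint32_t flat_find_child(const FlatNode *nodes, uint32_t parent,
                                uint32_t byte)
{
    for (uint32_t c = nodes[parent].first_child; c; c = nodes[c].next_sibling)
        if (nodes[c].byte == byte) return c;
    return 0;
}

/* Append a fresh node; returns its index, or 0 on allocation failure. */
static uint32_t flat_new_node(FlatTrie *t, uint32_t byte)
{
    if (t->count == t->cap) {
        FlatNode *p = realloc(t->nodes, (size_t)t->cap * 2 * sizeof(FlatNode));
        if (!p) return 0;
        t->nodes = p;
        t->cap *= 2;
    }
    FlatNode *n = &t->nodes[t->count];
    n->first_child = n->next_sibling = n->value = 0;
    n->byte = byte;
    return t->count++;
}

static int flat_insert(FlatTrie *t, const char *word, uint32_t value)
{
    uint32_t cur = 0;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        uint32_t c = flat_find_child(t->nodes, cur, *p);
        if (!c) {
            if (!(c = flat_new_node(t, *p))) return -1;
            /* Link the new node at the head of the parent's child list. */
            t->nodes[c].next_sibling = t->nodes[cur].first_child;
            t->nodes[cur].first_child = c;
        }
        cur = c;
    }
    t->nodes[cur].value = value;
    return 0;
}

/* Exact-match lookup; returns the stored value or 0 if absent. */
static uint32_t flat_lookup(const FlatNode *nodes, const char *word)
{
    uint32_t cur = 0;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++)
        if (!(cur = flat_find_child(nodes, cur, *p))) return 0;
    return nodes[cur].value;
}

/* Write the node count followed by the raw node array. */
static int flat_save(const FlatTrie *t, FILE *f)
{
    if (fwrite(&t->count, sizeof t->count, 1, f) != 1) return -1;
    return fwrite(t->nodes, sizeof(FlatNode), t->count, f) == t->count ? 0 : -1;
}

static int flat_load(FlatTrie *t, FILE *f)
{
    uint32_t n;
    if (fread(&n, sizeof n, 1, f) != 1 || n == 0) return -1;
    if (!(t->nodes = malloc((size_t)n * sizeof(FlatNode)))) return -1;
    if (fread(t->nodes, sizeof(FlatNode), n, f) != n) {
        free(t->nodes);
        t->nodes = NULL;
        return -1;
    }
    t->count = t->cap = n;
    return 0;
}
```

Because nothing in the array is a pointer, the loaded copy can be searched immediately with `flat_lookup`, with no fix-up pass. The file is written in the host's byte order, so it is only portable between machines of the same endianness. Sibling lists are searched linearly, which trades lookup speed for a much smaller node than a 256-slot child table.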

    One of the reasons I'm using UTF-8 is that my Unix box doesn't seem to like displaying other encodings much. UTF-8 seems to work just fine in all the browsers I tried, as well as on the Unix command line and in vi. Is there any particular situation where UTF-8 doesn't work well?

    Just FYI, I'm new to international character encoding. I've been coding for 25 years and never had to deal with it, but I figured it was a hole in my knowledge that I needed to fill a bit.

    Thanks,
    Last edited by jron; 2009-11-27 at 09:58 PM.

  4. #4
    GrandMasterPot
    Join Date
    Feb 2009
    Posts
    1,116

    Some devices, like my cell phone, have trouble with some UTF-8 kanji, but for the most part it's OK. I do get the occasional garbling, though.

    But PC-based browsers all seem to be fine with UTF-8.

  5. #5
    Senior Member
    Join Date
    Sep 2008
    Posts
    185

    The big problem with modern mobiles and UTF-8 is mostly incomplete fonts rather than the actual decoding.

  6. #6
    Sensei jron
    Join Date
    Sep 2009
    Location
    Washington DC, USA
    Posts
    775

    Well, my target for this will be hiragana- and katakana-only output to an iPhone.

    Where I am really having to deal with this is in building a parse tree for the characters, to map incoming kanji text to the EDICT dictionary.
