
Unicode <-> Big5 sample / utility

Snappy! (H/PC Elder, New Mexico, US) - Posted 2006-01-27 7:41 PM
Hi all,

I recently wrote a memopad app "HPCMemo" for my HPC J720. TomboRoot did the job but I got tired of trying to get it to display and save Chinese (Big5) characters properly. So I wrote the above app and in the process wrote a simple class to read, save and convert text files between Big5 and Unicode.

Attached is a screen shot. Will post the code and/or binary later if anyone wants.

Attachments
----------------
hpcmemo.JPG (26KB - 6 downloads)
cmonex (H/PC Oracle, Budapest, Hungary) - Posted 2006-01-27 8:16 PM
i'm interested
takwu (H/PC Elder, BC, Canada) - Posted 2006-01-28 3:50 PM
Snappy! - 2006-01-27 4:41 PM
. . . a simple class to convert, read, save text files from Big5 to and fro Unicode.

I'm not exactly sure what you mean. So if you have a Unicode text file, it can convert the Chinese characters to Big5 code and save as a normal (single byte) text file? And vice versa? Hmm, interesting...
Snappy! (H/PC Elder, New Mexico, US) - Posted 2006-01-28 7:07 PM
takwu - 2006-01-28 1:50 PM

Snappy! - 2006-01-27 4:41 PM
. . . a simple class to convert, read, save text files from Big5 to and fro Unicode.

I'm not exactly sure what you mean. So if you have a Unicode text file, it can convert the Chinese characters to Big5 code and save as a normal (single byte) text file? And vice versa? Hmm, interesting...


Yeah ... Big5 is a multi-byte character set (MBCS), also called a double-byte character set (DBCS). Standard ASCII characters like A-Z etc. are represented as one byte, while Chinese characters are represented as 2 bytes: a lead byte followed by a second byte, usually called the trail byte.
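
To make the lead byte idea concrete, here is a minimal C sketch that counts characters (not bytes) in a Big5 buffer. It assumes the code page 950 convention that lead bytes fall in the range 0x81-0xFE; the function names are made up for illustration and are not taken from the HPCMemo class.

```c
#include <stddef.h>

/* Assumption: code page 950 style Big5, where a byte in 0x81-0xFE
   starts a double-byte character. */
int big5_is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Counts characters in a Big5 buffer of len bytes.
   A lead byte consumes the byte after it as well. */
size_t big5_char_count(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;
    while (i < len) {
        if (big5_is_lead_byte(s[i]) && i + 1 < len)
            i += 2;   /* double-byte character */
        else
            i += 1;   /* single byte (or a truncated lead byte at the end) */
        count++;
    }
    return count;
}
```

A 4-byte buffer holding 'A', one double-byte character and 'B' would therefore count as 3 characters.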

UNICODE on wince stores every character here as a 2-byte unit, low byte first, so an A (ASCII 65) is [65][0] with a trailing zero byte. A "NULL" is represented as [0][0] in UNICODE.
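
As a rough illustration of the [65][0] layout, the following C sketch widens a plain ASCII string into 2-byte units stored low byte first, including the [0][0] terminator. The function names are hypothetical, and it only handles 7-bit ASCII input; it is not a general converter.

```c
#include <stddef.h>

/* Stores one 16-bit unit as two bytes, low byte first
   (the [65][0] layout described above). */
void put_u16le(unsigned char out[2], unsigned int unit)
{
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)((unit >> 8) & 0xFF);
}

/* Widens a 7-bit ASCII string to 2-byte units, terminator included.
   Returns the number of bytes written, or 0 if out is too small. */
size_t ascii_to_u16le(const char *src, unsigned char *out, size_t out_size)
{
    size_t n = 0;
    do {
        if (n + 2 > out_size)
            return 0;
        put_u16le(out + n, (unsigned char)*src);
        n += 2;
    } while (*src++ != '\0');
    return n;
}
```

For the string "A" this writes four bytes: [65][0] for the letter and [0][0] for the terminator.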

In any case, the class I wrote allows you to convert between Big5 and UNICODE.

This is needed because the display portion in wince is in UNICODE. In fact, the whole wince thingie is in UNICODE. So if you throw a Big5 string at the system, you get strange characters. The MBCS to UNICODE conversion APIs available in eVC would have worked fine ... if only the code pages were available! Big5 code page 950 is not valid by default, so calls to all those MBCS/UNICODE APIs with code page 950 will basically give you rubbish or an error.

I tried to find some means to import the code page support, i.e. to get a conversion table that wince itself supports, but gave up and decided it's faster to just write my own.

Anyhow, it's a very simple class doing a table lookup. There are ways to optimize the search, such as

1. Having a primary Big5 lead byte lookup table in front of the actual lookup table. The conversion table is already sorted by Big5 code, so this would speed up the Big5->UNICODE conversion.
2. Having a separate UNICODE->Big5 table, sorted by UNICODE lead byte, and applying #1 to it would speed up the UNICODE->Big5 conversion.
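
A possible sketch of idea #2 in C: copy the table, sort the copy by Unicode value, and search that copy for the reverse direction. Note that this uses the standard library's qsort and bsearch (a binary search) as a simple stand-in for the lead byte index described in #1. The MapEntry struct and the function names are assumptions for illustration, not the actual class.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory record; field names are illustrative. */
typedef struct {
    unsigned short big5;
    unsigned short unicode;
} MapEntry;

static int cmp_by_unicode(const void *a, const void *b)
{
    unsigned short ua = ((const MapEntry *)a)->unicode;
    unsigned short ub = ((const MapEntry *)b)->unicode;
    return (ua > ub) - (ua < ub);
}

/* Returns a malloc'd copy of the table sorted by Unicode value,
   or NULL on allocation failure. The caller frees it. */
MapEntry *build_reverse_table(const MapEntry *table, size_t count)
{
    MapEntry *rev = (MapEntry *)malloc(count * sizeof *rev);
    if (rev == NULL)
        return NULL;
    memcpy(rev, table, count * sizeof *rev);
    qsort(rev, count, sizeof *rev, cmp_by_unicode);
    return rev;
}

/* Looks up a Unicode value in the reverse table.
   Returns the Big5 code, or 0 if there is no mapping. */
unsigned short unicode_to_big5(const MapEntry *rev, size_t count,
                               unsigned short u)
{
    MapEntry key;
    const MapEntry *hit;
    key.big5 = 0;
    key.unicode = u;
    hit = (const MapEntry *)bsearch(&key, rev, count, sizeof *rev,
                                    cmp_by_unicode);
    return hit ? hit->big5 : 0;
}
```

The cost is a second copy of the table in memory, which may or may not be acceptable on an H/PC.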

I'm right now fiddling with the hpcmemo app so am not quite bothered with optimization as yet. That will be for the beta phase where I clean up and optimize the code.

I've already "optimized" it a little by scanning the table in 4-byte jumps (one record at a time) instead of the earlier 1-byte sequential lookup.

In case anyone has an easier built-in solution, post it here too. This class would still be an interesting exercise for those who want to implement their own code page conversion, or to enable it on devices that do not have native support.
Snappy! (H/PC Elder, New Mexico, US) - Posted 2006-01-28 7:34 PM
ok, here's the binary and two test files. text.txt - big5 encoded, textU.txt - unicode encoded.

I'll post the source for the binary testunicode.exe in a while. - DONE.

PS: The binary is currently only in ARM, HPC2000 format. So, J720 folks can try it out first ...

btw, big5uni.dat is the binary conversion table. It needs to be in the "\" directory.

Edited by Snappy! 2006-01-28 7:40 PM

Attachments
----------------
testunicode.zip (61KB - 10 downloads)
big5uni.dat (53KB - 10 downloads)
C:Amie (Administrator, H/PC Oracle, United Kingdom) - Posted 2006-01-28 9:34 PM
Nice job, are you using adoce for the back-end?
Snappy! (H/PC Elder, New Mexico, US) - Posted 2006-01-29 9:04 AM
C:Amie - 2006-01-28 7:34 PM

Nice job, are you using adoce for the back-end?


No ... it's a simple binary flat-file database. ...

Here's the structure of the big5uni.dat file:

4byte-record
4byte-record
4byte-record
4byte-record
...
4byte-record
4byte-record
4byte-record

Each 4byte-record is a pair:

2byte-ANSI (the Big5 code) 2byte-UNICODE

That's all folks.
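
For anyone curious what a lookup over such a table could look like, here is a minimal C sketch of a record-by-record scan. The byte order inside each record is not stated above, so the sketch assumes [Big5 lead][Big5 trail][Unicode low][Unicode high]; check that against the real big5uni.dat before relying on it. Function names are hypothetical.

```c
#include <stddef.h>

#define RECORD_SIZE 4

/* Assumed record layout (not confirmed by the file description):
   [big5 lead][big5 trail][unicode low byte][unicode high byte]. */
unsigned short big5_to_unicode(const unsigned char *table, size_t table_bytes,
                               unsigned char lead, unsigned char trail)
{
    size_t off;
    /* Step through the table one 4-byte record at a time. */
    for (off = 0; off + RECORD_SIZE <= table_bytes; off += RECORD_SIZE) {
        const unsigned char *rec = table + off;
        if (rec[0] == lead && rec[1] == trail)
            return (unsigned short)(rec[2] | (rec[3] << 8));
    }
    return 0;   /* no mapping found */
}

/* Converts a Big5 buffer to 16-bit units. Single bytes pass through
   unchanged; unmapped double-byte characters become '?'.
   Returns the number of units written to dst. */
size_t big5_string_to_unicode(const unsigned char *table, size_t table_bytes,
                              const unsigned char *src, size_t src_len,
                              unsigned short *dst, size_t dst_cap)
{
    size_t i = 0, n = 0;
    while (i < src_len && n < dst_cap) {
        if (src[i] >= 0x81 && src[i] <= 0xFE && i + 1 < src_len) {
            unsigned short u = big5_to_unicode(table, table_bytes,
                                               src[i], src[i + 1]);
            dst[n++] = u ? u : 0x003F;
            i += 2;
        } else {
            dst[n++] = src[i++];
        }
    }
    return n;
}
```

The scan is linear, so every character costs up to a full pass over the table; that is what the primary lookup idea is meant to cut down.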

For anyone who wants to implement the primary lookup for optimizing conversion, here's the extra "lookup header" block to implement:

[PRIMARY LOOKUP BLOCK]
[BIG5-UNICODE TABLE] (As above)

Structure of [PRIMARY LOOKUP BLOCK]
5byte record
5byte record
...
5byte record

each 5byte record has
1byte-4byte

1byte: Big5 lead byte
4byte: offset to start of block in [BIG5-UNICODE TABLE]
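
The primary lookup block could be built and used roughly as in the C sketch below. It is only an illustration under stated assumptions: the table is sorted with the Big5 lead byte as the first byte of each 4-byte record, the 4-byte offset is stored low byte first and counted from the start of [BIG5-UNICODE TABLE], and all function names are invented.

```c
#include <stddef.h>

#define REC 4   /* bytes per table record */
#define IDX 5   /* bytes per primary lookup record */

/* Writes one 5-byte record per distinct lead byte into idx.
   Returns the number of index records written (0 if idx is too small
   or the table is empty). Assumes the table is sorted by Big5 code. */
size_t build_lead_index(const unsigned char *table, size_t table_bytes,
                        unsigned char *idx, size_t idx_cap_bytes)
{
    size_t off, n = 0;
    for (off = 0; off + REC <= table_bytes; off += REC) {
        if (off == 0 || table[off] != table[off - REC]) {
            unsigned char *r;
            if ((n + 1) * IDX > idx_cap_bytes)
                return 0;
            r = idx + n * IDX;
            r[0] = table[off];                          /* lead byte */
            r[1] = (unsigned char)(off & 0xFF);         /* offset, low first */
            r[2] = (unsigned char)((off >> 8) & 0xFF);
            r[3] = (unsigned char)((off >> 16) & 0xFF);
            r[4] = (unsigned char)((off >> 24) & 0xFF);
            n++;
        }
    }
    return n;
}

/* Finds the block for the lead byte, then scans only that block.
   Returns the Unicode value, or 0 if there is no mapping. */
unsigned short indexed_big5_to_unicode(const unsigned char *idx, size_t idx_count,
                                       const unsigned char *table, size_t table_bytes,
                                       unsigned char lead, unsigned char trail)
{
    size_t i, off;
    for (i = 0; i < idx_count; i++) {
        const unsigned char *r = idx + i * IDX;
        if (r[0] != lead)
            continue;
        off = (size_t)r[1] | ((size_t)r[2] << 8) |
              ((size_t)r[3] << 16) | ((size_t)r[4] << 24);
        for (; off + REC <= table_bytes && table[off] == lead; off += REC)
            if (table[off + 1] == trail)
                return (unsigned short)(table[off + 2] | (table[off + 3] << 8));
        break;
    }
    return 0;
}
```

With this in place a lookup touches the small index first and then only the records sharing the same lead byte, instead of the whole table.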
C:Amie (Administrator, H/PC Oracle, United Kingdom) - Posted 2006-01-29 11:24 AM
Good call. Makes for an easy backup solution for the system in the absence of a sync module.