The UCData Unicode Character Properties and Bidi Algorithm Package
This library is freeware.
Download source [Version: 2.9, Date: 25 March 2005]: (.tar.bz2) (.tar.gz) (.zip)
The UCData library provides code to generate and use a compact set of
small databases of Unicode character properties. These properties include:
- Whether characters are alphabetic, numeric, etc. (like isalpha()).
- Upper, lower, and title case mappings.
- Numeric value of characters representing numbers.
- Combining class values for combining characters.
- Decompositions for characters that are a combination of other characters.
- Compositions of characters.
- Some support for UTF-8 characters (character properties and downcasing).
- Conversion between UTF-32 and UTF-8.
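As an illustration of the last item, here is a minimal sketch of encoding a
single UTF-32 code point as UTF-8. The function name utf32_to_utf8 and its
signature are assumptions made for this example; they are not taken from the
UCData API, which may expose this conversion differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the UCData API): encode one UTF-32
 * code point as UTF-8. Writes up to 4 bytes into out and returns the
 * number of bytes written, or 0 if cp is a surrogate or above U+10FFFF.
 */
size_t utf32_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

A caller would pass each code point in turn and append the returned number of
bytes to its output buffer, treating a return value of 0 as an error.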
A Java class with most of the same API is provided and uses the same data
files as the C library, loading the files from a base URL.
UCData includes two other items that may be of interest:
- A bidirectional reordering algorithm (the PGBA, described in more
detail below) that works entirely from character properties, rather than
from properties and levels as in the Unicode bidi reordering reference
implementation.
- A Tuned Boyer-Moore implementation for UTF-8 text that supports
case-insensitive searches in a relatively efficient manner.
UCData API Documentation.
Pretty Good Bidi Algorithm
The Pretty Good Bidi Algorithm is a small, fairly simple, reasonably
fast implicit bidirectional reordering algorithm that works pretty
well. The purpose of this implementation is to demonstrate that approaches
other than those adhering strictly to the Unicode reference algorithm are
possible.
As far as it has been tested, this implementation produces the same results as
the reference bidi reordering implementation provided at
http://www.unicode.org/reports/tr9/BidiReferenceJava/. More involved
testing will be done prior to the release of version 3.0.
The PGBA currently handles only implicit reordering of Unicode text and
does not yet handle the explicit bidi codes LRE, RLE, LRO, RLO, and PDF.
Some things about the PGBA:
- Currently, the results differ from the Unicode reference implementation in
that symmetric characters such as parentheses and square brackets are already
swapped in the reordered string. This behavior may be removed starting in
version 3.0.
- The reordering algorithm does not currently make full use of the
character properties provided in the ucdata package (see
ucdata.h) but might benefit from doing so. This exercise is left
to any developer who wants to run with it.
- A new command line program called testpgba has been provided
to help visualize the results of reordering and, optionally, cursor motion
through the reordered string. See the testpgba.txt file for
details on running the program.
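To make the first point above concrete, here is a rough sketch of what
swapping symmetric characters during reordering looks like for a single
right-to-left run. It is not the PGBA code: the names mirror_of and
reverse_rtl_run are hypothetical, and the mirroring table covers only a few
ASCII pairs, where a real implementation would consult the Unicode mirroring
data.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the mirrored counterpart of a code point.
 * Only a handful of ASCII pairs are covered for illustration.
 */
static uint32_t mirror_of(uint32_t cp)
{
    switch (cp) {
    case '(': return ')';
    case ')': return '(';
    case '[': return ']';
    case ']': return '[';
    case '{': return '}';
    case '}': return '{';
    case '<': return '>';
    case '>': return '<';
    default:  return cp;
    }
}

/*
 * Hypothetical helper: convert one right-to-left run from logical to
 * visual order in place. Each symmetric character is replaced by its
 * mirror, then the whole run is reversed.
 */
void reverse_rtl_run(uint32_t *run, size_t len)
{
    size_t i, j;

    if (len == 0)
        return;

    /* Swap symmetric characters so they still face the right way. */
    for (i = 0; i < len; i++)
        run[i] = mirror_of(run[i]);

    /* Reverse the run to obtain left-to-right visual order. */
    for (i = 0, j = len - 1; i < j; i++, j--) {
        uint32_t tmp = run[i];
        run[i] = run[j];
        run[j] = tmp;
    }
}
```

With this approach the logical run U+05D0 ( U+05D1 ) becomes
( U+05D1 ) U+05D0 in visual order, so the caller does not need a separate
mirroring pass at display time.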
PGBA API Documentation.
Tuned Boyer-Moore implementation for UTF-8 text
This implementation of a Tuned Boyer-Moore routine for UTF-8 text capable of
case-insensitive matching was written in response to a complaint that one
wasn't readily available that developers could use as a reference or simply
incorporate in their projects. Although this implementation depends on the
underlying UCData library, it can easily be retrofitted to use some other
character property lookup library. It is also reasonably fast, although its
speed could probably be improved.
I will leave it to other developers to make improvements and share them, just
as the basic routine has been shared with them.
No documentation of the API is available yet, but there is an example at the
end of the utf8bm.c file.
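For readers who want a feel for the general technique before opening
utf8bm.c, the following is a simplified sketch and not the code from that
file. It uses the Horspool variant of Boyer-Moore rather than the tuned
variant, and it folds case with tolower() on single bytes, so only ASCII
letters are matched case-insensitively (assuming the default "C" locale);
other UTF-8 bytes are compared exactly. The function name bmh_search_nocase
is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Fold one byte to lower case; only ASCII letters change in the C locale. */
static unsigned char fold(char c)
{
    return (unsigned char)tolower((unsigned char)c);
}

/*
 * Hypothetical sketch (not the utf8bm.c API): case-insensitive
 * Boyer-Moore-Horspool search over bytes. Returns the byte offset of
 * the first match of pat in text, or -1 if there is none.
 */
long bmh_search_nocase(const char *text, size_t tlen,
                       const char *pat, size_t plen)
{
    size_t skip[256];
    size_t i, pos;

    if (plen == 0)
        return 0;
    if (plen > tlen)
        return -1;

    /* Default shift: the whole pattern length. */
    for (i = 0; i < 256; i++)
        skip[i] = plen;

    /* For every pattern byte except the last, record its distance
       from the end of the pattern (using the folded byte). */
    for (i = 0; i + 1 < plen; i++)
        skip[fold(pat[i])] = plen - 1 - i;

    pos = 0;
    while (pos + plen <= tlen) {
        size_t k = plen;

        /* Compare right to left on folded bytes. */
        while (k > 0 && fold(text[pos + k - 1]) == fold(pat[k - 1]))
            k--;
        if (k == 0)
            return (long)pos;

        /* Shift by the entry for the text byte under the pattern's end. */
        pos += skip[fold(text[pos + plen - 1])];
    }
    return -1;
}
```

Because UTF-8 is self-synchronizing, a byte-level match of a valid UTF-8
pattern always falls on a character boundary. Full Unicode case folding
needs a character property lookup such as the one UCData provides, which is
where this sketch and the real implementation part ways.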