XConcord
The XConcord program is a concordance tool that allows
KWIC (Key Word In Context) searches to be done in text in as many as
17 languages. It is designed to be easy to work with so that teachers
and students can use XConcord in the classroom to identify relevant
texts by viewing words and expressions in context. Searching is quick
and the size of the corpus is limited only by available disk
space. Using an implementation of the Boyer-Moore search algorithm
specially adapted for wide characters, XConcord can search at over
1 MB per second, eliminating the need for pre-indexing for many
moderate-scale corpora.
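As a rough illustration of this style of search, the sketch below uses the Horspool simplification of Boyer-Moore rather than the full algorithm, and it is not XConcord's actual implementation. The function name horspool_find_all is hypothetical. The shift table is kept in a dictionary because a wide-character alphabet is too large for the usual 256-entry array; that choice is an assumption about one reasonable adaptation, not a description of what XConcord does.

```python
def horspool_find_all(text, pattern):
    """Return the start offset of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # Bad-character shifts, stored sparsely: only characters that occur in
    # the pattern (excluding its last position) get an entry. Any other
    # character lets the window jump by the full pattern length.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= len(text) - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character under the pattern's last slot.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(horspool_find_all("the cat sat on the mat", "at"))
```

Because most text characters do not occur in the pattern, the window usually advances by the whole pattern length, which is why this family of algorithms can scan unindexed text quickly.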
Searching is very flexible. Users can match any string against any part of a word or phrase. Users can also restrict the results to those concordance lines whose context to the left or right of the keyword contains, or does not contain, specified strings.
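The context filtering described above can be sketched as follows. This is a minimal illustration only: the function kwic and its parameters width, left_has, and right_lacks are hypothetical names, and the fixed-width character window is an assumption about how "context" might be measured.

```python
def kwic(text, keyword, width=20, left_has=None, right_lacks=None):
    """Return (left, keyword, right) tuples for each hit that passes the filters."""
    if not keyword:
        return []
    rows = []
    pos = text.find(keyword)
    while pos != -1:
        end = pos + len(keyword)
        left = text[max(0, pos - width):pos]
        right = text[end:end + width]
        keep = True
        # Keep the hit only if the required string appears to the left.
        if left_has is not None and left_has not in left:
            keep = False
        # Drop the hit if the excluded string appears to the right.
        if right_lacks is not None and right_lacks in right:
            keep = False
        if keep:
            rows.append((left, keyword, right))
        pos = text.find(keyword, pos + 1)
    return rows


if __name__ == "__main__":
    sample = "the black cat sat; the white cat ran"
    for left, key, right in kwic(sample, "cat", width=10, left_has="white"):
        print(f"{left:>10} [{key}] {right}")
```

Filters for "missing on the left" or "present on the right" would follow the same pattern with the conditions swapped.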
XConcord shows the results in a KWIC display and also, as seen in the smaller bottom window in Figure 1, the complete sentence for the selected KWIC line. The complete document is displayed in yet another window. Individual sentences or complete documents can easily be saved to new text files. Users can then edit these files or use XConcord to print the results.
Highlights
Implementation
XConcord can be run as a stand-alone program or integrated with
applications such as Oleada or Cibola. Internally, the search engine
of XConcord is designed to support either a command line or
programmatic interface. XConcord itself uses the programmatic
interface, but the search engine is an independent module. Other
programs can incorporate the search capabilities of XConcord either
by using the search engine as a subroutine library or by invoking the
command line version of the program. The availability of these
different interfaces should make the rapid integration of
XConcord-like capabilities into other programs relatively
painless.
To improve speed and decrease machine load, XConcord uses the virtual memory interface provided by mmap on the Sun to access the source text. This allows the operating system to fully overlap disk transfer with searching. On machines with sufficient real memory, this asynchronous input yields very high effective search speeds, since the entire source text is kept in memory if possible. On machines without sufficient real memory, or in the case of very large files, the virtual memory mechanism allows performance to degrade gracefully. In the limit, performance approaches what conventional file I/O could provide. Since all files are mapped read-only, no swap space is required for the searched text; the original file provides the backing store instead.
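The read-only mapping approach can be sketched with Python's standard mmap module. This is only an illustration of the technique, not XConcord's code; the function name find_in_file is hypothetical, and the search works on raw bytes, so the needle must already be in the file's encoding.

```python
import mmap


def find_in_file(path, needle):
    """Return byte offsets of needle in the file at path, using a read-only mapping."""
    if not needle:
        raise ValueError("needle must be a non-empty bytes object")
    offsets = []
    with open(path, "rb") as f:
        # ACCESS_READ maps the file read-only, so the file itself backs the
        # pages and the operating system pages data in on demand.
        # Note: mapping a zero-length file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(needle, pos + 1)
    return offsets
```

The search loop never reads the file explicitly. It simply touches the mapped bytes and leaves caching and read-ahead to the operating system.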
Currently XConcord supports all of the common coding schemes and the most popular input methods for Chinese and Japanese. We plan to add the ability to handle and display text encoded using Unicode as the newer versions of our multi-lingual text widget stabilize.
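One practical consequence of supporting several coding schemes is that the same keyword has a different byte sequence in each one, so a byte-level search pattern has to be expressed in the corpus's own encoding. The sketch below illustrates this with Python's built-in codecs. The helper name encode_keyword is hypothetical, and the codec names are Python's, not identifiers used by XConcord.

```python
def encode_keyword(keyword, encodings):
    """Map each encoding name to the keyword's bytes, or None if not representable."""
    encoded = {}
    for name in encodings:
        try:
            encoded[name] = keyword.encode(name)
        except UnicodeEncodeError:
            # The keyword contains characters this coding scheme cannot express.
            encoded[name] = None
    return encoded


if __name__ == "__main__":
    for name, data in encode_keyword("中文", ["gb2312", "big5", "latin-1"]).items():
        print(name, data)
```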
Some of the input types supported are:
- Arabic
- Armenian
- Cyrillic
- Ethiopic
- Farsi/Pashto
- Georgian
- Hebrew
- Japanese
- Korean
- Lao
- Latin-1
- SerboCroat
- Simplified Chinese
- Thai
- Traditional Chinese
- Vietnamese