CONCORDANCE


XConcord

XConcord is a concordance tool for running KWIC (Key Word In Context) searches over text in as many as 17 languages. It is designed to be easy to work with, so that teachers and students can use it in the classroom to identify relevant texts by viewing words and expressions in context. Searching is quick, and the size of the corpus is limited only by available disk space. Using an implementation of the Boyer-Moore search algorithm specially adapted for wide characters, XConcord can search at over 1 MB per second, which eliminates the need for pre-indexing on many moderate-scale corpora.
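
The idea behind this kind of search can be illustrated with a short sketch. The Python code below is a simplified Horspool variant of Boyer-Moore, not XConcord's actual implementation, and the function name horspool_find_all is hypothetical. It assumes the text is already decoded into characters and keeps the bad-character shift table in a dictionary, which is one plausible way to handle alphabets much larger than 256 symbols.

```python
def horspool_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.

    Illustrative Horspool simplification of Boyer-Moore (an assumption,
    not XConcord's code). Works on decoded characters, so multi-byte
    scripts are compared one character at a time.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad-character table: for each character in the pattern (except the
    # last), how far the window may slide when that character is aligned
    # with the window's final position. A dict stays small even when the
    # alphabet has tens of thousands of characters.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    hits = []
    pos = 0
    while pos <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Slide by the table entry for the window's last character;
        # characters absent from the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

Because most windows are rejected after a single comparison and then skipped by up to the full pattern length, a search of this kind touches only a fraction of the text, which is what makes scanning without an index practical.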

Searching is very flexible. Users can match any string with any part of a word or phrase. Users can also limit the search to only those concordances either containing or missing specified strings in the context to the left or right of the keyword.
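
The context filters described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not XConcord's interface: the names KwicLine, kwic, and format_line, the fixed context width, and the four filter parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KwicLine:
    left: str      # context before the keyword
    keyword: str   # the matched string
    right: str     # context after the keyword


def kwic(text: str, keyword: str, width: int = 30,
         left_has: Optional[str] = None, left_lacks: Optional[str] = None,
         right_has: Optional[str] = None, right_lacks: Optional[str] = None):
    """Collect KWIC lines, keeping only hits whose context passes the filters."""
    lines = []
    if not keyword:
        return lines
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        # A filter set to None is ignored; otherwise the string must be
        # present in ("has") or absent from ("lacks") that side's context.
        keep = (
            (left_has is None or left_has in left)
            and (left_lacks is None or left_lacks not in left)
            and (right_has is None or right_has in right)
            and (right_lacks is None or right_lacks not in right)
        )
        if keep:
            lines.append(KwicLine(left, keyword, right))
        start = text.find(keyword, start + 1)
    return lines


def format_line(line: KwicLine, width: int = 30) -> str:
    """Right-align the left context so keywords line up in a column."""
    return f"{line.left:>{width}} [{line.keyword}] {line.right}"
```

For example, kwic(text, "cat", left_has="old") would keep only those occurrences of "cat" that have "old" somewhere in the preceding context window.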

XConcord shows the results in a KWIC display and also, as seen in the smaller bottom window in Figure 1, the complete sentence for the selected KWIC line. The complete document is displayed in a third window. Easy methods are provided for saving individual sentences or complete documents to new text files. Users can then edit these files or use XConcord to print the results.

Highlights

Implementation

XConcord can be run as a stand-alone program or integrated with applications such as Oleada or Cibola. Internally, the search engine of XConcord is designed to support either a command line or programmatic interface. XConcord itself uses the programmatic interface, but the search engine is an independent module. Other programs can incorporate the search capabilities of XConcord either by using the search engine as a subroutine library or by invoking the command line version of the program. The availability of these different interfaces should make the rapid integration of XConcord-like capabilities into other programs relatively painless.

To improve speed and decrease machine load, XConcord uses the virtual memory interface provided by mmap on the Sun to access the source text. This allows the operating system to fully overlap disk transfer with searching. Asynchronous input of this kind lets machines with sufficient real memory reach very impressive effective search speeds, since the entire source text is kept in memory if possible. On machines without sufficient real memory, or in the case of very large files, the virtual memory mechanism allows performance to degrade gracefully. In the limit, performance falls to roughly what conventional file I/O could provide. Since all files are mapped read-only, no swap space is required for the searched text. Instead, the original file provides the backing store for the text.
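
The read-only mapping approach can be sketched with Python's standard mmap module. This is an illustration of the general technique only, not XConcord's code; the function name find_offsets is hypothetical, and the built-in find method stands in for the program's own search routine.

```python
import mmap
import os


def find_offsets(path: str, needle: bytes) -> list[int]:
    """Return the byte offset of every occurrence of needle in the file."""
    offsets = []
    if not needle:
        return offsets
    with open(path, "rb") as f:
        # An empty file cannot be mapped, and cannot contain a match.
        if os.fstat(f.fileno()).st_size == 0:
            return offsets
        # Length 0 maps the whole file. ACCESS_READ makes the mapping
        # read-only, so pages are backed by the file itself rather than
        # by swap, and the OS pages data in only as the search reaches it.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(needle, pos + 1)
    return offsets
```

The search code never issues explicit reads or manages buffers. If the file fits in real memory, repeated searches run against cached pages; if it does not, the operating system evicts and reloads pages as needed.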

Currently, XConcord supports all of the common coding schemes and the most popular input methods for Chinese and Japanese. We plan to add the ability to handle and display text encoded using Unicode as the newer versions of our multi-lingual text widget stabilize.

Some of the input types supported are:

  • Arabic
  • Armenian
  • Cyrillic
  • Ethiopic
  • Farsi/Pashto
  • Georgian
  • Hebrew
  • Japanese
  • Korean
  • Lao
  • Latin-1
  • SerboCroat
  • Simplified Chinese
  • Thai
  • Traditional Chinese
  • Vietnamese


Last Modified: 11:30am MDT, July 25, 1996