Character Classes

Legacy Character Classes

Character classes as defined in ANSI C and POSIX aim to provide a simple means of determining the properties of an arbitrary character. The properties that we can analyze are the character's functional use in text (punctuation, alphabetic, digit, space, etc.), relational properties ([=c=], case-insensitive comparisons), which the locale mechanism lets us make specific to a language convention, orthographic semantics (uppercase, lowercase), and the character's address range (printable, ASCII).

A software developer working with Ethiopic text will want the same utilities for character analysis as he or she had for single-byte text. Unicode-aware resources will allow for generic character tests so that script-specific functions can be avoided. For example, a test for an Ethiopic punctuation character might take the form:

if ( isrange ( ch, "ethiopic" ) && ispunct ( ch ) ) { ...

and avoids the need for an "isEthiopicPunct" equivalent.
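
The same composition of a range test with a generic property test can be sketched in Python. This is only an illustration: `is_range`, `is_punct`, and the `SCRIPT_RANGES` table are hypothetical names introduced here, and the table assumes that the basic Ethiopic block (U+1200 to U+137F) is the only range of interest.

```python
import unicodedata

# Assumed script table: only the basic Ethiopic block is listed here.
SCRIPT_RANGES = {"ethiopic": (0x1200, 0x137F)}

def is_range(ch: str, script: str) -> bool:
    """Hypothetical stand-in for isrange(): is ch inside the named script's block?"""
    low, high = SCRIPT_RANGES[script]
    return low <= ord(ch) <= high

def is_punct(ch: str) -> bool:
    """Generic punctuation test: any Unicode general category starting with 'P'."""
    return unicodedata.category(ch).startswith("P")

def is_ethiopic_punct(ch: str) -> bool:
    """Compose the two generic tests instead of writing a script-specific one."""
    return is_range(ch, "ethiopic") and is_punct(ch)
```

Here the Ethiopic wordspace ፡ (U+1361) passes both tests, while a syllable such as ለ fails the punctuation test and an ASCII comma fails the range test.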

While progress is being made towards Unicode-based and multilingual-aware pattern matching tools, existing tools by and large remain oriented towards the properties of western, Latin-based scripts. In this section we will look at extending the character class paradigm to accommodate needs particular to Ethiopic script and languages, while trying to maintain a perspective towards syllabic writing systems in general.

Syllabic Classes

The first simple extension to the standard character class lexicon would be the introduction of "[:syllable:]" as the logical analog to "[:alpha:]". "[:syllable:]" would match any character with a syllabic property; it is left to the user to restrict the syllabary of interest.

When we looked at Ethiopic character entity names we had our first introduction to the concept of Ethiopic character classes. In fact the naming convention proposed is built upon recognizing the "rows" and "columns" of the traditional syllabary with discretely named terms. It is a common problem to want to detect either the row or column property of an arbitrary Ethiopic character, as we will examine in more detail shortly. We can employ row and column names to define additional POSIX style character classes as follows:

Pattern	Expansion	Description
[:ላዊ:] (or [:lawi:])	[ለ-ሏ]	Match any syllable in the "Lawi" family (ለ through ሏ).
[:ካፍ:] (or [:kaf:])	[ከ-ኮኰኲ-ኵ]	Match any syllable in the "Kaf" family (ከ through ኵ, excluding undefined addresses in between; all entity names beginning with "kaf-").
[:ግዕዝ:] (or [:geez:])	[ሀለሐመሠረሰሸቀቐበቨተቸኀነኘአከኸወዐዘዠየደዸጀገጘጠጨጰጸፀፈፐ]	Match any ግዕዝ (first) form syllable.
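
As a rough sketch of how such named classes could be used with an existing engine, the tokens can be rewritten into ordinary bracket expressions before the pattern is compiled. The expansions below are copied from the table; the `SYLLABIC_CLASSES` dictionary, its transliterated keys, and the `expand_classes` helper are assumptions made for this illustration, not part of any existing library.

```python
import re

# Expansions copied from the table above; the keys are the transliterated
# class names and are an assumption of this sketch.
SYLLABIC_CLASSES = {
    "lawi": "ለ-ሏ",
    "kaf": "ከ-ኮኰኲ-ኵ",
    "geez": "ሀለሐመሠረሰሸቀቐበቨተቸኀነኘአከኸወዐዘዠየደዸጀገጘጠጨጰጸፀፈፐ",
}

CLASS_TOKEN = re.compile(r"\[:(\w+):\]")

def expand_classes(pattern: str) -> str:
    """Rewrite each known [:name:] token as an explicit bracket expression."""
    def repl(match):
        name = match.group(1)
        if name not in SYLLABIC_CLASSES:
            return match.group(0)  # leave unknown tokens untouched
        return "[" + SYLLABIC_CLASSES[name] + "]"
    return CLASS_TOKEN.sub(repl, pattern)
```

With this helper, `expand_classes("[:lawi:]+")` yields a pattern that an unmodified regular expression engine can compile.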

This approach maintains a Ge'ez script perspective and requires the definition of a considerable number of pattern matching terms meaningful only in the applicable address range. The approach can be generalized for use with other syllabaries as follows:

Pattern	Expansion	Description
[:ለ:]	[ለ-ሏ]	Match any character in the same family with ለ. Likewise for [:ሉ:], [:ሊ:], [:ላ:], etc.
[:ከ:]	[ከ-ኮኰኲ-ኵ]	Match any character in the same family with ከ. Likewise for [:ኩ:], [:ኪ:], [:ካ:], etc.
(isrange("ethiopic") && [#1#])	[ሀለሐመሠረሰሸቀቐበቨተቸኀነኘአከኸወዐዘዠየደዸጀገጘጠጨጰጸፀፈፐ]	Match any first form syllable in the Ethiopic context.

Note that in the final pattern for detecting a first form syllable some means was required to restrict the context to Ethiopic script so as not to match a first form syllable from any syllabary. The percentage sign, "%", is used here to specify the syllabic context by employing the "modulo" meaning of the symbol.
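
The generalized classes can be approximated with code point arithmetic rather than with per-script name tables. The sketch below assumes that each family occupies an aligned row of eight code points, which holds for the regular rows of the Unicode Ethiopic block; the function names and the bounds of the syllable area are assumptions of this example. Labialized series such as ኰ sit in rows of their own, so this shortcut does not reproduce the "Kaf" expansion from the table and would report ኰ as a first form.

```python
import unicodedata

# Assumed bounds of the Ethiopic syllable area (U+1200..U+135A).
ETHIOPIC_SYLLABLES = range(0x1200, 0x135B)

def in_ethiopic(ch: str) -> bool:
    """Restrict a test to the Ethiopic context."""
    return ord(ch) in ETHIOPIC_SYLLABLES

def form_of(ch: str) -> int:
    """Column (form) number 1..8, read from the low three bits of the code point."""
    return (ord(ch) & 7) + 1

def family_of(ch: str) -> str:
    """Assigned characters that share ch's aligned row of eight code points."""
    base = ord(ch) & ~7
    row = (chr(base + offset) for offset in range(8))
    return "".join(c for c in row if unicodedata.name(c, ""))

def is_first_form(ch: str) -> bool:
    """First-form test that only succeeds inside the Ethiopic context."""
    return in_ethiopic(ch) and form_of(ch) == 1
```

The context restriction matters: the form arithmetic alone would also accept unrelated characters whose code points happen to end in three zero bits, such as the Latin letter "h".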

Locale Based Equivalence Classes

In the strictest sense no two members of the Ethiopic syllabary would have the same phonemic value. The presence of ኧ (U+12A7) is the telltale indicator that this is not quite the case. As a consequence of the phonemic decay of many Ethiopic syllographs, spelling correctness in Ethiopian and Eritrean languages works on the notion of proximity. While each language may recognize a canonical spelling for a given word, a rendering may still be regarded as "correct" or acceptable based on its orthographic distance from the canonical rendering. Further, as with American and British English spellings, the canonical spellings are also allowed to change for the same word across national or linguistic borders.

To handle these conventions, pattern matching software needs to be made aware of the localized rules. A demonstrative sampling is offered in our next table:

Equivalence Class	Locale	Expansion	Comment
[=ሃ=]	Amharic	[ሀሃሐሓኀኃኻ]	Also needed for [=ሁ=], [=ሂ=], [=ሃ=], etc.
	Tigrigna (Et)	[ሀሃኀኃ]	ሐ and ኸ series are different phonemes in Tigrigna.
	Tigrigna (Er)	[ሀሃ]	Redundant ኀ series is dropped.
[=ሰ=]	Amharic / Tigrigna (Et)	[ሰሠ]	Also needed for [=ሱ=], [=ሲ=], [=ሳ=], etc.
[=ሰ=]	Tigrigna (Er)	[ሰ]	Redundant ሠ series is dropped.
[=ቈ=]	All	[ቆቈ]	Example: ቈነሰ vs ቆነሰ.
[=ቍ=]	All	[ቁቍ]	Example: ቍጥር vs ቁጥር.
[=አ=]	Amharic	[አኣዐዓ]	Also needed for [=ኡ=], [=ኢ=], [=አ=], etc.
	Tigrigna (Et)	[አኣ]	አ and ዐ series are different phonemes in Tigrigna.
	Tigrigna (Er)	[አ]	አ is not interchangeable with ኣ though phonetically equivalent.
[=ኮ=]	All	[ኮኰ]	Example: መኮንን vs መኰንን.
[=ጎ=]	All	[ጎጐ]	Example: ጎንዳር vs ጐንዳር.
[=ጸ=]	Amharic / Tigrigna (Et)	[ጸፀ]	Also needed for [=ጹ=], [=ጺ=], [=ጻ=], etc.
[=ጸ=]	Tigrigna (Er)	[ጸ]	Redundant ፀ series is dropped.

Eritrean conventions are based on the conventions taught in primary education since 1991. Ethiopian Tigrigna stresses the same conventions but is more forgiving in the use of the redundant syllabic series.

The table demonstrates a sample of useful character classes and how they would vary with the locale setting. With collectively over 80 languages in Eritrea and Ethiopia, the table is not intended to be comprehensive but demonstrative with the most familiar classes. Notably, a class to fold all ግዕዝ and ራዕብ forms would be desirable when working with the southern languages of Ethiopia, where many of the classes shown above would not be applicable. It is also worth noting that Ge'ez and Ari may share character classes while not sharing character phonemes. This helps highlight the separation of spoken language from orthography, in this case without consequence to pattern matching, and is certainly the exception and not the rule, as we will now see.

To demonstrate the classes we can consider the case of the Ethiopian Tafari Mekonnen, who worked his way through a military career up to the commander rank of "Ras". His life took a turn to the orthographically more complex when in 1930 Ras Tafari became Emperor of Ethiopia and assumed the coronation name Haile Selassie I. "Tafari Mekonnen" had only two possible spellings while "Haile Selassie" has numerous (not to imply for a moment, though, that HIM would have used anything but the canonical form). We'll look at these possibilities along with the female name ዓለምፀሐይ and its many spellings.

Locale	[=አ=]ለም[=ጸ=][=ሃ=]ይ				.	[=ሃ=]ይለ [=ስ=]ላ[=ሴ=]
Amharic	አለምጸሀይ አለምጸሃይ አለምጸሐይ አለምጸሓይ አለምጸኀይ አለምጸኃይ አለምጸኻይ	አለምፀሀይ አለምፀሃይ አለምፀሐይ አለምፀሓይ አለምፀኀይ አለምፀኃይ አለምፀኻይ	ኣለምጸሀይ ኣለምጸሃይ ኣለምጸሐይ ኣለምጸሓይ ኣለምጸኀይ ኣለምጸኃይ ኣለምጸኻይ	ኣለምፀሀይ ኣለምፀሃይ ኣለምፀሐይ ኣለምፀሓይ ኣለምፀኀይ ኣለምፀኃይ ኣለምፀኻይ		ሀይለ ሥላሤ ሀይለ ሥላሴ ሀይለ ስላሤ ሀይለ ስላሴ	ሃይለ ሥላሤ ሃይለ ሥላሴ ሃይለ ስላሤ ሃይለ ስላሴ	ሐይለ ሥላሤ ሐይለ ሥላሴ ሐይለ ስላሤ ሐይለ ስላሴ	ሓይለ ሥላሤ ሓይለ ሥላሴ ሓይለ ስላሤ ሓይለ ስላሴ
Amharic	ዐለምጸሀይ ዐለምጸሃይ ዐለምጸሐይ ዐለምጸሓይ ዐለምጸኀይ ዐለምጸኃይ ዐለምጸኻይ	ዐለምፀሀይ ዐለምፀሃይ ዐለምፀሐይ ዐለምፀሓይ ዐለምፀኀይ ዐለምፀኃይ ዐለምፀኻይ	ዓለምጸሀይ ዓለምጸሃይ ዓለምጸሐይ ዓለምጸሓይ ዓለምጸኀይ ዓለምጸኃይ ዓለምጸኻይ	ዓለምፀሀይ ዓለምፀሃይ ዓለምፀሐይ ዓለምፀሓይ ዓለምፀኀይ ዓለምፀኃይ ዓለምፀኻይ		ኀይለ ሥላሤ ኀይለ ሥላሴ ኀይለ ስላሤ ኀይለ ስላሴ	ኃይለ ሥላሤ ኃይለ ሥላሴ ኃይለ ስላሤ ኃይለ ስላሴ	ኻይለ ሥላሤ ኻይለ ሥላሴ ኻይለ ስላሤ ኻይለ ስላሴ
Tigrigna (Eritrea)		አለምጸሀይ አለምጸሃይ				ሀይለ ስላሴ	ሃይለ ስላሴ
Tigrigna (Ethiopia)	አለምጸሀይ አለምጸሃይ አለምጸኀይ አለምጸኃይ	አለምፀሀይ አለምፀሃይ አለምፀኀይ አለምፀኃይ	ኣለምጸሀይ ኣለምጸሃይ ኣለምጸኀይ ኣለምጸኃይ	ኣለምፀሀይ ኣለምፀሃይ ኣለምፀኀይ ኣለምፀኃይ		ሀይለ ሥላሤ ሀይለ ሥላሴ ሀይለ ስላሤ ሀይለ ስላሴ	ሃይለ ሥላሤ ሃይለ ሥላሴ ሃይለ ስላሤ ሃይለ ስላሴ	ኀይለ ሥላሤ ኀይለ ሥላሴ ኀይለ ስላሤ ኀይለ ስላሴ	ኃይለ ሥላሤ ኃይለ ሥላሴ ኃይለ ስላሤ ኃይለ ስላሴ
Ge'ez	አለምጸሀይ አለምጸሃይ		ኣለምጸሀይ ኣለምጸሃይ			ሀይለ ስላሴ	ሃይለ ስላሴ

To be certain, while the renderings shown are logically possible they are not all necessarily probable, though the character classes used are entirely valid. It should be emphasized also that the renderings shown for languages following Amharic do not indicate the acceptable spellings in those languages but demonstrate how the pattern matching outcome would change with the corresponding locales. Indeed when searching for the same terms in a document known to be in the language indicated a matching pattern appropriate for the language would be applied.
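
The effect of the locale on an equivalence class pattern can be sketched by expanding each [=x=] token from a per-locale table. The locale identifiers ("am", "ti_ET", "ti_ER"), the `EQUIVALENTS` table, and both helper functions are assumptions of this example. The ሃ entries are taken from the equivalence class table; the ስ and ሴ entries apply the ሰ/ሠ rule to the sixth and fifth forms, as the table's comment suggests.

```python
import itertools
import re

# Assumed per-locale expansions, restricted to the classes used in the
# "Haile Selassie" pattern.
EQUIVALENTS = {
    "am":    {"ሃ": "ሀሃሐሓኀኃኻ", "ስ": "ስሥ", "ሴ": "ሴሤ"},
    "ti_ET": {"ሃ": "ሀሃኀኃ", "ስ": "ስሥ", "ሴ": "ሴሤ"},
    "ti_ER": {"ሃ": "ሀሃ", "ስ": "ስ", "ሴ": "ሴ"},
}

CLASS_TOKEN = re.compile(r"\[=(.)=\]")

def to_regex(pattern: str, locale: str) -> str:
    """Replace each [=x=] token with the bracket expression for the locale."""
    table = EQUIVALENTS[locale]
    return CLASS_TOKEN.sub(
        lambda m: "[" + table.get(m.group(1), m.group(1)) + "]", pattern
    )

def spellings(pattern: str, locale: str) -> list:
    """Enumerate every literal spelling the pattern admits in the locale."""
    table = EQUIVALENTS[locale]
    # split() with one capture group alternates literal text and class letters.
    parts = CLASS_TOKEN.split(pattern)
    choices = [list(table.get(p, p)) if i % 2 else [p]
               for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

Under these assumptions the pattern [=ሃ=]ይለ [=ስ=]ላ[=ሴ=] expands to 28 spellings for Amharic, 16 for Ethiopian Tigrigna, and 2 for Eritrean Tigrigna, which agrees with the number of renderings listed in the corresponding rows of the table.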

The Syllabic Constraint

Language morphology relies on derivational rules that treat consonants and vowels as dissociated entities, and we expect to be able to apply these rules to written language. In an open syllabary the two (consonant and vowel) are fused together, so a developer is driven to seek out or create tools to isolate these character properties. Regular expression resources are now indispensable in this field but are somewhat cumbersome when applied to syllabaries. The C-V property of a syllable corresponds directly to the rows and columns of the syllabary itself. The syllabic and form classes allow us to match a single character as a member of a group. This is analogous to folding case ([:ለ:]) or specifying a particular case ([#4#]). However, a limitation appears when we want to specify the intersection of the two. An anticipated pitfall would be to attempt:

[መበቀ][#2,4-7#]

which matches two characters in sequence and not a member of the intersection:

[ሙማ-ሞቡባ-ቦቁቃ-ቆ]

A convenient solution is to apply the same logic of the syllabic form matching class in constraint notation.

[ሙማ-ሞቡባ-ቦቁቃ-ቆ] --> [መበቀ]{#2,4-7#}

or in the negative expression:

[መበቀ]{^#1,3,8-#}
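
One way to picture the constraint is as a preprocessing step that expands a family list and a form list into the explicit intersection. The sketch below is an assumption-laden illustration: `parse_forms` and `constrain` are hypothetical helpers, forms are numbered 1 to 8, and each family is assumed to occupy an aligned row of eight code points, so labialized series and unassigned eighth forms are not handled.

```python
def parse_forms(spec: str) -> set:
    """Parse a form list such as "2,4-7" or "^1,3,8-" into a set of form numbers."""
    negate = spec.startswith("^")
    forms = set()
    for item in spec.lstrip("^").split(","):
        low, dash, high = item.partition("-")
        start = int(low)
        # An open-ended range such as "8-" runs to the last form (8).
        end = int(high) if high else (8 if dash else start)
        forms.update(range(start, end + 1))
    return set(range(1, 9)) - forms if negate else forms

def constrain(members: str, spec: str) -> str:
    """Expand each family member into the requested forms of its row."""
    forms = sorted(parse_forms(spec))
    return "".join(
        chr((ord(ch) & ~7) + form - 1) for ch in members for form in forms
    )
```

Both `constrain("መበቀ", "2,4-7")` and the negative form `constrain("መበቀ", "^1,3,8-")` produce the same fifteen syllables that the explicit class [ሙማ-ሞቡባ-ቦቁቃ-ቆ] matches.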

Applied in a small practical example we can develop an expression for the detection of the basic Amharic plural. We can start by defining a word stem as a sequence of syllables (assumed in the Ethiopic context):

$stem = "[:syllable:]+";

Without a constraint on the suffix syllable, each suffix form must be listed explicitly and the plural pattern becomes:

/^$stem(({#4#}[ቱታቴት])|({#7#}[ቹቻቼች]))/

and with the expressive power of a syllabic constraint condenses nicely to:

/^$stem(({#4#}ት)|({#7#}ች)){#2,4-6#}/

This is of course a very simple example, and it is intended that the power of the operator be evident from visual inspection by those only casually acquainted with regular expression syntax. Applied to very large, real-world text and natural language processing problems, we can expect the operator to become an indispensable member of the regular expression toolbox.
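
Until an engine supports the constraint operator directly, the plural pattern above can be approximated by generating the form classes and compiling an ordinary expression. This is a minimal sketch under stated assumptions: `form_class` is a hypothetical helper, the syllable range U+1200 to U+135A stands in for [:syllable:], and form numbers are read from the low three bits of the code point, so labialized rows are treated as ordinary rows.

```python
import re
import unicodedata

def form_class(form: int) -> str:
    """Bracket expression for every assigned Ethiopic syllable in the given column."""
    chars = [chr(cp) for cp in range(0x1200, 0x135B)
             if (cp & 7) + 1 == form and unicodedata.name(chr(cp), "")]
    return "[" + "".join(chars) + "]"

# Stand-in for [:syllable:], restricted to the assumed Ethiopic syllable area.
SYLLABLE = "[\u1200-\u135A]"

# A stem ending in a 4th-form syllable followed by a ት-family suffix, or a
# stem ending in a 7th-form syllable followed by a ች-family suffix.
PLURAL = re.compile(
    "^" + SYLLABLE + "*(?:"
    + form_class(4) + "[ቱታቴት]|"
    + form_class(7) + "[ቹቻቼች])"
)
```

As an illustration chosen for this sketch, `PLURAL.match` accepts ቃላት and ቤቶች but rejects the singular ቤት, because its first syllable is neither a fourth nor a seventh form.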