A small corpus drawn from a local newspaper (˗Pamɛbhamɛ) and from medical counsels (chapters) of While waiting for a medical doctor, translated into Eastern Dan. This corpus description also attempts a modest application of the principles set out by Martin Hosken for Writing System Descriptions.39
- ISO 639-3 language tag: [dnj]
- Language Name: Dan
- Main location of language use: Ivory Coast (Côte d'Ivoire)
- Language variety demonstrated in this corpus: Eastern Dan
- Script: Latin script.
- Sociological-dynamics of writing: Dan has been written for at least 40 years (1978)1. Written tradition has been heavily influenced by French, according to how French is written in Côte d'Ivoire.
- Main user base: Of approximately 1.65 million Dan users, 650,000 are users of Eastern Dan.2
- Multi-lingualism: A high percentage of Dan users are multi-lingual in French [fra] (oral and written competencies) and Interethnic Jula [dyu] (oral); some have competencies in Guéré [wec] (oral) and Wobé [wob] (oral).33
- Digital language use: Some digital language use has been noted in the past (2008). Some digital use in SMS and Facebook is expected.
- Digital Support Infrastructure: None. (Locale data in CLDR, keyboard layout, spell check, text-to-voice, voice-to-text, part-of-speech tagging, etc.)
Language Note: Dan is considered by some to be a macro language comprised of a dialect chain of over 40 dialects 3,4. As recently as 2012 the ISO 639-3 registrar approved a request (2012-083)5 to split one of these dialects off into its own language (Kla [lda]). Eastern and Western Dan have had their own separate writing traditions for over 40 years. There are significant segmental and suprasegmental differences between Eastern and Western Dan.
Script Note: There may be several orthographies from different dialects which would all qualify as BCP47 6: dnj_Latn_CI. Crúbadán language data for Eastern Dan uses dnj-x-east 7, but it is unclear if that corpus is based on the same orthography as this corpus (orthography version 3), even if the language content is from the same language variety.
Font Note: It has been Hugh's professional experience that in many cases fonts used to encode minority languages often fail to include two very important features. The first is that some classes of diacritics and characters do not combine elegantly for users. For instance: 〈◌̊〉 U+030A 'COMBINING RING ABOVE', does not elegantly combine with 〈🦄〉 U+1F984 'UNICORN FACE' to allow users to put a ring on the unicorn's horn‽ The second case impacts the fluidity of grammatical expression by minority language users. Most fonts don't support 〈‽〉 U+203D 'INTERROBANG'.
Text Rendering Note: It appears that many fonts do not successfully render some glyphs from the Dan orthography. This is especially noticeable with regard to two sets of glyphs: first, 〈Ʋ̈, ʋ̈〉 'LATIN LETTER V WITH HOOK + COMBINING DIAERESIS'; second, 〈˗〉 U+02D7 'MODIFIER LETTER MINUS SIGN' and 〈꞊〉 U+A78A 'MODIFIER LETTER SHORT EQUALS SIGN'. The issue with the Latin letter V with hook is that generally the height of the base character (when it is supported in fonts) is set too high for the line height to accommodate the combining diaeresis on top of the base character. Font rendering engines then push the combining diaeresis to the right. Default fonts in web browsers are particularly susceptible to this. The second issue is that 〈꞊〉 U+A78A and 〈˗〉 U+02D7 are set to display at half the vertical height of lower case letters. However, it is often the case that these glyphs are rendered adjacent to uppercase letters. This gives the visual effect that the modifier letters are too low, or too small for practical use. CharisSIL and DoulosSIL (the Unicode compliant versions) do render all glyphs correctly. These fonts can be used as embedded fonts, but it would be nice if professional font makers would enable Dan users (and other minority language users) to have a variety of typeface options.
Image provided by Ian Douglas, rendered in LibreOffice
A list with examples of successful fonts is provided in dnj-Font-Face/dnj-fra-successful-rendering-fonts.pdf. Contribution by Ian Douglas.
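One reason the V with hook plus diaeresis depends so heavily on font support is that Unicode has no precomposed character for it, so the diaeresis always remains a separate combining mark. The following minimal sketch, using only Python's standard unicodedata module, contrasts it with 〈ë〉, which does have a precomposed form. The helper name code_points is introduced here for illustration only.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Decomposed sequences: base letter followed by U+0308 COMBINING DIAERESIS.
e_diaeresis = "\u0065\u0308"       # e + combining diaeresis
v_hook_diaeresis = "\u028B\u0308"  # v with hook + combining diaeresis

# NFC composes a sequence only when Unicode defines a precomposed character.
e_nfc = unicodedata.normalize("NFC", e_diaeresis)
v_nfc = unicodedata.normalize("NFC", v_hook_diaeresis)

print(code_points(e_nfc))  # a single precomposed code point
print(code_points(v_nfc))  # still base letter plus combining mark
```

Because the second sequence stays as two code points under every normalization form, correct display relies entirely on the font positioning the mark over the base glyph.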
Orthography Note: It can be, and is in fact the case in Dan, that there are multiple writing systems for different speech varieties of the same ISO 639-3 designated language simultaneously. That is, separate groups (sociological, or dialectal, or both) are writing the same "language" in different ways at the same time, and these separate groups have iterated the way they write their varieties over time.
Developmental Note: Based on the narrative developed in the literature, evolutionary steps in the development of community literacy - including the progressive refinement of the orthography — taken under the mentors Margrit Bolli & Eva Flik generally focused on Western Dan first and then soon after or simultaneously was adapted to Eastern Dan. A distinct narrative for Eastern Dan, independent from Western Dan does not appear until 1982. However, some literacy was happening in Eastern Dan under their mentorship as early as 1972.
Version | Date | Evolutionary steps | Mentor/Artist | Reference |
---|---|---|---|---|
Version 0.1a | pre-1970 protestant | Imported from Liberia | Mission Biblique | R & V Forthcoming8. |
Version 0.1b | pre-1970 catholic | concurrent with but separate from version 0.1a | Roman Catholic Church | R & V Forthcoming9. |
Version 0.2 | pre-1972 | high tone is marked at the beginning of the word with an apostrophe | Margrit Bolli / Eva Flik | Margrit Bolli37. |
Version 0.3 | 1974 | ?? | Margrit Bolli / Eva Flik | Tiémoko Sébastien Baba 10 (reader; no orthography statement) R & V Forthcoming11 |
Version 0.4 | 1978 | full stop 〈.〉 is at the beginning of words to indicate low tone, 〈ô〉is used, 〈.CVV'-〉 is a tone pattern used to indicate low-mid-fall | Margrit Bolli / Eva Flik | Marking tone with Punctuation38 (In this resource the author does not indicate if they are discussing Eastern Dan, Western Dan, or both. In the 1982 version of the Western Dan reading primer the word final apostrophe hyphen sequences is present.) |
Version 1 | 1982-1990 | No indication of full stop 〈.〉 usage at the beginning of words. No indication of word final apostrophe hyphen sequences 〈CVV'-〉. | Margrit Bolli / Eva Flik | Bolli & Flik12(Transitional Primer) |
Version 2 | 1994 | The start of using double U+0022 at the end of words appears in a course book for learning to read. The letters 〈ɩ〉, 〈ʋ̈〉, 〈ʋ〉 appear, which did not appear in orthography version 1. | Margrit Bolli / Eva Flik | Bolli & Flik13 (Transitional Primer) |
Western Dan | 2000 | In Western Dan Biblical text preprints (for community circulation) use U+2013 instead of U+002D to indicate tone. (Forever muddling which character is correct in all future writing.) | Margrit Bolli / Eva Flik | See Ruth14 and Jonah15 Published in 2000. |
Version 3 | (2005??)-2014 | These texts contain U+201C, U+201D, and U+0022 as tone markers before and after words. (It might have been the idea that only U+0027 would be used twice and that human input habits chose to input U+0022 as a quicker step, and then word processing software auto-corrected some of these to U+201C, and U+201D) | Margrit Bolli/Valentin Vydrin | This corpus is representative of this stage in the orthography. |
Version 4 | 2014-2017+ | There are significant changes to vowel and tone markers: in general, away from digraphs towards single graphemes, and away from pre- and post-stem tone indication via punctuation towards diacritic indication of tone over the stem. | Valentin Vydrin | Roberts, Brown, Vydrin Forthcoming16, R & V Forthcoming17, V & R Forthcoming18 |
The data has two states.
- As first received from sources (as original files and as the consolidation of the extracted text from those original files): initial-starting-corpus.txt
- As finally processed for use in keyboard layout analysis: proof-of-concept-text.txt, phonemic-corpus.txt
The reason for these two states is to faithfully represent the corpus as it was originally received. It is felt that this state most faithfully represents the text processing and publishing "natural language use" which Dan Language users encounter. However, to do the keyboard optimization, it is important to look at the intended characters that language users thought they were using. It is quite evident that automation has changed a great deal of the intended characters into something unintended. This intended state is what is needed to optimize a keyboard layout.
- BCP47: dnj_Latn_CI
(But this tag needs to take into account the following two points and doesn't.)
- Eastern Dan
- Orthography version: 3
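One way a tag could carry the variety and the orthography version is through BCP 47 private-use subtags, as the Crúbadán tag dnj-x-east already does for the variety. The sketch below is only an illustration: the subtag names east and orth3 and the function private_use_tag are assumptions made here, not an agreed convention, and the tag is written with hyphens because that is the separator BCP 47 syntax uses.

```python
import re

def private_use_tag(base, *extensions):
    """Append private-use subtags (introduced by 'x') to a base language tag."""
    for sub in extensions:
        # BCP 47 private-use subtags are 1 to 8 ASCII letters or digits.
        if not re.fullmatch(r"[A-Za-z0-9]{1,8}", sub):
            raise ValueError(f"invalid private-use subtag: {sub!r}")
    if not extensions:
        return base
    return "-".join([base, "x", *extensions])

# Hypothetical subtags for "Eastern Dan" and "orthography version 3".
tag = private_use_tag("dnj-Latn-CI", "east", "orth3")
print(tag)
```

Private-use subtags are only meaningful by private agreement, so any such tag would need to be documented alongside the corpus.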
Writing System Note: When orthography version 3 was established, the target technology for implementation of the writing system was French typewriters.34 As technology advanced (the advent of Unicode), the indication of tone often became confusing. It is confusing in the sense that the characters most frequently chosen by Dan authors carry the Unicode attributes for punctuation, and it is these characters before or after the stem (word) that indicate the pitch melody of the orthographic word. As encoded in the original documents for this corpus, these characters are not used in the ways their Unicode attributes would lead software to expect. There are Unicode characters which have the same visual characteristics but letter attributes instead of punctuation attributes. These letter characters are recommended as a best practice in orthography development.36 However, enabling Dan writers to encode their language with the most appropriate Unicode characters has been a challenge. As a result many applications do not properly typeset or interact with Dan "words" in the ways that many users of "global" languages expect. This, together with the influence of French writing norms, has resulted in the evolution of a unique print media culture for users of Dan. From observing the corpus, five notable and previously undiscussed instances present themselves:
- The use of space around proper punctuation marks is not always as one would expect for an orthography written in a Latin script. That is, it is not uncommon to see something like ˮban˗ ? ꞊Yaa˗ where there are extra spaces around the question mark. Presumably this is to provide visual clarity for mental processing of punctuation marks.
- While French allows for an apostrophe in the middle of words to show elision (qu'en), Dan does not. In fact Dan, to the best efforts given the knowledge available, does not need to use the apostrophe and uses the glyph to indicate tone, something much different than the use dictated by French. In the corpus, there are cases where a space follows an apostrophe in French words, indicating that at some level mixed language texts are typographically being processed as Dan language texts.
- The hyphen in French can take on several linking usages:
- It can connect morphology (celui-ci) or parts of speech (infinitive + pronoun: aide-moi)
- It can occur in set expressions like vis-à-vis
- It can occur in hyphenated names like Jean-Luc
Dan, however, does not have these same typographical liberties with the usage of the hyphen glyph, which indicates tone. There are several cases in the corpus where it was observed that a space was not separating the hyphen from two (otherwise distinct) words. Judgment calls were made to insert spaces to fix this in the final corpus used in analysis.
- Similar visually to the hyphen is the dash. The dash, at least in French typographical tradition, is set off with spaces on each side. 'EN DASH' is observed in the corpus. In fact sometimes it is observed with spaces surrounding it, but so is the hyphen (and sometimes these are in the same phrases). Therefore it is really difficult (no doubt for native writers and readers too) to determine whether a dash is typographically indicating a dash or a tone mark. In the French typographic tradition dashes can serve several functions:
- It can enumerate the elements of a list
- It can emphasize a comment
- It can indicate a change of speaker.
It is not clearly laid out how Dan writing system(s) (1978, 1982, 1994, 2000, 2014) handle these functions in print media. One possibility is to use a rounded glyph like a bullet for some of these functions (though the actual future of this need is in question as orthography version 3 is potentially giving way to version 4). Pedagogically punctuation, especially for discourse functions (typically beyond the simple sentence), should likely become part of the training provided in Dan literacy programs. In the past a deconstructionist approach35 highlighting the differences between French and Dan, has been taken for users of French learning to read Dan. This approach has been successful. Perhaps the same approach with a learning unit on word boundaries and discourse level punctuation, would increase the confidence and clarity of Dan writers.
- Typographically expressing more than one language in a document is confusing to authors. Some authors, when writing in Dan and referencing a French word, will put the word in parentheses; other authors use typeface to distinguish languages; and at least one instance was found of using English style smart quotes to set off French words. All of these strategies preserve the use of French quotes (commonly called 'quotes') for direct speech. The evolution of print media and of typographic tradition in Eastern Dan (and in other languages which often generate multi-lingual documents, especially if they use punctuation to indicate tone) would benefit from a standardized method of indicating a language change (code switch) within the document. One possibility would be the introduction in the curriculum of other uses for quote marks.
- The use of French style quote marks 〈«〉, 〈»〉 is confusing to Dan authors. That is, opening and closing quote marks appear to be used interchangeably in opening quotations. Additionally, there are quite a few cases where closing quote marks are missing. If software engineers for grammar and spelling checkers can manage, adding a function which checks for closing quote marks (of any kind), much like is done for programmers in IDEs, would benefit many new writers of minority languages.
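The closing-quote check suggested in the last point can be sketched in a few lines. This is a minimal illustration, assuming that quotations open with 〈«〉, close with 〈»〉, and do not span lines; the function name unclosed_guillemets is hypothetical, and a real checker would also need to handle the interchangeable usage observed in the corpus.

```python
OPENING = "\u00AB"  # «
CLOSING = "\u00BB"  # »

def unclosed_guillemets(line):
    """Return sorted (index, character) pairs for guillemets without a partner."""
    problems = []
    open_positions = []  # indexes of « still waiting for a matching »
    for i, ch in enumerate(line):
        if ch == OPENING:
            open_positions.append(i)
        elif ch == CLOSING:
            if open_positions:
                open_positions.pop()      # pairs with the most recent «
            else:
                problems.append((i, ch))  # » with no « before it
    # Any « left on the stack was never closed.
    problems.extend((i, OPENING) for i in open_positions)
    return sorted(problems)

print(unclosed_guillemets("\u00AB abc \u00BB"))  # balanced line
print(unclosed_guillemets("\u00AB abc"))         # missing closing mark
```

The stack-based pairing is the same approach many code editors use for brackets; reporting positions rather than a yes/no answer lets a tool underline the exact mark.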
Writing system, orthographic, linguistic, and alphabet descriptions for encoding of text in Eastern Dan version 3.
The closest thing to a formal writing system description for Eastern Dan is a 1994 community oriented reader19 which covers vowels, consonants, numbers, and punctuation. The 1994 reader improves upon a 1982 community oriented reader20 by offering sections on numbers and punctuation. However, neither book presents an alphabetic order, or an alphabet in whole (all at one time). In fact, because the readers are designed for transitional learners coming from French, the mode of comparison is to French writing. The comparisons to French writing, and the pedagogical assumptions about what Dan readers/writers already know about French, are so strong that one might ask: "is the presentation of writing in Dan 'French orthography adapted for Dan', or is it a 'unique writing system for Dan' ready to stand on its own and greet a world of writing systems"? Several forthcoming works do offer a formal linguistic description of the orthography, orthography testing, and a newly proposed orthography, but these works fail to provide details at the technical and writing system levels, focusing rather on the correspondences between linguistic units and typographical units.
In this section a short prose discussion is followed by a chart. Charts are followed by lists presented in crucial ordering for tokenization by the Python library segments.21 Note: the graphemes used here, with the exception of those recommended for special status by RFC 3986 22, are presented because they are evidenced in the corpus.
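Ordering matters for tokenization because a multi-character grapheme must be tried before the single characters it contains. The sketch below illustrates the idea with a plain greedy longest-match loop; it does not use the segments library's own API, and the function tokenize, the mini-profile, and the sample string (an invented sequence, not a Dan word) are assumptions for illustration.

```python
def tokenize(word, graphemes):
    """Split word into graphemes, trying the longest candidates first."""
    # Sorting by length puts multi-character units ahead of their parts.
    ordered = sorted(graphemes, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for g in ordered:
            if word.startswith(g, i):
                tokens.append(g)
                i += len(g)
                break
        else:
            # Character not in the profile: keep it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical mini-profile: v with hook, v with hook + diaeresis, a, and
# the double-apostrophe tone mark.
profile = ["\u028B", "\u028B\u0308", "a", "\u02EE"]
sample = "\u02EE\u028B\u0308a"  # invented string, for illustration only
print(tokenize(sample, profile))
```

If 〈ʋ〉 were tried before 〈ʋ̈〉, the combining diaeresis would be split off as a stray token, which is the kind of error a carefully ordered list avoids.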
These definitions and conventions are observed throughout this work:
- An alphabet is a list of letters used to transcribe a language. Alphabets usually have an order for pedagogical purposes, and for dictionary sorting purposes. At a technical level, SIL's NRSI23 provides this: a segmental writing system having symbols for individual sounds, rather than for syllables or morphemes. In a true alphabet, consonants and vowels are written as independent letters, in contrast to an abugida or an abjad. In a perfectly phonemic alphabet, phonemes and letters would be predictable in both directions; that is, the sound of a word could be predicted from its spelling and vice-versa. A phonetic alphabet is also predictable in this way, however it uses separate letters for separate allophones, whereas a phonemic alphabet may describe allophones of the same phoneme using a single letter.
- Letters are typographical units for the purposes of pedagogy.
- Characters are single Unicode code points.
- Graphemes are typographical units. Often in a writing system these units carry meaning.
- A multigraph (from SIL's NRSI) is a combination of two or more written symbols or orthographic characters (e.g. letters) that are used together within an orthography to represent a single sound. (Combinations consisting of two characters are also known as digraphs.)
- A digram is a sequence of two graphemes. Whereas a digraph is a sequence of two letters to indicate a single sound, a digram is any sequence of two units in an orthography, sometimes this term is used in the literature synonymously with bigram. In literature that uses the terms digram/bigram sometimes the compared units are whole words, or syllables.
- A linguistic description would include phonetic or phonological details for the characters used in the encoding of the text.
- A list of phonemes is a list of unique and contrastive sound units in a language. Many times an alphabet is based on a list of phonemes. But to the extent that two typographical characters are used together in a pattern (digraph) to indicate when co-occurring that they represent a phoneme then an alphabet might have fewer letters/components than a list of phonemes in the same language.
- A writing system description includes things like casing correspondences, usage rules for casing, punctuation characters, usage rules for punctuation marks, letters, numbers, and characters used in Internet use, with their Unicode code points used in technical encodings. A writing system description, more than just an orthography is needed to fully support a language on digital tools. It is necessary for creating a Locale description and is useful for creating a custom Keyboard layout, and other Natural Language Processing Tools.
- As laid out by Peter Constable,40 a Writing System is a superordinate category of a collection of technologies and/or metadata on how an orthography is to be implemented. The following image situates the terms and relationships around orthographies and languages.
- The following characters are used to provide special meaning to text outside of tables:
- Content within square brackets [] denotes either phonetic representations (such as allophones) or ISO 639-3 codes.
- Content within forward slashes // denotes phonemic representations.
- Content within angle brackets 〈〉 denotes orthographic or graphemic representations.
- Content within double-slashes // // or pipes | | denotes morphophonemic representations.
- In prose sections, Unicode characters will appear in the following order upon first mention: 〈‽〉 U+203D 'INTERROBANG'. A more natural prose style will be used for subsequent mentions (using any one of these three parts).
No specific casing rules are discussed.
Based on data within the corpus as originally delivered, casing rules appear to follow general French casing norms, with two noted exceptions.
- Tone marks preceding the non-tone mark portion of the word do not get capitalized, but the characters following the tone marks [a-zA-Z] do get capitalized. Yet tone marks are considered part of the word and should not have word breaks between them and the words they belong with.
- The first word of a sentence is capitalized.
- Proper nouns are capitalized.
- Unlike French where, when an article is the first word of a sentence both the first word and the second word are capitalized, in Eastern Dan only the first word is capitalized.
- Surnames are not capitalized as is the custom in French literature.
- Uppercase can be used as a style choice in titles of creative works, much as is the case in many languages, which use a Latin script.
- Only the first letter of a digraph is capitalized. i.e. 〈"Ɛa-〉 is correct whereas 〈"ƐA-〉 is not.
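The interaction between tone marks and capitalization can be illustrated with a short sketch: leading tone marks are skipped, the first remaining character is uppercased, and everything after it (including the second letter of a digraph) is left alone. The set of tone marks is assumed here to be the four modifier letters, and the function name capitalize_word is hypothetical; text that still uses keyboard punctuation for tone would need those characters added to the set.

```python
# Assumed tone marks: modifier apostrophe, minus sign, double apostrophe,
# and short equals sign.
TONE_MARKS = {"\u02BC", "\u02D7", "\u02EE", "\uA78A"}

def capitalize_word(word):
    """Uppercase the first non-tone-mark character; leave the rest unchanged."""
    for i, ch in enumerate(word):
        if ch not in TONE_MARKS:
            return word[:i] + ch.upper() + word[i + 1:]
    return word  # only tone marks, nothing to capitalize

# The digraph example from the list above, written with modifier letters.
print(capitalize_word("\u02EE\u025Ba\u02D7"))
```

Python's built-in str.capitalize() would not work here, since it acts on the first character of the string, which in Dan is often a tone mark.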
Orthographic word breaks are indicated by a space, generally U+0020. Because Eastern Dan uses characters which look like punctuation, and often the actual punctuation characters are used, it has been common practice to overcompensate to keep characters representing tone attached to the rest of the string that represents the word. This is demonstrated in the corpus, as it was originally delivered.
The use of normal text editors with the standard characters for the glyphs representing tone results in line and word breaks which are unexpected for Eastern Dan readers and writers. The solution for orthography version 3 is to use 'MODIFIER LETTER' equivalent characters for tone marks, instead of the standard characters found in many of the global languages using Latin scripts for these glyphs.
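A conversion from keyboard punctuation to modifier letters might look like the following sketch. The mapping is an assumption based on visual equivalence with the modifier letters used in this description, and only marks at the edges of space-separated tokens are converted, so that word-internal hyphens in French names are left alone. The names TO_MODIFIER, convert_token, and convert_text are hypothetical, and the approach is deliberately naive: the corpus clean-up described here involved judgment calls that a blanket rule cannot make.

```python
# Assumed mapping from keyboard punctuation to visually similar modifier letters.
TO_MODIFIER = {
    "'": "\u02BC",       # apostrophe -> modifier letter apostrophe
    "-": "\u02D7",       # hyphen-minus -> modifier letter minus sign
    '"': "\u02EE",       # quotation mark -> modifier letter double apostrophe
    "\u201C": "\u02EE",  # left smart quote (auto-corrected input)
    "\u201D": "\u02EE",  # right smart quote (auto-corrected input)
    "=": "\uA78A",       # equals sign -> modifier letter short equals sign
}
MARKS = "".join(TO_MODIFIER)

def _to_modifier(run):
    return "".join(TO_MODIFIER[ch] for ch in run)

def convert_token(token):
    """Convert tone-mark punctuation at the start and end of one token."""
    body = token.lstrip(MARKS)
    prefix = token[: len(token) - len(body)]
    core = body.rstrip(MARKS)
    suffix = body[len(core):]
    if not core:
        return token  # nothing but marks: leave untouched
    return _to_modifier(prefix) + core + _to_modifier(suffix)

def convert_text(text):
    return " ".join(convert_token(t) for t in text.split(" "))

print(convert_text('"ban- ? =Yaa-'))
```

Known limits of this sketch: a token followed directly by sentence punctuation keeps its trailing mark unconverted, and a French elision written with a space after the apostrophe would be converted by mistake.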
Word break rules are not discussed. But reading is taught with single words bounded by spaces. This occurs at the sentence level too. One may assume that tone marks should never separate from the rest of their word. It would just be weird to insert a hyphen into a word that uses a hyphen as a letter. So presumably hyphenation is not allowed in this orthography either.
Various kinds of special characters are used in the corpus as it was originally delivered, to prevent word breaks in undesired places. Sometimes 〈 〉 U+00A0 'NO-BREAK SPACE' and sometimes 〈‑〉 U+2011 'NON-BREAKING HYPHEN' was used to control line and word breaking behavior.
The readers' guide says that, in general the orthography for Dan utilizes "les mêmes signes" of punctuation of the orthography of French. Unicode version 1.0 was released in 1991, and by 1994 was at version 1.1.0. So it is highly unlikely that the authors of the literacy primers were thinking about matching their orthography symbols to Unicode characters. Unicode codepoints are provided here as an added point of reference. They are not in the source text.
Codepoint | Grapheme | Usage |
---|---|---|
U+00AB | « | les guillemets ouvrant et (tr. [eng]: opening indicator for marking a quote) |
U+00BB | » | fermant un discourse direct (tr. [eng]: closing indicator for marking a quote) |
U+0021 | ! | le point d'interrogation marque la présence d'une exclamation (tr. [eng]: following an exclamation) |
U+003B | ; | le point-virgule entrecoupe deux parties d'une longue phrase (tr. [eng]: joins two long phrases) |
U+003C | < | les guillemets simples ouvrant et (tr. [eng]: opening indicator for marking a quote inside a quote) |
U+003E | > | fermant un discourse direct placé dans un autre discourse direct (tr. [eng]: closing indicator for marking a quote inside a quote) |
U+003F | ? | le point d'interrogation marque la présence d'une question (tr. [eng]: following a question) |
U+002E | . | le point marquant la fin d'une pensée (tr. [eng]: finishing a thought) |
U+002C | , | la virgule donne l'occasion de prendre haleine (tr. [eng]: taking a breath) |
U+003A | : | le double point marque le début d'un discourse direct (tr. [eng]: marking the start of a quote) |
Based on data within the corpus, as it was originally delivered, the following punctuation marks are observed. Their usages, as far as can be determined from the corpus, are indicated in the table. (What about ˮlʼautre jourˮ?)
Codepoint | Grapheme | Usage |
---|---|---|
U+00B0 | ° | Used as part of the abbreviation for number 〈n°〉. |
U+005F | _ | Error - should be U+02D7 |
U+005B | [ | unknown |
U+005D | ] | unknown |
U+2026 | … | unknown |
U+201A | ‚ | Error - Should be U+002C |
U+002F | / | unknown |
U+00AB | « | Open a direct speech statement - Usage seems to vary between open and close. |
U+00BB | » | Closes a direct speech statement - Usage seems to vary between open and close. |
U+0021 | ! | Closes an exclamation, interjection or emphatic statement |
U+003B | ; | unknown |
U+2039 | ‹ | Opens a quote inside of a direct speech statement |
U+203A | › | Closes a quote inside of a direct speech statement |
U+003C | < | Error - Most cases are double i.e. << and should be replaced with U+00AB; other cases should be U+2039 |
U+003E | > | Error - Most cases are double i.e. >> and should be replaced with U+00BB; other cases should be U+203A |
U+003F | ? | Closes a question statement |
U+002E | . | Completes a thought, occurs between numbers. |
U+002C | , | unknown |
U+0029 | ) | Closes a parenthetical. Often a number, but sometimes a word in another language, or an alternate transcription of a name. |
U+0028 | ( | Opens a parenthetical. Often a number, but sometimes a word in another language, or an alternate transcription of a name. |
U+003A | : | unknown |
U+002B | + | Precedes a telephone number to indicate country code, used to conjoin thoughts. eg. xH-tone + Mid-tone |
°
_
[
]
…
‚
/
»
«
!
;
‹
›
<
>
?
.
,
)
(
:
+
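The rows marked "Error" in the table above amount to a small set of substitutions, which can be sketched as follows. The function name fix_punctuation is hypothetical, and applying the replacements blindly is an assumption made for illustration; in practice each case would deserve a check against context.

```python
def fix_punctuation(text):
    """Apply the corrections listed as errors in the punctuation table."""
    # Order matters: doubled signs must be replaced before single ones.
    replacements = [
        ("<<", "\u00AB"),   # << -> left guillemet
        (">>", "\u00BB"),   # >> -> right guillemet
        ("<", "\u2039"),    # remaining < -> single left guillemet
        (">", "\u203A"),    # remaining > -> single right guillemet
        ("_", "\u02D7"),    # low line -> modifier letter minus sign
        ("\u201A", ","),    # single low-9 quotation mark -> comma
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

print(fix_punctuation("<< a <b> c >>"))
```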
Unfortunately no math symbols or other numeric related characters are provided. Unicode codepoints are provided here as an added point of reference. They are not in the source text.
Codepoint | Grapheme |
---|---|
U+0030 | 0 |
U+0031 | 1 |
U+0032 | 2 |
U+0033 | 3 |
U+0034 | 4 |
U+0035 | 5 |
U+0036 | 6 |
U+0037 | 7 |
U+0038 | 8 |
U+0039 | 9 |
As evidenced in the corpus, as it was originally delivered, when writing Eastern Dan with the Latin script the following numbers are used.
Codepoint | Grapheme |
---|---|
U+0030 | 0 |
U+0031 | 1 |
U+0032 | 2 |
U+0033 | 3 |
U+0034 | 4 |
U+0035 | 5 |
U+0036 | 6 |
U+0037 | 7 |
U+0038 | 8 |
U+0039 | 9 |
0
1
2
3
4
5
6
7
8
9
Number oriented notes:
- Thousands separator is 〈.〉 U+002E 'FULL STOP'.
- The 〈n°〉 is a shortened form of the word "number" in many transcription traditions. Unicode has a special character for this: 〈№〉 U+2116 'NUMERO SIGN'. The typographical norm in Dan appears to follow French social practice, rather than best practice for encoding. This was evidenced only one time in the corpus and is the source of 〈°〉 U+00B0 'DEGREE SIGN', and likely deserves further investigation before strong claims are made about what method should be used in Eastern Dan writing. Wikipedia suggests that "the numero symbol is not in common use in France and does not appear on a standard AZERTY keyboard. Instead, the French Imprimerie nationale recommends the use of the form 〈no〉 (an 〈n〉 followed by a superscript lowercase 〈o〉). The plural form 〈nos〉 can also be used. In practice, the 〈o〉 is often replaced by the degree symbol 〈°〉, which is visually similar to the superscript 〈o〉 and is easily accessible on an AZERTY keyboard."24
- Telephone numbers are written in series of two digits. These digits can be separated with 〈.〉 U+002E or spaces.
grep -n -P "\s\d" proof-of-concept-text.txt
- A list of numbers is separated by a comma and a space. e.g. 〈1, 2, 3〉
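The grep command above finds any digit preceded by whitespace. A narrower pattern for the telephone number convention (pairs of digits separated by a full stop or a space) could be sketched as below. The pattern name PHONE, the requirement of at least three pairs, and the sample numbers are all assumptions for illustration, not values taken from the corpus.

```python
import re

# Pairs of digits separated by "." or a space; at least three pairs are
# required here (an assumption) so that short forms like "12.34" are skipped.
PHONE = re.compile(r"(?<!\d)\d{2}(?:[ .]\d{2}){2,}(?!\d)")

# Invented sample text with made-up numbers.
sample = "Tel: 07 12 34 56 ou 07.12.34.56, total 1.000.000"
print(PHONE.findall(sample))
```

Because the groups must be exactly two digits wide, a figure that uses the full stop as a thousands separator, such as the last item in the sample, is not matched.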
According to RFC 3986 25, the following characters are needed for reasonable Internet use in the URL and URI syntax. In the Internet domain these characters can sometimes have a reserved meaning. Therefore they should be given appropriate consideration in all orthographies. So while their typographical function may or may not be present in the everyday writing of Eastern Dan, as Eastern Dan speakers become more digitally active with their language, these characters will increase in their usage by Eastern Dan language users.
This does not preclude any language based denotation that the orthography may make on these characters. For instance there is a long typographical history in Eastern Dan of using 〈=〉 U+003D 'EQUALS SIGN' as a tone marking character. It is even the case that the original text of this corpus was encoded with this character, no doubt for practical reasons of keyboard accessibility. However the more appropriate character is 〈꞊〉 U+A78A 'MODIFIER LETTER SHORT EQUALS SIGN'. Typographically across fonts, it is common that these characters appear the same, however their Unicode properties are different. U+A78A can not be substituted for Internet use and U+003D will not properly join with other text to form words in text processing software. By way of analogy, just because the Internet does not use the same quote marks that French and Eastern Dan do, does not mean that these languages need to change, only that accessing these characters and their social contribution is a needed consideration in orthography statements and written language development.
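The difference in Unicode properties between the two look-alike characters can be inspected with Python's standard unicodedata module. This is a minimal sketch; the helper name describe is introduced here for illustration, and it reports only the character name and general category, not the full set of properties that text processing software consults.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The keyboard equals sign and its modifier-letter counterpart.
for ch in "=\uA78A":
    print(describe(ch))
```

Two characters that render identically in a given font will still be reported with different names and categories, which is why software treats them differently.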
Unmentioned in RFC 3986 is the use of 〈"〉 U+0022 'QUOTATION MARK', 〈>〉 U+003E 'GREATER-THAN SIGN', and 〈<〉 U+003C 'LESS-THAN SIGN', which are all highly important in various mark-ups like HTML5 26. Markdown 27, a common text markup language, requires 〈`〉 U+0060 'GRAVE ACCENT', 〈|〉 U+007C 'VERTICAL LINE', and 〈\〉 U+005C 'REVERSE SOLIDUS'. The following table represents RFC 3986 plus 〈", <, >, |, `, \ 〉. Many of these characters are evidenced in the corpus. However some are not evidenced.
Codepoint | Grapheme |
---|---|
U+0021 | ! |
U+0022 | " |
U+0023 | # |
U+0024 | $ |
U+0025 | % |
U+0026 | & |
U+0027 | ' |
U+0028 | ( |
U+0029 | ) |
U+002A | * |
U+002B | + |
U+002C | , |
U+002D | - |
U+002E | . |
U+002F | / |
U+003A | : |
U+003B | ; |
U+003C | < |
U+003D | = |
U+003E | > |
U+003F | ? |
U+0040 | @ |
U+005C | \ |
U+005B | [ |
U+005D | ] |
U+005F | _ |
U+0060 | ` |
U+007C | | |
U+007E | ~ |
%
:
/
?
#
[
]
@
!
$
&
'
(
)
*
+
"
,
;
=
-
.
_
~
"
`
|
>
<
The Internet was not discussed in the 1994 reading primer.
This corpus does not represent Internet communication, therefore it seems a bit presumptive to suggest that any character in this corpus represents use on the Internet. Though this should be a consideration for keyboard layout and text production tools for Eastern Dan.
It is worth noting that the local paper evidently did have some online presence at www.pamebhame.info. This was some time around 2008. A quick check of the Internet archive shows that no content was preserved in the Internet archive.
Actually neither the 1994 reading primer28 nor the 1982 reading primer29 presents or addresses the issue of an alphabet, or alphabetical ordering. Both resources present their audiences with a list of pedagogical learning units which match well with the phonemics of Eastern Dan (with a few exceptions). They present these in functional units (a term I borrow from Holm 1971 30 and Venezky 1970 31, 1967 32), ordered and grouped by place of articulation (phonetic detail). Therefore, according to the information which is available, it would appear that no alphabet statement has been made for Eastern Dan.
That said, a letter list should be possible, and relevant to this section, though any ordering presented here would only be for practical reasons, and is not intended to be prescriptive. In this presentation I present diacritics as a component of the letters on which they occur. I do this because the diacritics themselves do not have a consistent meaning in the orthography. I do not list every functional unit, only the letters from which functional units can be created. This is true for vowels, tone patterns, and double articulated consonants. Based on the letters presented in the 1994 primer the following letters would need to be in an alphabet. This list includes 36 letters; 32 with casing pairs for a total of 68 alphabetic graphemes. A list of functional units will be presented in a separate section. Related files: CSV of this table; text string of uncased letters followed by case matched letters.
Uppercase | Lowercase | Glyph | Glyph | Approximate Unicode Name |
---|---|---|---|---|
NFD Encoding | NFD Encoding | Full Unicode Names contain 'CAPITAL' or 'SMALL'. | ||
U+0041 | U+0061 | A | a | LATIN LETTER A |
U+0042 | U+0062 | B | b | LATIN LETTER B |
U+0044 | U+0064 | D | d | LATIN LETTER D |
U+0045 | U+0065 | E | e | LATIN LETTER E |
U+0045 U+0308 | U+0065 U+0308 | Ë | ë | LATIN LETTER E with COMBINING DIAERESIS |
U+0046 | U+0066 | F | f | LATIN LETTER F |
U+0047 | U+0067 | G | g | LATIN LETTER G |
U+0048 | U+0068 | H | h | LATIN LETTER H |
U+0049 | U+0069 | I | i | LATIN LETTER I |
U+004B | U+006B | K | k | LATIN LETTER K |
U+004C | U+006C | L | l | LATIN LETTER L |
U+004D | U+006D | M | m | LATIN LETTER M |
U+004E | U+006E | N | n | LATIN LETTER N |
U+004F | U+006F | O | o | LATIN LETTER O |
U+004F U+0308 | U+006F U+0308 | Ö | ö | LATIN LETTER O with COMBINING DIAERESIS |
U+0050 | U+0070 | P | p | LATIN LETTER P |
U+0052 | U+0072 | R | r | LATIN LETTER R |
U+0053 | U+0073 | S | s | LATIN LETTER S |
U+0054 | U+0074 | T | t | LATIN LETTER T |
U+0055 | U+0075 | U | u | LATIN LETTER U |
U+0055 U+0308 | U+0075 U+0308 | Ü | ü | LATIN LETTER U with COMBINING DIAERESIS |
U+0056 | U+0076 | V | v | LATIN LETTER V |
U+0057 | U+0077 | W | w | LATIN LETTER W |
U+0059 | U+0079 | Y | y | LATIN LETTER Y |
U+005A | U+007A | Z | z | LATIN LETTER Z |
U+0186 | U+0254 | Ɔ | ɔ | LATIN LETTER OPEN O |
U+0190 | U+025B | Ɛ | ɛ | LATIN LETTER OPEN E |
U+0196 | U+0269 | Ɩ | ɩ | LATIN LETTER IOTA |
U+01B2 | U+028B | Ʋ | ʋ | LATIN LETTER V WITH HOOK |
U+01B2 U+0308 | U+028B U+0308 | Ʋ̈ | ʋ̈ | LATIN LETTER V WITH HOOK with COMBINING DIAERESIS |
N/a | U+02BC | | ʼ | MODIFIER LETTER APOSTROPHE |
N/a | U+02D7 | | ˗ | MODIFIER LETTER MINUS SIGN |
N/a | U+02EE | | ˮ | MODIFIER LETTER DOUBLE APOSTROPHE |
N/a | U+A78A | | ꞊ | MODIFIER LETTER SHORT EQUALS SIGN |
NFC Encoding | NFC Encoding | |||
U+00CB | U+00EB | Ë | ë | LATIN LETTER E WITH DIAERESIS |
U+00D6 | U+00F6 | Ö | ö | LATIN LETTER O WITH DIAERESIS |
U+00DC | U+00FC | Ü | ü | LATIN LETTER U WITH DIAERESIS |
None | None | Ʋ̈ | ʋ̈ | LATIN LETTER V WITH HOOK with COMBINING DIAERESIS (no precomposed character exists, so the NFC and NFD encodings are identical) |
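The difference between the NFD and NFC rows can be checked mechanically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `codepoints` and `both_forms` are hypothetical and are not part of this repository.

```python
import unicodedata

def codepoints(text):
    """Format each character of text as U+XXXX, separated by spaces."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def both_forms(letter):
    """Return the (NFD, NFC) code point strings for one letter."""
    nfd = unicodedata.normalize("NFD", letter)
    nfc = unicodedata.normalize("NFC", letter)
    return codepoints(nfd), codepoints(nfc)

# Letters with a diaeresis from the table above, written with escapes
# so that the input encoding is unambiguous.
for letter in ["\u00CB", "\u00F6", "\u01B2\u0308", "\u028B\u0308"]:
    nfd, nfc = both_forms(letter)
    print(f"{letter}  NFD: {nfd}  NFC: {nfc}")
```

For 〈Ë〉 the two forms differ (a base letter plus U+0308 versus a single precomposed character), while for 〈Ʋ̈〉 and 〈ʋ̈〉 both forms come out as the same two-code-point sequence.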
Functional units are sets of graphemes that work together to convey meaning to a reader. In English, 〈ch〉 is a functional unit: the reader needs to parse the two letters as a single unit when mapping the orthographic representation to a phonological one.
The following is a list of functional units presented with both casing options. Because these are functional units, it is assumed that there is some level of phonemic reality to which these graphical units relate. The tonal patterns are written as Perl regular expressions in angle brackets: \s indicates a space (word boundary), \p{L} indicates some letter(s), and the tone marks represent themselves.
A a
Aa aa
An an
Aan aan
Aɔ aɔ
Aɔn aɔn
Bh bh
D d
Dh dh
E e
Ee ee
Ɛ ɛ
Ɛɛ ɛɛ
Ɛa ɛa
Ɛan ɛan
Ɛn ɛn
Ɛɛn ɛɛn
Ë ë
Ëë ëë
Ën ën
Ëën ëën
F f
G g
Gb gb
Gw gw
I i
In in
Ii ii
Iin iin
Ɩ ɩ
Ɩɩ ɩɩ
K k
Kp kp
Kw kw
L l
M m
N n
Ng ng
O o
Oo oo
Ö ö
Öö öö
Ɔ ɔ
Ɔɔ ɔɔ
Ɔn ɔn
Ɔɔn ɔɔn
P p
R r
S s
T t
U u
Uu uu
Un un
Uun uun
Ü ü
Üü üü
Ün ün
Üün üün
V v
W w
Y y
Z z
Ʋ ʋ
Ʋʋ ʋʋ
Ʋ̈ ʋ̈
Ʋ̈ʋ̈ ʋ̈ʋ̈
iʋ̈
iö
ië
ia
ian
ɩa
uë
ʋë
ʋ̈ü
〈ˮ\p{L}\s〉
〈ʼ\p{L}\s〉
〈\s\p{L}\s〉
〈꞊\p{L}\s〉
〈˗\p{L}\s〉
〈ˮ\p{L}˗〉
〈ʼ\p{L}˗〉
〈\s\p{L}˗〉
〈꞊\p{L}\s˗〉
〈\s\p{L}ʼ〉
〈\s\p{L}ˮ〉
Phoneme chart (Oral) (SIL 1982; V&K 2008, Ch. 10)
Linguistically, Eastern Dan is claimed to have a 12-point vowel system with length, pitch, and nasalization distinctions. Length has been analyzed as two sequential vowels. Pitch patterns are covered under the tone marking section. Nasalization occurs phonemically on 9 vowels. The velar nasal /ŋ/, orthographically indicated as 〈ng〉, is linguistically considered a vowel in Eastern Dan (SIL 1982; V&K 2008). This brings the total to 22 vowels.
Oral | Front Unrounded | Back Unrounded | Back Rounded |
---|---|---|---|
Close | i | ɯ | u |
Near-close | |||
Mid | e | ɤ | o |
Open-mid | ɛ | ʌ | ɔ |
Near-open | æ | ||
Open | a | | ɒ |
Nasal | Front Unrounded | Back Unrounded | Back Rounded |
---|---|---|---|
Close | ĩ | ɯ̃ | ũ |
Near-close | |||
Mid | |||
Open-mid | ɛ̃ | ʌ̃ | ɔ̃ |
Near-open | æ̃ | ||
Open | ã | | ɒ̃ |
/ŋ/
Allophonic variation of vowels does occur. In some dialects these allophones have been considered phonemic. However, their phonemic status is not attested across all of Eastern Dan. Eastern Dan's writing system attempts to be pan-dialectal, which accounts for the addition of three letters, 〈ɩ〉, 〈ʋ〉, and 〈ʋ̈〉, between the 1982 and the 1994 versions of the reading primers. These allophones are the result of Extra High Tone interacting with the phonemes /e/, /o/, and /ɤ/ respectively.
The following is a list of functional units which represent vowels. All of these functional units are attested in the 1994 primer. Nasal vowels are indicated by an 〈n〉 following the vowel, though 〈n〉 can also be a self-standing letter in the orthography.
Ʋ ʋ
Ʋʋ ʋʋ
Ʋ̈ ʋ̈
Ʋ̈ʋ̈ ʋ̈ʋ̈
U u
Uu uu
Un un
Uun uun
Ü ü
Üü üü
Ün ün
Üün üün
Ng ng
O o
Oo oo
Ö ö
Öö öö
Ɔ ɔ
Ɔɔ ɔɔ
Ɔn ɔn
Ɔɔn ɔɔn
I i
In in
Ii ii
Iin iin
Ɩ ɩ
Ɩɩ ɩɩ
E e
Ee ee
Ɛ ɛ
Ɛɛ ɛɛ
Ɛa ɛa
Ɛan ɛan
Ɛn ɛn
Ɛɛn ɛɛn
Ë ë
Ëë ëë
Ën ën
Ëën ëën
A a
Aa aa
An an
Aan aan
Aɔ aɔ
Aɔn aɔn
iʋ̈
iö
ië
ia
ian
ɩa
uë
ʋë
ʋ̈ü
Eastern Dan vowels carry distinctions for length, pitch, and nasality. Nasality is indicated by an 〈n〉 following the vowel. Vowel length has been linguistically analyzed as two separate vowels and is indicated by sequential characters, e.g. 〈aa〉. Some vowels are indicated by a digraph, 〈ɛa〉 or 〈aɔ〉; these are not diphthongs (vowels that start at one phonetic value and finish at another), though Eastern Dan also has diphthongs. A dieresis above a vowel indicates a separate vowel quality. Vowels with a dieresis are thought of as a single character or letter of the alphabet; the dieresis is not a separable unit (even though at the character encoding level in Unicode it can be encoded separately, as U+0308). The eng /ŋ/, orthographically indicated as 〈ng〉, is linguistically considered a vowel in Eastern Dan. This is in contrast to the typologically normal analysis and to the IPA's use of the symbol /ŋ/ for a consonant. Casing: for words starting with long/double vowels, only the first letter is case sensitive for sentence-based casing rules. Many vowel units are presented here, but this list should not be taken to represent the Eastern Dan alphabet.
Codepoint (NFC) | Functional Unit | IPA equivalent | Phonetic description |
---|---|---|---|
Uppercase, lowercase | , | ||
U+004E U+0067, U+006E U+0067 | Ng, ng | ŋ | Velar Nasal |
U+0041 U+0061 U+006E, U+0061 U+0061 U+006E | Aan, aan | ãã | Long nasalized front open unrounded vowel |
U+0041 U+0061, U+0061 U+0061 | Aa, aa | aa | Long front open unrounded vowel |
U+0190 U+0061 U+006E, U+025B U+0061 U+006E | Ɛan, ɛan | æ̃ | Short nasalized near-open front unrounded vowel |
U+0190 U+0061, U+025B U+0061 | Ɛa, ɛa | æ | Short near-open front unrounded vowel |
U+0041 U+0254 U+006E, U+0061 U+0254 U+006E | Aɔn, aɔn | ɒ̃ | Short nasalized open back rounded vowel |
U+0041 U+0254, U+0061 U+0254 | Aɔ, aɔ | ɒ | Short open back rounded vowel |
U+0041 U+006E, U+0061 U+006E | An, an | ã | Short nasalized front open unrounded vowel |
U+0190, U+025B | Ɛ, ɛ | ɛ | Short open-mid front unrounded vowel |
U+0190 U+025B, U+025B U+025B | Ɛɛ, ɛɛ | ɛɛ | Long open-mid front unrounded vowel |
U+0190 U+025B U+006E, U+025B U+025B U+006E | Ɛɛn, ɛɛn | ɛ̃ɛ̃ | Long nasalized open-mid front unrounded vowel |
U+0190 U+006E, U+025B U+006E | Ɛn, ɛn | ɛ̃ | Short nasalized open-mid front unrounded vowel |
U+0186, U+0254 | Ɔ, ɔ | ɔ | Short open-mid back rounded vowel |
U+0186 U+0254, U+0254 U+0254 | Ɔɔ, ɔɔ | ɔɔ | Long open-mid back rounded vowel |
U+0186 U+0254 U+006E, U+0254 U+0254 U+006E | Ɔɔn, ɔɔn | ɔ̃ɔ̃ | Long nasalized open-mid back rounded vowel |
U+0186 U+006E, U+0254 U+006E | Ɔn, ɔn | ɔ̃ | Short nasalized open-mid back rounded vowel |
U+00DC, U+00FC | Ü, ü | ɯ | Short close back unrounded vowel |
U+00DC U+00FC, U+00FC U+00FC | Üü, üü | ɯɯ | Long close back unrounded vowel |
U+00CB, U+00EB | Ë, ë | ʌ | Short open-mid back unrounded vowel |
U+00D6, U+00F6 | Ö, ö | ɤ | Short close-mid back unrounded vowel |
U+00D6 U+00F6, U+00F6 U+00F6 | Öö, öö | ɤɤ | Long close-mid back unrounded vowel |
U+00CB U+00EB, U+00EB U+00EB | Ëë, ëë | ʌʌ | Long open-mid back unrounded vowel |
U+00CB U+00EB U+006E, U+00EB U+00EB U+006E | Ëën, ëën | ʌ̃ʌ̃ | Long nasalized open-mid back unrounded vowel |
U+00CB U+006E, U+00EB U+006E | Ën, ën | ʌ̃ | Short nasalized open-mid back unrounded vowel |
U+0045, U+0065 | E, e | e | Short close-mid front unrounded vowel |
U+0045 U+0065, U+0065 U+0065 | Ee, ee | ee | Long close-mid front unrounded vowel |
U+0041, U+0061 | A, a | a | Short open front unrounded vowel |
U+00DC U+006E, U+00FC U+006E | Ün, ün | ɯ̃ | Short nasalized close back unrounded vowel |
U+00DC U+00FC U+006E, U+00FC U+00FC U+006E | Üün, üün | ɯ̃ɯ̃ | Long nasalized close back unrounded vowel |
U+0055, U+0075 | U, u | u | Short close back rounded vowel |
U+0055 U+0075, U+0075 U+0075 | Uu, uu | uu | Long close back rounded vowel |
U+0055 U+006E, U+0075 U+006E | Un, un | ũ | Short nasalized close back rounded vowel |
U+0055 U+0075 U+006E, U+0075 U+0075 U+006E | Uun, uun | ũũ | Long nasalized close back rounded vowel |
U+004F, U+006F | O, o | o | Short close-mid back rounded vowel |
U+004F U+006F, U+006F U+006F | Oo, oo | oo | Long close-mid back rounded vowel |
U+0049 U+0069 U+006E, U+0069 U+0069 U+006E | Iin, iin | ĩĩ | Long nasalized close front unrounded vowel |
U+0049 U+0069, U+0069 U+0069 | Ii, ii | ii | Long close front unrounded vowel |
U+0049 U+006E, U+0069 U+006E | In, in | ĩ | Short nasalized close front unrounded vowel |
U+0049, U+0069 | I, i | i | Short close front unrounded vowel |
U+0196 U+0269, U+0269 U+0269 | Ɩɩ, ɩɩ | /ee/,[ɪɪ] | Long near-close front unrounded vowel |
U+0196, U+0269 | Ɩ, ɩ | /e/, [ɪ] | Short near-close front unrounded vowel |
U+01B2, U+028B | Ʋ, ʋ | /o/, [ʊ] | Short near-close near-back rounded vowel |
U+01B2 U+028B, U+028B U+028B | Ʋʋ, ʋʋ | /oo/, [ʊʊ] | Long near-close near-back rounded vowel |
U+01B2 U+0308, U+028B U+0308 | Ʋ̈, ʋ̈ | /ɤ/, [ʊ̜] or [ɯ̞̈] | Short near-close (near) back unrounded vowel |
U+01B2 U+0308 U+028B U+0308, U+028B U+0308 U+028B U+0308 | Ʋ̈ʋ̈, ʋ̈ʋ̈ | /ɤɤ/, [ʊ̜ʊ̜] or [ɯ̞̈ɯ̞̈] | Long near-close (near) back unrounded vowel |
Diphthongs
Codepoint (NFC) | Functional Unit | IPA equivalent | Phonetic description |
---|---|---|---|
Diphthongs | |||
U+0069 U+028B U+0308 | iʋ̈ | iɯ̞̈ | |
U+0069 U+00F6 | iö | iɤ | |
U+0069 U+00EB | ië | iʌ | |
U+0075 U+00EB | uë | uʌ | |
U+028B U+00EB | ʋë | ʊʌ | |
U+028B U+0308 U+00FC | ʋ̈ü | ʊɯ | |
U+0069 U+0061 | ia | ia | |
U+0069 U+0061 U+006E | ian | ĩã | |
U+0269 U+0061 | ɩa | /ea/, [ɪa] | |
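The code point columns in the vowel and diphthong tables can be generated rather than typed by hand, which guards against a missing nasal 〈n〉 or a capital code point in a lowercase unit. The sketch below assumes the casing rule stated earlier (only the first letter of a unit takes the capital) and NFC encoding; `fmt` and `cased_unit_codepoints` are hypothetical helper names.

```python
import unicodedata

def fmt(text):
    """Format each character of text as U+XXXX, separated by spaces."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def cased_unit_codepoints(unit):
    """Return (capitalised, lowercase) NFC code point strings for a functional unit.

    Only the first letter is capitalised, following the sentence-casing
    rule described for long/double vowels.
    """
    lower = unicodedata.normalize("NFC", unit.lower())
    # Uppercase the first code point only; a following U+0308 stays attached.
    upper = unicodedata.normalize("NFC", lower[0].upper() + lower[1:])
    return fmt(upper), fmt(lower)

for unit in ["a\u0254n", "\u025Ban", "\u028B\u0308\u00FC", "\u0269a"]:
    cap, low = cased_unit_codepoints(unit)
    print(f"{unit}: {cap}, {low}")
```

Because the unit is normalised before formatting, the output does not depend on whether the input was typed in composed or decomposed form.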
Phoneme chart (SIL 1982; V&K 2008, Ch. 10)
 | Labial | Dental | Palatal | Velar | Labio-velar |
---|---|---|---|---|---|
Voiceless Stops | p | t | | k | kp, kw |
Voiced Stops | b | d | | g | gb, gw |
Voiceless fricatives | f | s | |||
Voiced Fricatives | v | z | |||
Implosives | ɓ | ɗ | |||
Continuants | r | l | y | w |
Kp kp
Kw kw
K k
Gb gb
Gw gw
G g
Bh bh
Dh dh
B b
D d
M m
N n
F f
S s
V v
T t
Z z
L l
W w
R r
Y y
P p
The presentation order of consonants here does not represent the alphabet of Dan, but rather the order required to tokenize the text into phonemes.
Codepoint | Grapheme | IPA equivalent | Phonetic description |
---|---|---|---|
Uppercase, lowercase | , | ||
U+004B U+0070, U+006B U+0070 | Kp, kp | k͡p | |
U+004B U+0077, U+006B U+0077 | Kw, kw | k͡w | |
U+004B, U+006B | K, k | k | Voiceless velar stop |
U+0047 U+0062, U+0067 U+0062 | Gb, gb | g͡b | |
U+0047 U+0077, U+0067 U+0077 | Gw, gw | g͡w | |
U+0047, U+0067 | G, g | ɡ | Voiced velar stop |
U+0042 U+0068, U+0062 U+0068 | Bh, bh | ɓ | Voiced bilabial implosive |
U+0044 U+0068, U+0064 U+0068 | Dh, dh | ɗ | Voiced dental implosive |
U+0042, U+0062 | B, b | b | Voiced bilabial stop |
U+0044, U+0064 | D, d | d | Voiced dental stop |
U+004D, U+006D | M, m | m | Bilabial nasal |
U+004E, U+006E | N, n | n | Dental nasal |
U+0046, U+0066 | F, f | f | Voiceless labiodental fricative |
U+0053, U+0073 | S, s | s | |
U+0056, U+0076 | V, v | v | Voiced labiodental fricative |
U+0054, U+0074 | T, t | t | Voiceless dental stop |
U+005A, U+007A | Z, z | ||
U+004C, U+006C | L, l | l | |
U+0057, U+0077 | W, w | ||
U+0052, U+0072 | R, r | ||
U+0059, U+0079 | Y, y | ||
U+0050, U+0070 | P, p | p | Voiceless bilabial stop |
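The ordering requirement mentioned above (digraphs such as 〈kp〉 must be tried before 〈k〉) amounts to longest-match-first tokenization. The following is a minimal sketch of that idea, not the method used to produce this corpus. The unit inventory is copied from the lists in this document, `tokenize` is a hypothetical name, and tone marks or unknown characters are simply emitted as single-character tokens. A real tokenizer would also need to decide whether an 〈n〉 between two vowels marks nasality or is a consonant; this greedy sketch always prefers the longer vowel unit.

```python
import unicodedata

# Unit inventory copied from the functional unit lists (lowercase only).
CONSONANTS = "kp kw gb gw bh dh k g b d m n f s v t z l w r y p".split()
VOWELS = (
    "a aa an aan aɔ aɔn e ee ɛ ɛɛ ɛa ɛan ɛn ɛɛn ë ëë ën ëën "
    "i in ii iin ɩ ɩɩ ng o oo ö öö ɔ ɔɔ ɔn ɔɔn u uu un uun "
    "ü üü ün üün ʋ ʋʋ ʋ̈ ʋ̈ʋ̈ iʋ̈ iö ië ia ian ɩa uë ʋë ʋ̈ü"
).split()

# Normalise to NFC and sort longest first so digraphs win over single letters.
UNITS = sorted(
    {unicodedata.normalize("NFC", u) for u in CONSONANTS + VOWELS},
    key=len,
    reverse=True,
)

def tokenize(word):
    """Split a word into functional units using greedy longest match."""
    text = unicodedata.normalize("NFC", word).lower()
    tokens = []
    i = 0
    while i < len(text):
        for unit in UNITS:
            if text.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            # Tone marks, punctuation, or anything outside the inventory.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("\u02D7Pamɛbhamɛ"))
```

For the newspaper title this yields the tone mark followed by `p a m ɛ bh a m ɛ`, with 〈bh〉 kept together as one unit.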
Four characters are used to indicate one of ten possible tone patterns for a given word. That is, there are ten patterns in the language, not ten possible patterns per word. No known statement specifies a Unicode encoding for these characters. However, based on the behavior of various Unicode characters, the following appear to be the correct choices, as they are the only look-alike characters with letter properties: 〈˗〉 U+02D7 'MODIFIER LETTER MINUS SIGN', 〈ʼ〉 U+02BC 'MODIFIER LETTER APOSTROPHE', 〈ˮ〉 U+02EE 'MODIFIER LETTER DOUBLE APOSTROPHE', 〈꞊〉 U+A78A 'MODIFIER LETTER SHORT EQUALS SIGN'.
Codepoint | Grapheme Pattern | IPA equivalent | Phonological description | Usage Note |
---|---|---|---|---|
U+02EE, No Casing | 〈ˮ\p{L}\s〉 | ˥ | xH | double quote starting the word |
U+02BC, No Casing | 〈ʼ\p{L}\s〉 | ˦ | H | apostrophe starting the word |
Null, No Casing | 〈\s\p{L}\s〉 | ˧ | M | no marking at all for tone |
U+A78A, No Casing | 〈꞊\p{L}\s〉 | ˨ | L | equals sign starting the word |
U+02D7, No Casing | 〈˗\p{L}\s〉 | ˩ | xL | minus sign starting the word |
No Casing | 〈ˮ\p{L}˗〉 | | xH falling to L | double quote starting the word with minus at the end of the string |
No Casing | 〈ʼ\p{L}˗〉 | | H falling to L | apostrophe starting the word with minus at the end of the string |
No Casing | 〈\s\p{L}˗〉 | | M falling to L | Null in front followed by minus at the end of the string |
No Casing | 〈\s\p{L}ʼ〉 | | M rising to H | Null in front followed by apostrophe at the end of the string |
No Casing | 〈\s\p{L}ˮ〉 | | M rising to xH | Null in front followed by double quote at the end of the string |
ˮ
ʼ
꞊
˗
˗
ʼ
ˮ
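Because the tone pattern of a word is determined by the mark at its start and the mark at its end, the ten rows of the tone table can be expressed as a lookup on that pair. The sketch below is illustrative only: `tone_pattern` is a hypothetical function, the labels are taken from the table's phonological description column, and any combination that is not in the table returns `None`.

```python
# Tone marks as assumed in this description (modifier letters, not punctuation).
XH, H, L, XL = "\u02EE", "\u02BC", "\uA78A", "\u02D7"
TONE_MARKS = {XH, H, L, XL}

# (word-initial mark, word-final mark) -> pattern label from the tone table.
PATTERNS = {
    (XH, ""): "xH",
    (H, ""): "H",
    ("", ""): "M",
    (L, ""): "L",
    (XL, ""): "xL",
    (XH, XL): "xH falling to L",
    (H, XL): "H falling to L",
    ("", XL): "M falling to L",
    ("", H): "M rising to H",
    ("", XH): "M rising to xH",
}

def tone_pattern(word):
    """Return the tone pattern label for a word, or None if not in the table."""
    start = word[0] if word and word[0] in TONE_MARKS else ""
    # A one-character word cannot have both an initial and a final mark.
    end = word[-1] if len(word) > 1 and word[-1] in TONE_MARKS else ""
    return PATTERNS.get((start, end))

print(tone_pattern("\u02D7Pamɛbhamɛ"))
```

The example prints `xL` for the newspaper title, since the word starts with the minus mark and has no final mark.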
French is the national language in the country where the densest population of Eastern Dan speakers resides. It therefore makes some sense to add the characters needed for French to a text input solution. However, those characters are separated out here so that it is possible to design a text input solution without them.
French is used in the book but there is no indication or attempt to define French writing norms or requirements as they are applied in Ivory Coast (Côte d'Ivoire). The introduction to Dan orthography as presented in ˗Pamɛbhamɛ states:
c, h, j, qu et x n'existent pas en dan.
This translates as: "The letters 〈c〉, 〈h〉, 〈j〉, 〈qu〉, and 〈x〉 do not exist in Dan." While this may be true at a very strict level (when considering functional units rather than actual characters), several issues come to light immediately:
- 〈h〉 is present in 〈bh〉 and 〈dh〉, therefore is in the writing system, and orthography, and is a letter.
- 〈j〉 is often used in loan words like Abidjan.
- 〈qu〉 is not a letter, and 〈u〉 is clearly in Dan's writing system and orthography — as a letter.
So if we were to include characters which are not frequently used in Dan, but are in some way needed in the writing system, we might arrive at something like a statement of auxiliary characters. Auxiliary characters are characters which are not in the alphabet, and might not be in the sort order, but are needed in the writing system. Unicode informally defines five categories of characters in TR35 [41]:
- main / standard
- auxiliary
- index
- punctuation
- number
Image provided by Ian Douglas, rendered in LibreOffice
Some texts have relied on Unicode PUA code points (U+E000..U+F8FF). All Dan texts should be checked for PUA characters. Known PUA characters in use have been:
- U+F173 COMBINING MACRON-GRAVE. U+F173 was deprecated when the character was added to Unicode 5.0 as 〈◌᷆〉 U+1DC6 'COMBINING MACRON-GRAVE'. There were 22 occurrences in a Toolbox file which is not part of this corpus.
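A check for PUA characters is simple to script. The following sketch scans a string for code points in the range given above; `find_pua` is a hypothetical helper, and reading the text from a file is left to the caller.

```python
def find_pua(text):
    """Return (offset, 'U+XXXX') pairs for characters in U+E000..U+F8FF."""
    hits = []
    for offset, ch in enumerate(text):
        if 0xE000 <= ord(ch) <= 0xF8FF:
            hits.append((offset, f"U+{ord(ch):04X}"))
    return hits

# The deprecated PUA macron-grave is reported; the standard U+1DC6 is not.
print(find_pua("a\uF173 a\u1DC6"))
```

An empty result means no Basic Multilingual Plane PUA characters were found; the supplementary private use planes are not covered by this range.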
The corpus consists of about 20 issues of a 4-page monthly newsletter/newspaper published between 2005 and 2008, plus several chapters of While waiting for a medical doctor. A New Testament is also known to exist, but is not included in this repository or in the character counts.
It should be noted that the percentages of characters and the percentages of phonemes presented here are attested only in this corpus. This corpus is not necessarily natural speech, and some characters may be over-represented because ˗Pamɛbhamɛ, which was targeted at new readers, published a chart of the alphabet in nearly every issue, with some, but not many, words in French.
Significant character changes were made in the corpus in an attempt to bring it into a consistent typographical state. These changes are reflected in the numbers presented in the character level stats.
Linux command line wc -l -w -m stats are presented for the text before and after clean up. initial-starting-corpus.txt includes all of the ˗Pamɛbhamɛ issues and the chapters of While waiting for a medical doctor.
Round | Lines | Words | Characters |
---|---|---|---|
Initial Starting corpus | 15756 | 86466 | 416782 |
Final corpus | 1827 | 83944 | 393362 |
Character level stats:
Code Point | Glyph | Starting Count | Character alterations up to French Removal | Characters left after French Removal | Final Numbers | Unicode Character Name |
---|---|---|---|---|---|---|
U+0009 | | 241 | 240 | 240 | 141 | CHARACTER TABULATION |
U+000A | | 15756 | 10567 | 10567 | 2326 | LINE FEED |
U+000C | | 110 | 110 | 110 | NULL | FORM FEED |
U+000D | | 897 | 897 | 897 | NULL | CARRIAGE RETURN |
U+001E | | 2721 | NULL | NULL | NULL | INFORMATION SEPARATOR TWO |
U+0020 | | 73737 | 79602 | 81759 | 81041 | SPACE |
U+0021 | ! | 70 | 70 | 70 | 70 | EXCLAMATION MARK |
U+0022 | " | 3346 | NULL | NULL | NULL | QUOTATION MARK |
U+0027 | ' | 7223 | 86 | 8 | 8 | APOSTROPHE |
U+0028 | ( | 482 | 482 | 482 | 482 | LEFT PARENTHESIS |
U+0029 | ) | 483 | 483 | 483 | 483 | RIGHT PARENTHESIS |
U+002A | * | 20 | 20 | 20 | 20 | ASTERISK |
U+002B | + | 110 | 110 | 110 | 110 | PLUS SIGN |
U+002C | , | 4751 | 4758 | 4713 | 4713 | COMMA |
U+002D | - | 27491 | 16 | 16 | 16 | HYPHEN-MINUS |
U+002E | . | 4181 | 4181 | 4106 | 4106 | FULL STOP |
U+002F | / | 96 | 17 | 17 | 17 | SOLIDUS |
U+0030 | 0 | 867 | 867 | 867 | 867 | DIGIT ZERO |
U+0031 | 1 | 301 | 301 | 286 | 286 | DIGIT ONE |
U+0032 | 2 | 436 | 436 | 421 | 421 | DIGIT TWO |
U+0033 | 3 | 136 | 136 | 136 | 136 | DIGIT THREE |
U+0034 | 4 | 110 | 110 | 110 | 110 | DIGIT FOUR |
U+0035 | 5 | 181 | 181 | 181 | 181 | DIGIT FIVE |
U+0036 | 6 | 81 | 81 | 81 | 81 | DIGIT SIX |
U+0037 | 7 | 160 | 160 | 160 | 160 | DIGIT SEVEN |
U+0038 | 8 | 268 | 268 | 268 | 268 | DIGIT EIGHT |
U+0039 | 9 | 116 | 116 | 116 | 116 | DIGIT NINE |
U+003A | : | 488 | 488 | 473 | 473 | COLON |
U+003B | ; | 79 | 79 | 79 | 79 | SEMICOLON |
U+003C | < | 252 | NULL | NULL | NULL | LESS-THAN SIGN |
U+003D | = | 5458 | NULL | NULL | NULL | EQUALS SIGN |
U+003E | > | 246 | NULL | NULL | NULL | GREATER-THAN SIGN |
U+003F | ? | 202 | 202 | 202 | 202 | QUESTION MARK |
U+0041 | A | 1044 | 1044 | 997 | 997 | LATIN CAPITAL LETTER A |
U+0042 | B | 424 | 424 | 421 | 421 | LATIN CAPITAL LETTER B |
U+0043 | C | 15 | 15 | 15 | 15 | LATIN CAPITAL LETTER C |
U+0044 | D | 767 | 767 | 745 | 745 | LATIN CAPITAL LETTER D |
U+0045 | E | 108 | 108 | 87 | 87 | LATIN CAPITAL LETTER E |
U+0046 | F | 97 | 97 | 97 | 97 | LATIN CAPITAL LETTER F |
U+0047 | G | 448 | 448 | 448 | 448 | LATIN CAPITAL LETTER G |
U+0048 | H | 26 | 26 | 26 | 26 | LATIN CAPITAL LETTER H |
U+0049 | I | 66 | 66 | 66 | 66 | LATIN CAPITAL LETTER I |
U+004A | J | 9 | 9 | 9 | 9 | LATIN CAPITAL LETTER J |
U+004B | K | 1224 | 1224 | 1224 | 1224 | LATIN CAPITAL LETTER K |
U+004C | L | 145 | 145 | 60 | 60 | LATIN CAPITAL LETTER L |
U+004D | M | 671 | 671 | 671 | 671 | LATIN CAPITAL LETTER M |
U+004E | N | 356 | 356 | 335 | 335 | LATIN CAPITAL LETTER N |
U+004F | O | 50 | 47 | 47 | 47 | LATIN CAPITAL LETTER O |
U+0050 | P | 301 | 301 | 301 | 301 | LATIN CAPITAL LETTER P |
U+0052 | R | 8 | 8 | 8 | 8 | LATIN CAPITAL LETTER R |
U+0053 | S | 479 | 479 | 479 | 479 | LATIN CAPITAL LETTER S |
U+0054 | T | 275 | 275 | 254 | 254 | LATIN CAPITAL LETTER T |
U+0055 | U | 50 | 38 | 38 | 38 | LATIN CAPITAL LETTER U |
U+0056 | V | 121 | 121 | 79 | 79 | LATIN CAPITAL LETTER V |
U+0057 | W | 510 | 510 | 510 | 510 | LATIN CAPITAL LETTER W |
U+0059 | Y | 977 | 977 | 977 | 977 | LATIN CAPITAL LETTER Y |
U+005A | Z | 386 | 386 | 386 | 386 | LATIN CAPITAL LETTER Z |
U+005B | [ | 10 | 10 | 10 | 10 | LEFT SQUARE BRACKET |
U+005C | \ | 1 | 1 | 1 | 1 | REVERSE SOLIDUS |
U+005D | ] | 10 | 10 | 10 | 10 | RIGHT SQUARE BRACKET |
U+005F | _ | 1 | NULL | NULL | NULL | LOW LINE |
U+0061 | a | 29865 | 29865 | 28769 | 28769 | LATIN SMALL LETTER A |
U+0062 | b | 9802 | 9802 | 9520 | 9520 | LATIN SMALL LETTER B |
U+0063 | c | 436 | 436 | 23 | 23 | LATIN SMALL LETTER C |
U+0064 | d | 12050 | 12050 | 11782 | 11782 | LATIN SMALL LETTER D |
U+0065 | e | 5906 | 5111 | 3379 | 3379 | LATIN SMALL LETTER E |
U+0066 | f | 430 | 430 | 367 | 367 | LATIN SMALL LETTER F |
U+0067 | g | 10278 | 10278 | 10114 | 10114 | LATIN SMALL LETTER G |
U+0068 | h | 15463 | 15303 | 15004 | 15004 | LATIN SMALL LETTER H |
U+0069 | i | 8567 | 8567 | 7670 | 7670 | LATIN SMALL LETTER I |
U+006A | j | 71 | 71 | 35 | 35 | LATIN SMALL LETTER J |
U+006B | k | 11978 | 11978 | 11963 | 11963 | LATIN SMALL LETTER K |
U+006C | l | 3995 | 3995 | 3417 | 3417 | LATIN SMALL LETTER L |
U+006D | m | 4363 | 4363 | 4016 | 4016 | LATIN SMALL LETTER M |
U+006E | n | 16368 | 16368 | 15532 | 15532 | LATIN SMALL LETTER N |
U+006F | o | 10311 | 9081 | 8220 | 8220 | LATIN SMALL LETTER O |
U+0070 | p | 4505 | 4505 | 4235 | 4235 | LATIN SMALL LETTER P |
U+0071 | q | 103 | 103 | NULL | NULL | LATIN SMALL LETTER Q |
U+0072 | r | 1762 | 1762 | 534 | 534 | LATIN SMALL LETTER R |
U+0073 | s | 6557 | 6557 | 5467 | 5467 | LATIN SMALL LETTER S |
U+0074 | t | 3756 | 3756 | 2781 | 2781 | LATIN SMALL LETTER T |
U+0075 | u | 7973 | 7335 | 6593 | 6593 | LATIN SMALL LETTER U |
U+0076 | v | 469 | 469 | 324 | 324 | LATIN SMALL LETTER V |
U+0077 | w | 8286 | 8286 | 8286 | 8286 | LATIN SMALL LETTER W |
U+0078 | x | 85 | 85 | 7 | 7 | LATIN SMALL LETTER X |
U+0079 | y | 7445 | 7445 | 7333 | 7333 | LATIN SMALL LETTER Y |
U+007A | z | 1969 | 1969 | 1948 | 1948 | LATIN SMALL LETTER Z |
U+00A0 | | 374 | NULL | NULL | NULL | NO-BREAK SPACE |
U+00A8 | ¨ | 1 | NULL | NULL | NULL | DIAERESIS |
U+00AB | « | 102 | 219 | 219 | 219 | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
U+00B0 | ° | 1 | 1 | 1 | 1 | DEGREE SIGN |
U+00BB | » | 100 | 213 | 213 | 213 | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
U+00CB | Ë | 46 | 46 | 46 | 46 | LATIN CAPITAL LETTER E WITH DIAERESIS |
U+00D6 | Ö | 73 | 76 | 76 | 76 | LATIN CAPITAL LETTER O WITH DIAERESIS |
U+00DC | Ü | 71 | 83 | 83 | 83 | LATIN CAPITAL LETTER U WITH DIAERESIS |
U+00E7 | ç | 21 | 21 | NULL | NULL | LATIN SMALL LETTER C WITH CEDILLA |
U+00E8 | è | 221 | 221 | NULL | NULL | LATIN SMALL LETTER E WITH GRAVE |
U+00E9 | é | 107 | 107 | NULL | NULL | LATIN SMALL LETTER E WITH ACUTE |
U+00EA | ê | 28 | 28 | NULL | NULL | LATIN SMALL LETTER E WITH CIRCUMFLEX |
U+00EB | ë | 8411 | 9206 | 9214 | 9214 | LATIN SMALL LETTER E WITH DIAERESIS |
U+00EE | î | 3 | 3 | NULL | NULL | LATIN SMALL LETTER I WITH CIRCUMFLEX |
U+00F6 | ö | 12699 | 13929 | 13929 | 13929 | LATIN SMALL LETTER O WITH DIAERESIS |
U+00FB | û | 26 | 26 | NULL | NULL | LATIN SMALL LETTER U WITH CIRCUMFLEX |
U+00FC | ü | 5868 | 6506 | 6506 | 6506 | LATIN SMALL LETTER U WITH DIAERESIS |
U+0186 | Ɔ | 58 | 58 | 58 | 58 | LATIN CAPITAL LETTER OPEN O |
U+0190 | Ɛ | 70 | 70 | 70 | 70 | LATIN CAPITAL LETTER OPEN E |
U+0254 | ɔ | 8144 | 8144 | 8144 | 8144 | LATIN SMALL LETTER OPEN O |
U+025B | ɛ | 11951 | 11951 | 11951 | 11951 | LATIN SMALL LETTER OPEN E |
U+0269 | ɩ | 993 | 993 | 993 | 993 | LATIN SMALL LETTER IOTA |
U+028B | ʋ | 1443 | 2765 | 2765 | 2765 | LATIN SMALL LETTER V WITH HOOK |
U+02BC | ʼ | NULL | 20032 | 20015 | 20015 | MODIFIER LETTER APOSTROPHE |
U+02D7 | ˗ | NULL | 31260 | 31260 | 31260 | MODIFIER LETTER MINUS SIGN |
U+02EE | ˮ | NULL | 7844 | 7844 | 7844 | MODIFIER LETTER DOUBLE APOSTROPHE |
U+0304 | ◌ ̄ | 1 | NULL | NULL | NULL | COMBINING MACRON |
U+0308 | ◌ ̈ | 3269 | 1913 | 1913 | 1913 | COMBINING DIAERESIS |
U+03CB | ϋ | 1322 | NULL | NULL | NULL | GREEK SMALL LETTER UPSILON WITH DIALYTIKA |
U+2013 | – | 1065 | NULL | NULL | NULL | EN DASH |
U+2018 | ‘ | 12285 | NULL | NULL | NULL | LEFT SINGLE QUOTATION MARK |
U+2019 | ’ | 748 | NULL | NULL | NULL | RIGHT SINGLE QUOTATION MARK |
U+201A | ‚ | 7 | NULL | NULL | NULL | SINGLE LOW-9 QUOTATION MARK |
U+201C | “ | 4306 | NULL | NULL | NULL | LEFT DOUBLE QUOTATION MARK |
U+201D | ” | 123 | NULL | NULL | NULL | RIGHT DOUBLE QUOTATION MARK |
U+2022 | • | 13 | NULL | NULL | NULL | BULLET |
U+2026 | … | 7 | 7 | 7 | 7 | HORIZONTAL ELLIPSIS |
U+2039 | ‹ | 142 | NULL | NULL | NULL | SINGLE LEFT-POINTING ANGLE QUOTATION MARK |
U+203A | › | 140 | NULL | NULL | NULL | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK |
U+A78A | ꞊ | NULL | 5458 | 5458 | 5458 | MODIFIER LETTER SHORT EQUALS SIGN |
U+FEFF | | 58 | NULL | NULL | NULL | ZERO WIDTH NO-BREAK SPACE |
U+FFF9 | | 17 | NULL | NULL | NULL | INTERLINEAR ANNOTATION ANCHOR |
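A table of this kind can be regenerated from any stage of the corpus with a few lines of code. The sketch below counts characters and looks up their Unicode names using only the Python standard library; `character_table` is a hypothetical name, and control characters, which have no formal name in the `unicodedata` module, are labelled "(no name)".

```python
from collections import Counter
import unicodedata

def character_table(text):
    """Return (code point, count, Unicode name) rows sorted by code point."""
    counts = Counter(text)
    rows = []
    for ch in sorted(counts):
        # Control characters such as U+000A have no name in unicodedata.
        name = unicodedata.name(ch, "(no name)")
        rows.append((f"U+{ord(ch):04X}", counts[ch], name))
    return rows

for code, count, name in character_table("\u02D7Pamɛbhamɛ\n"):
    print(f"{code} | {count} | {name}")
```

Running it on the text before and after each cleanup step gives the per-column counts, and rows that drop to zero correspond to the NULL cells.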
Valentin Vydrin vydrine[at]gmail[dot]com
Provided the corpus. Issues of the Eastern Dan local newspaper ˗Pamɛbhamɛ were provided as a series of .doc files. Three translated texts (translated portions of While waiting for a medical doctor) were provided as a series of .txt files in related folders: moyan-sanni_ko_dhotroo, moyan-waa_won, moyan-yii_to_gu.
One .doc file was provided with 22 short (single paragraph length) parallel texts (Eastern Dan - French). A copy of the New Testament was also provided but is not included in this corpus for copyright reasons.
Hugh Paterson III sil.linguis[at]gmail[dot]com
Converted the files following the steps in the File types > Converted files section.
[gG]weta*.doc: these are the original files provided by VV.
[gG]weta*.pdf: these are PDFs generated with MS Word by Rebecca Paterson from files provided by VV.
[gG]weta*.txt: these files were generated by Hugh Paterson using pdftotext.
*-sfm.txt: these files have a hand-coded structure that includes markers for things like newspaper title, volume, date, tagline, article, heading 1, heading 2, and text of article:
\newspaper ˗Pamɛbhamɛ
\volume-eng 001
\volume-or "Nimlʋʋ : 00x---
\date 2005 'Zë Zë -kwɛ
\tagline "su –bha ‘sëëdhɛ -mü "Gwɛɛtaawo
\body
\article 1
\heading 1
\heading 2
\p 1
Three folders containing some .txt files are held in the While-waiting-for-a-medical-doctor directory.
- moyan-sanni_ko_dhotroo
- moyan-waa_won
- moyan-yii_to_gu
The folder sil-pua contains TECkit files for converting deprecated code points in SIL's PUA area to their accepted and final Unicode code points.
The following transforms were performed on the original files to extract the text from the originally provided formats, and to clean up character inconsistencies, so that corpus analysis for text input could be optimized. The code presented here is not always exactly what was used. For exact code consult generate-corpus.bash, which is also fairly well annotated.
All of the following commands can be executed by running the generate-corpus.bash script. The final product will be dan-typing-corpus.txt.
The issues of ˗Pamɛbhamɛ (provided as [gG]weta*.doc) were converted to PDFs by opening them in Microsoft Word 16.13.1 (180523) on macOS 10.13.3. The operating system Print option was invoked, and the "Save as PDF" option was used. The PDFs were transferred to an Ubuntu machine where pdftotext was used to extract the text to .txt files. The resulting text files were then concatenated into a single file, combined-gweta-text.txt, using the following commands on Ubuntu 16.04 ($ represents the start of the command line, and the command was executed from the root of this repo):
$ cp $( find ./*Pam*/*weta*/*weta*.pdf ) . && for f in *weta*.pdf; do pdftotext $f mass-text_$f.txt; done && rm *.pdf && cat mass-text*.txt >> combined-gweta-text.txt && rm mass-text_*.txt
Each of the three sets of files in the directory While-waiting-for-a-medical-doctor was concatenated with the following:
$ cp $( find ./While-waiting-for-a-medical-doctor/*moyan-*/*moyan-*.old.txt ) . && cat moyan-sanni*.old.txt >> combined-moyan-sanni_ko_dhotroo.old.txt && cat moyan-yii*.old.txt >> combined-moyan-yii_gu.old.txt && cat moyan-waa*.old.txt >> combined-moyan-waa_won.old.txt && rm moyan-*.old.txt
These files were then visually inspected in the text editor Atom prior to further processing. Upon visual inspection, HTML-style heading tags <h> and </h> were noticed.
The combined issues of ˗Pamɛbhamɛ and the three files representing While waiting for a medical doctor were then concatenated into the same file for character level processing.
$ cat combined-*.txt >> proof-of-concept-text.txt && rm combined-*.txt
- TECkit was used to make sure that all deprecated PUA code points were moved to current (Unicode 10) code points.
$ txtconv -i proof-of-concept-text.txt -o proof-no-PUA.txt -t sil-pua/SILPUA.tec -if utf8 -of utf8
- Remove all BOM marks (they were created or concatenated into the middle of the file with the cat command).
$ cat proof-no-PUA.txt | perl -CS -pe 's/\N{U+FEFF}//g' > proof-no-PUA-no-BOM.txt
- Make sure all the text is encoded as UTF-8 and normalized as NFC (by way of NFD). Each step reads the file written by the previous one.
$ cat proof-of-concept-text.txt | uconv -x any-nfd > proof-of-concept-text-nfd.txt
$ cat proof-of-concept-text-nfd.txt | uconv -x any-nfc > proof-of-concept-text-nfc.txt
$ rm proof-of-concept-text.txt
$ rm proof-of-concept-text-nfd.txt
$ mv proof-of-concept-text-nfc.txt proof-of-concept-text.txt
- Markup tags were removed from the text with search and replace. <h> and </h> were replaced with nothing (simple delete), although the following could be used instead:
$ sed -e 's/<[^>]*>//g' proof-no-PUA-no-BOM.txt > proof-no-PUA-no-BOM-no-TAGS.txt
In fact, the script uses this sed command.
In the course of text production several different look-alike characters have been used. This is common for languages that do not have a keyboard layout that restricts (or guarantees the consistency of) the characters used to produce texts in that language.
- Correct equal signs
Replace normal equal sign 〈=〉 U+003D with letter equal sign 〈꞊〉 U+A78A.
$ cat proof-no-PUA-no-BOM-no-TAGS.txt | perl -CS -pe 's/\N{U+003D}/\N{U+A78A}/g' > Corrected-equal.txt
- Replace Non-breaking space 〈 〉 U+00A0 'NO-BREAK SPACE' with normal space 〈 〉 U+0020 'SPACE'; target 374 instances.
$ cat Corrected-equal-letterU-nbs-comma.txt| perl -CS -pe 's/\N{U+00A0}/\N{U+0020}/g' > Corrected-equal-letterU-nbs-comma-bs.txt
- Correct bad non-breaking hyphens. A known issue (as described in this ScriptSource blog post) is that MS Word saved the non-breaking hyphen as x1E. This was then interpreted as \00 \1E, that is 〈 〉 U+001E 'INFORMATION SEPARATOR TWO'. The character was meant to be a non-breaking hyphen 〈‑〉 U+2011 'NON-BREAKING HYPHEN', but for Eastern Dan it should actually be 〈˗〉 U+02D7 'MODIFIER LETTER MINUS SIGN'.
$ cat Corrected-equal-letterU.txt| perl -CS -pe 's/\N{U+001E}/\N{U+02D7}/g' > Corrected-equal-letterU-nbs.txt
- Correct sequences of comma-dieresis, via the correct spelling of that word. To find the misspelled words:
$ grep -n -P "\x{2C}\x{0308}" proof-of-concept-text.txt
To replace them:
$ sed -e 's/ʋ,̈/ʋ̈,/g' -i proof-of-concept-text.txt
- Correct the misuse of Greek small letter upsilon with dialytika
U+03CB 〈ϋ〉 'GREEK SMALL LETTER UPSILON WITH DIALYTIKA'; target 1322 instances.
Visual similarities between U+03CB and U+028B + U+0308 have led some to use UPSILON WITH DIALYTIKA instead of LATIN SMALL LETTER V WITH HOOK + COMBINING DIAERESIS. In the corpus this is only attested in lower case. It becomes a problem when a tool is used to convert lower case to upper case (as is common in text processing and word processing tools, or on the command line as shown here), because U+03CB is paired with U+03AB 〈Ϋ〉 rather than with U+01B2 + U+0308 〈Ʋ̈〉.
$ cat some-file-in-Eastern-Dan.txt | perl -CS -pe '$_ = uc' > display-file-as-uppercase.txt
An earlier attempt, s/\p{Ll}/\p{Lu}/g, does not work: \p{Lu} is a character class for matching and has no meaning on the replacement side of a substitution, so a case-mapping function such as uc is needed.
Note: tr '[:lower:]' '[:upper:]' doesn't work here because tr's POSIX classes do not cover characters above the ASCII range; Unicode character properties are needed.
Fix the text with:
$ sed -e 's/ϋ/ʋ̈/g' -i proof-of-concept-text.txt
- Fix bad single-quote-like characters. Correct the non-letter apostrophes 〈'〉 U+0027, 〈’〉 U+2019, and 〈‘〉 U+2018 to the letter apostrophe 〈ʼ〉 U+02BC. To move all of these characters to the letter apostrophe we use the following:
$ perl -CS -pe 's/\N{U+0027}/\N{U+02BC}/g'
and
$ perl -CS -pe 's/\N{U+2019}/\N{U+02BC}/g'
and
$ perl -CS -pe 's/\N{U+2018}/\N{U+02BC}/g'
- Fix bad double quotes
(How do we keep the "good" double quotes?) Corrected non-letter double quote 〈"〉 U+0022, 〈”〉 U+201D, and 〈“〉 U+201C to 〈ˮ〉 U+02EE MODIFIER LETTER DOUBLE APOSTROPHE.
Let's move instances of 〈”〉 U+201D to 〈ˮ〉 U+02EE
$ sed -e 's/”/ˮ/g' -i proof-of-concept-text.txt
Let's move instances of 〈“〉 U+201C to 〈ˮ〉 U+02EE
$ sed -e 's/“/ˮ/g' -i proof-of-concept-text.txt
Let's move instances of 〈"〉 U+0022 to 〈ˮ〉 U+02EE
$ sed -e 's/"/ˮ/g' -i proof-of-concept-text.txt
- Correct double instances of apostrophe to proper quote marks.
Let's move double instances of 〈ʼ〉 U+02BC to 〈ˮ〉 U+02EE
$ sed -e 's/ʼʼ/ˮ/g' -i proof-of-concept-text.txt
- French Quotes
This seems to fix the typos that arise from not having access to the correct character on a keyboard, although the 1994 reader does use the 〈<〉 glyph instead of the 〈‹〉 glyph. We take this to be a typo in the book. Interestingly, 45 instances of 〈<〉 are still left if 〈<<〉 is first converted directly to 〈«〉. Some of these are obviously quote marks, but not all of them, and most of them have no closing mark, so their status remains unclear.
$ sed -e 's/</‹/g' -i proof-of-concept-text.txt
$ sed -e 's/>/›/g' -i proof-of-concept-text.txt
Fix cases of doubled single French quotes, where two single guillemets typed together were meant to look like one double guillemet.
$ sed -e 's/‹‹/«/g' -i proof-of-concept-text.txt
$ sed -e 's/››/»/g' -i proof-of-concept-text.txt
- Correct minus signs
Underscore 〈_〉 U+005F, en dash 〈–〉 U+2013, and hyphen-minus 〈-〉 U+002D are used to represent what is supposed to be 〈˗〉 U+02D7 'MODIFIER LETTER MINUS SIGN'. A blanket replacement is too greedy, because hyphen-minus between numbers is an appropriate use of that character.
LOW LINE 〈_〉 U+005F (underscore) is a simple case with only one instance.
$ grep -n -P "_" proof-of-concept-text.txt
$ sed -e 's/_/˗/g' -i proof-of-concept-text.txt
Visual inspection via grep shows that most dashes 〈–〉 U+2013 'EN DASH' should be 〈˗〉 U+02D7.
$ grep -n -P "–" proof-of-concept-text.txt
Seven cases are ambiguous.
$ grep -n -P "\s–\s" proof-of-concept-text.txt
In each of these cases, it was decided to move the dash to the right and have it connect with the following word.
1105:doseng ta –sü ‘gü, kö – a
2721:Pë "bin ‘ö ya –a – ga –sê ‘ka kö ziaan ‘ö dho –Pamɛbhamɛ =plöö ‘ü- -zɔn
3336:Pë "bin ‘ö ya –a – ga, -a –blɛɛsü bha, -a ‘klɔɔ- mɔɔ- kö ‘ü- -ya ü –kɔ “sɔɔ. –A do ‘bha –yö nu
3958:Pë "bin ‘ö ya –a – ga –sê ‘ka kö ziaan ‘ö dho –Pamɛbhamɛ =plöö ‘ü- -zɔn
4661:Pë "bin ‘ö ya –a – ga –sê ‘ka kö ziaan ‘ö dho –Pamɛbhamɛ =plöö ‘ü- -zɔn
5485:Pë "bin ‘ö ya –a – ga –sê ‘ka kö ziaan ‘ö dho –Pamɛbhamɛ =plöö ‘ü- -zɔn
15696:'Yö 'wo- zü bho sënnë -ta. 'Yö mɛ 'gbɛ -dede 'wo "yɩɩ to "kɛɛ 'yö 'sɔng- (-a bhɔ -yö =gblɛɛn 'ka =ni) 'yii "yɩɩ 'to. Ö bhɔ 'gü =në- -vin doseng. -Aga 'yö 'wo sënnë kun 'wo- zë 'wo- -kpa 'wo- -bhö. "Kɛɛ =dhɛ 'ö -kë =dhɛ -a –nu 'gu 'yii dɔ bha, 'yö 'wo- pö laa -bhö -laa – dhɛ -yö ö -bha bho. -A -bha zü bho -dhɛ bha 'yö -kë "yɩɩ "yɩɩ -sü mɛ =gban 'gü, " kɛɛ 'sɔng- 'yii "yɩɩ 'to 'zü.
Then all remaining en dashes were turned into the modifier letter minus sign.
$ sed -e 's/–/˗/g' -i proof-of-concept-text.txt
Hyphen-minus is more complex, because it is used correctly with numbers, and because there are misspellings, mostly tone marks that are separated from their words. However, of the 26 cases of detached hyphen-minus that occur in the corpus, some pattern with the detached en dash, so a real use case for a dash could perhaps be argued.
$ grep -n -P "\s-\s" proof-of-concept-text.txt | wc -l
$ grep -n -P "\s[–-]\s" proof-of-concept-text.txt
318:ʼwii kë - a ʼwɔn ma
1105:doseng ta –sü ʼgü, kö – a
1188:dhɛ - dedewo ʼyö- nuwɛɛ bho. ꞊Ya ʼgo mü
1761:ʼwo - -ya ʼkɔɔdhö bha, -a
2089:-kɔlookota -nu ʼö ʼwo - ya
2721:Pë ˮbin ʼö ya –a – ga –sê ʼka kö ziaan ʼö dho –Pamɛbhamɛ ꞊plöö ʼü- -zɔn
3000:pö -nu bha- -nu ʼgü kö - bha, -a -nu -bha. -Wo
3336:Pë ˮbin ʼö ya –a – ga, -a –blɛɛsü bha, -a ʼklɔɔ- mɔɔ- kö ʼü- -ya ü –kɔ ˮsɔɔ. –A do ʼbha –yö nu
3858:ˮMaa -dhɛ, ꞊Wegine - -dhöökpö -zuö -sü -nu
3860:-Dukwitaa - ʼka, -a ʼdhö, ꞊naɔ yö -kɔ
3862:-dhɛ, - -nu, ʼwɔn -nu ʼö ʼwo kë sië
3958:Pë ˮbin ʼö ya –a – ga –sê ʼka kö ziaan ʼö dho –Pamɛbhamɛ ꞊plöö ʼü- -zɔn
4520:ʼwo - pö ꞊dhɛ ˮsɛ ˮgla -sü
4661:Pë ˮbin ʼö ya –a – ga –sê ʼka kö ziaan ʼö dho –Pamɛbhamɛ ꞊plöö ʼü- -zɔn
5485:Pë ˮbin ʼö ya –a – ga –sê ʼka kö ziaan ʼö dho –Pamɛbhamɛ ꞊plöö ʼü- -zɔn
6100:Pë ˮbin ʼö ya -a - ga -sê ʼka ; -a do –zë ʼka -dhɛ ˮsaaga –ya –bha. -A -nu mɔɔ
6608:Pë ˮbin ʼö ya -a - ga -sê ʼka .-A do -zë ʼka -dhɛ ˮsaaga -ya -bha. -A -nu mɔɔ-
7089:Pë ˮbin ʼö ya -a - ga -së ʼka .-A do –zë ʼka -dhɛ-ya –bha saaga. –A –nu mɔɔ-
7521:Pë ˮbin ʼö ya -a - ga -sê ʼka -A do –zë ʼka –dhɛ -yö ˮsaga. –A –nu
8255:Pë ˮbin ʼö ya -a - ga -së ʼka , -a do -zë ʼka -dhɛ -yö ˮsaɔdo. -A -nu mɔɔ-
9027:Pë ˮbin ʼö ya -a - ga -së ʼka , -a do -zë ʼka -dhɛ -yö ˮsaɔdo. -A -nu mɔɔ-
12006:kwa zuëˮ ʼdhö dɔ- - ˮta ʼkpɔ.
12517:ʼdhö, ʼyö dho Gana - ʼyaa kë ˮdhinaa ʼka. ʼMɛ
12522:-A -gɛn - tongtongdhö. -Ya -kun
12523:blɛɛsü -mü ꞊dhɛ, Gana - ö -bha ʼö dho ʼö
12524:sɛ bha ꞊në ʼö -kë mɛtii - ʼyaannu.
12531:depanngdanngsü bha, - -ya -wɔn -bha -së -dede
13824:ˮSu : - Zroo -Kwɛ : 2009
14307:ˮSu : - Zroo -Kwɛ : 2009
14888:ʼNë ʼgbɛ -dhɛ -wo mü ʼö ˮgblü ziö -ya yö -a –nu -bha ʼö - -nu -gɔ ʼö to- ʼgü. -A -gɛn -mü ꞊dhɛ ˮyi ꞊ya ʼgo -a -nu kwi ʼgü. ˮYi -bha -go mɛ ʼgü -sü bha, -a ʼgbɛ -dhɛ -yö -sü ˮgblü ziö ˮgbɩgbɩ -nu në- -a -nu -bha. -Ziaanwo kö -pë -yö -da –a ʼgü, kö -a -ta -kpɛɛ ꞊ya dɔ do. ˮYi yö -mɔɔ -a -bha ʼö go mɛ ˮgblʋ̈gblʋ̈ -nu kwa kwi ʼgü, ˮkɛɛ ʼnë -nu ꞊në -a -nu -bha ʼdhö ˮgbɩɩ-. ˮYua bha, -ya -nu -zë ˮvaandhö ˮvaandhö. ꞊Ya kë ˮdhʋ̈, kö -yö -së kö ʼmɛ ʼö ˮyi ʼö ʼgo sië -a kwi ʼgü, ʼkwa -a -kɔ dɔ. Kwa -dho -a -kɔ dɔ- ʼmü ꞊dhɛ ?
14928:• -Ka gwɛ bhɛ ʼö go ö -dhü ʼgü -dee ʼgü bha, -a ʼsü. -Kaa ˮkɛɛ bho- - bha, kö ʼka- -da ˮyi ʼö ˮsukadhu ʼdhö- -bha, -a ꞊bhaa. ꞊Ya ʼma- -bha ꞊së ʼka, kö ʼka- mü.
14988:꞊Dhɛ ʼö- -nu ꞊gban ʼwo wo bo pë -bhö -sü ʼka ꞊dhɛ -kɔ bha- ʼdhö, ʼyö ʼwo dho ˮtan bha- ʼka ʼwo ꞊loo- ʼka ʼpö- bha- ʼgü. -A pö -sü nü ʼö ꞊Geetiinë, kö dhebë bha -waa nu- -nu -dhɛ, ꞊wa nu- ʼka -gblüdë Laabhölaa -dhɛ. ʼWɔn bha- ˮdhia -ma -gblüdë ʼgü -sü bha- -wɔn ʼgü, ʼyö ꞊gbauu ga ʼö -kë ꞊ni -a -da zöng -bha wü ˮpɛpɛ ꞊gban wëëdhö, -a -zo bhɔ ʼö ʼyii kë wo ʼtɔ ʼö bha ʼka ʼö- wo ʼyi bha- ʼka bha, -a -wɔn ʼgü. ˮTʋ̈ng bha- ʼgü, kö ꞊gbauu bha, - a ˮdhiʋ̈ -zian -yö ˮpuu, kö- ꞊taama -dhɛ -yö -tii. Kweɩˮ ʼdhu sɔ -mü ʼö ʼpödö -nu ʼwo- -da ˮwlaan- yi -nu ʼwo -kë : dhe ʼsü -sü -nu, ʼgbaannë troo -nu nu... ʼka, -a -nu -ta. Sɔ suu ʼö ˮdhʋ̈ bha -yö -tun ꞊kö ꞊dɛɛ ꞊Yaoba -nu kwaa- ˮsɛ ʼgü. A suu -yö ʼgbɛ. ʼWɔndɔmɛ -nu nü ʼö ʼwo gun -a -da sië ꞊dhɛ -kɔ ʼö ʼkwa- yö sië- ʼka zöng -gɔ ya- ʼdhö.
15696:ʼYö ʼwo- zü bho sënnë -ta. ʼYö mɛ ʼgbɛ -dede ʼwo ˮyɩɩ to ˮkɛɛ ʼyö ʼsɔng- (-a bhɔ -yö ꞊gblɛɛn ʼka ꞊ni) ʼyii ˮyɩɩ ʼto. Ö bhɔ ʼgü ꞊në- -vin doseng. -Aga ʼyö ʼwo sënnë kun ʼwo- zë ʼwo- -kpa ʼwo- -bhö. ˮKɛɛ ꞊dhɛ ʼö -kë ꞊dhɛ -a –nu ʼgu ʼyii dɔ bha, ʼyö ʼwo- pö laa -bhö -laa – dhɛ -yö ö -bha bho. -A -bha zü bho -dhɛ bha ʼyö -kë ˮyɩɩ ˮyɩɩ -sü mɛ ꞊gban ʼgü, ˮ kɛɛ ʼsɔng- ʼyii ˮyɩɩ ʼto ʼzü.
Since we already removed the space after the en dash in these overlapping cases, we will do the same for the corresponding cases with hyphen-minus. Other cases, however, obviously need to go in the other direction (attach left rather than right). So we are going to try to attach these as well.
Minus is used with numbers.
$ grep -n -P "\d-" proof-of-concept-text.txt
515:ʼSëëdhɛ "pɛpɛ -nu ʼö ʼwo bha -ka -dho -kpan -a -nu -bha -blɛɛsü ʼgü, "Biya, ʼSilö. A "nimlʋʋ -mü 22-43-12-72 ʼka.
1234:ʼwo bha -ka -dho -kpan -a -nu -bha -blɛɛsü ʼgü, "Biya, ʼSilö. -A "nimlʋʋ -mü 22-
5483:07-17-19-38
5493:‘Ka dho –kpan –a ˮdhɔɔ -bha –bha ‘mɛ ‘ö- ˮpiʋ̈ ˮMaadhö, -wa –dhɛ ˮZɛ Emaniɛɛ. –A –bha tiootioo ˮnimlɔɔ ꞊nɛ: 07-17-19-38
6616:ˮnimlɔɔ ꞊nɛ: 07-17-19-38
9593:ˮsɔɔdhu -bha (11-15),
12433:ʼö yö- ʼka -a -kaɔng do (1-
13851:-kaɔng do (1-10) -bha
16271:ʼSëëdhɛ "pɛpɛ -nu ʼö ʼwo bha -ka -dho -kpan -a -nu -bha -blɛɛsü ʼgü, "Biya, ʼSilö. A "nimlʋʋ -mü 22-43-12-72 ʼka.
16990:ʼwo bha -ka -dho -kpan -a -nu -bha -blɛɛsü ʼgü, "Biya, ʼSilö. -A "nimlʋʋ -mü 22-
21239:07-17-19-38
21249:‘Ka dho –kpan –a ˮdhɔɔ -bha –bha ‘mɛ ‘ö- ˮpiʋ̈ ˮMaadhö, -wa –dhɛ ˮZɛ Emaniɛɛ. –A –bha tiootioo ˮnimlɔɔ ꞊nɛ: 07-17-19-38
22372:ˮnimlɔɔ ꞊nɛ: 07-17-19-38
25349:ˮsɔɔdhu -bha (11-15),
28189:ʼö yö- ʼka -a -kaɔng do (1-
29607:-kaɔng do (1-10) -bha
This search shows that there are several instances of hyphen-minus used with numbers. Whether these should be dashes or minus signs is not completely clear. The relevant question for keyboard layout design is whether a Dan keyboard should require either a 109-key keyboard (i.e. with a keypad) or the use of a function key in lieu of a directly accessible minus sign.
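We can target all hyphen-minus signs that are preceded by whitespace (or stand at the start of a line) and are followed by a character that is neither a digit nor whitespace. This should give us all word-initial minus signs. Perl is used for the replacement because sed does not support \d, \S, or lookarounds, and because matching the preceding space as part of the pattern would delete it.
$ grep -n -P "(?<!\S)-(?=[^\d\s])" proof-of-concept-text.txt
$ cat proof-of-concept-text.txt | perl -CS -pe 's/(?<!\S)-(?=[^\d\s])/\N{U+02D7}/g' > proof-of-concept-text2.txt
$ rm proof-of-concept-text.txt
$ mv proof-of-concept-text2.txt proof-of-concept-text.txt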
$ grep -n -P "\s-\s\D[^ʼ]" proof-of-concept-text.txt
- Remove U+2022 〈•〉 BULLET
There are only 13 instances. It is unlikely that this character is best accessed through a keyboard. So we will drop it from the corpus.
$ sed -e 's/•//g' -i proof-of-concept-text.txt
- Corrected bad commas 〈,〉
There were several 〈‚〉 U+201A 'SINGLE LOW-9 QUOTATION MARK' characters; these were changed to the regular comma 〈,〉 U+002C 'COMMA'.
$ cat Corrected-equal-letterU.txt| perl -CS -pe 's/\N{U+201A}/\N{U+002C}/g' > Corrected-equal-letterU-nbs-comma.txt
- Space padded full stop 〈.〉
There are 25 instances of U+002E 〈.〉 FULL STOP with a space on both sides. This is fixed so that the full stop has no space between it and the preceding word.
$ grep -n -P -- "\s[.](?=\s)" proof-of-concept-text.txt | wc -l
$ perl -CS -pe 's/\s[.](?=\s)/\N{U+002E}/g'
- Space padded Comma 〈,〉
There are 56 instances of U+002C 〈,〉 COMMA with a space on both sides. This is fixed so that the comma has no space between it and the preceding word.
$ grep -n -P -- "\s[,](?=\s)" proof-of-concept-text.txt | wc -l
$ perl -CS -pe 's/\s[,](?=\s)/\N{U+002C}/g'
- Remove bad line encodings
Different operating systems use different characters to mark line endings. We are going to regularize these.
Move U+000A 〈 〉 'LINE FEED' to U+000D 〈 〉 'CARRIAGE RETURN' (Enter/Return).
$ cat proof-of-concept-text.txt | perl -CS -pe 's/\N{U+000A}/\N{U+000D}/g' > proof-of-concept-text2.txt
- Get rid of wayward U+00A8 Diaeresis and replace it with SPACE
A stray U+00A8 DIAERESIS follows the second a in waa¨ here:
waa¨ʼwëë˗ ˮgblü ˮsɔɔdo
$ cat proof-of-concept-text.txt | perl -CS -pe 's/\N{U+00A8}/ /g' > proof-of-concept-text2.txt
$ rm proof-of-concept-text.txt
$ mv proof-of-concept-text2.txt proof-of-concept-text.txt
- Move form feed to enter/return.
$ cat proof-of-concept-text.txt | perl -CS -pe 's/\N{U+000C}/\N{U+000D}/g' > proof-of-concept-text2.txt
$ rm proof-of-concept-text.txt
$ mv proof-of-concept-text2.txt proof-of-concept-text.txt
- Remove 17 instances of U+FFF9 INTERLINEAR ANNOTATION ANCHOR
$ cat proof-of-concept-text.txt | perl -CS -pe 's/\N{U+FFF9}//g' > proof-of-concept-text2.txt
$ rm proof-of-concept-text.txt
$ mv proof-of-concept-text2.txt proof-of-concept-text.txt
- Remove U+0304 COMBINING MACRON
$ sed -e 's/b̄h/bh/g' -i proof-of-concept-text.txt
Still not completed: replace U+FFF9 with U+00F9 'LATIN SMALL LETTER U WITH GRAVE'; target 34 instances.
$ cat Corrected-equal.txt | perl -CS -pe 's/\N{U+FFF9}/\N{U+00F9}/g' > Corrected-equal-letterU.txt
- Remove French words.
- Figure out what to do with the following:
U+2013 – 1064 EN DASH
U+00E7 ç 21 LATIN SMALL LETTER C WITH CEDILLA
U+00E8 è 221 LATIN SMALL LETTER E WITH GRAVE
One or two non-French cases of mistyping
U+00E9 é 107 LATIN SMALL LETTER E WITH ACUTE
U+00EA ê 28 LATIN SMALL LETTER E WITH CIRCUMFLEX
ʼö ya ˗a ˗ga ˗sê --> e+diaeresis others are french
U+00EE î 3 LATIN SMALL LETTER I WITH CIRCUMFLEX
U+00FB û 26 LATIN SMALL LETTER U WITH CIRCUMFLEX
1 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
2 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
3 Simons, Gary. F., & Charles D. Fennig (Eds.) 2017. Ethnologue: Languages of the World, 20th edition. Dallas, TX: SIL International. Online: https://www.ethnologue.com/language/dnj ↩
4 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
5 Valentin Vydrin. 2012. ISO 639-3 Change Request 2012-083. Online: https://iso639-3.sil.org/request/2012-083. ↩
6 Phillips, A. & M. Davis (Eds.) 2009. Tags for Identifying Languages. Internet Engineering Task Force (IETF). Online: https://tools.ietf.org/html/bcp47. ↩
7 Scannell, Kevin (Ed.) 2009. An Crúbadán - Dan. Saint Louis University, Saint Louis, USA . Online: http://crubadan.org/languages/dnj. ↩
8 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
9 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
10 Baba, Tiémoko Sébastien .1978. Yaobhaa -wo bhe pe -se -ya ʼgu (Receuil de contes yacouba, ʼGwetaa -wo). Société Internationale de Linguistique: Abidjan, Ivory Coast. https://www.sil.org/resources/archives/34532. ↩
11 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
12 Bolli, Margrit & Eva Flik. 1982. Guide d’orthographe pour la langue dan (dialecte gwɛtaawo). Société Internationale de Linguistique,: Abidjan, Ivory Coast. https://www.sil.org/resources/archives/34713. ↩
13 Bolli, Margrit & Eva Flik. 1994. Cours-eclair de lecture pour des lecteurs d français apprenant à lire le Dan (Gwɛɛtaawʋ). Société Internationale de Linguistique: Abidjan, Ivory Coast https://www.sil.org/resources/archives/34670. ↩
14 Bolli, Margrit & Eva Flik. 2000. Rutö. Société Internationale de Linguistique: Abidjan, Ivory Coast. SIL Language and Culture Archive ID: 40701 ↩
15 Bolli, Margrit & Eva Flik. 2000. Zonasö. Société Internationale de Linguistique: Abidjan, Ivory Coast. SIL Language and Culture Archive ID: 40712 ↩
16 Roberts, David, Dana Basnight-Brown & Valentin Vydrin. Marking tone with punctuation: an orthography experiment in Eastern Dan (Côte d’Ivoire). ↩
17 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
18 Vydrin,Valentin & David Roberts. Forthcoming. Tonal oral reading errors in the orthography of Eastern Dan (Côte d’Ivoire). In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
19 Bolli, Margrit & Eva Flik. 1994. Cours-eclair de lecture pour des lecteurs d français apprenant à lire le Dan (Gwɛɛtaawʋ). Société Internationale de Linguistique: Abidjan, Ivory Coast https://www.sil.org/resources/archives/34670. ↩
20 Bolli, Margrit & Eva Flik. 1982. Guide d’orthographe pour la langue dan (dialecte gwɛtaawo). Société Internationale de Linguistique,: Abidjan, Ivory Coast. https://www.sil.org/resources/archives/34713. ↩
21 Moran, Steven & Robert Forkel. 2017 (November 16). cldf/segments: segments 1.2.1 (Version v1.2.1). Zenodo. http://doi.org/10.5281/zenodo.1051157 . ↩
22 SIL NRSI Glossary for Orthography, font and writing system terms . ↩
23 RFC 3986 http://www.ietf.org/rfc/rfc3986.txt. ↩
24 Wikipedia - Numero Sign: Use in French. https://en.wikipedia.org/w/index.php?title=Numero_sign&oldid=842034015#French. ↩
25 RFC 3986 http://www.ietf.org/rfc/rfc3986.txt. ↩
26 W3C. 2017. HTML5. Recommendation. https://www.w3.org/TR/html5/ .↩
27 Github Engineering. 2017. GitHub Flavored Markdown Spec https://github.github.com/gfm/. ↩
28 Bolli, Margrit & Eva Flik. 1982. Guide d’orthographe pour la langue dan (dialecte gwɛtaawo). Société Internationale de Linguistique,: Abidjan, Ivory Coast. https://www.sil.org/resources/archives/34713. ↩
29 Bolli, Margrit & Eva Flik. 1994. Cours-eclair de lecture pour des lecteurs d français apprenant à lire le Dan (Gwɛɛtaawʋ). Société Internationale de Linguistique: Abidjan, Ivory Coast https://www.sil.org/resources/archives/34670. ↩
30 Holm, Wayne. 1971. Navajo Reading Study: Grapheme and unit frequencies in Navajo. Reading Studies progress report № 12. University of New Mexico. https://eric.ed.gov/?id=ED059806. ↩
31 Venezky, Richard. 1970. The structure of English Orthography. (Janua linguarum., Series minor 82). Mouton: The Hague. http://www.worldcat.org/oclc/840415997 ↩
32 Venezky, Richard. 1967. English Orthography: Its graphical structure and its Relation to sound. Reading Research Quarterly. 2 (3): 75-105. ↩
33 Roberts, David & Valentin Vydrin. Forthcoming. Chapter 10: Eastern Dan. In: Tone orthography and reading fluency: the voice of evidence in ten Niger-Congo languages. ↩
34 Bolli, Margert. 1978. Writing tone with punctuation marks. SIL Notes on Literacy. 23: 16-18. ↩
35 Bolli, Margert. 1991. Orthography difficulties to be overcome by Dan people literate in French. SIL Notes on Literacy. 65: 25-34. ↩
36 SIL International. 2018. Best practice when using non-alphabetic characters in orthographies: Helping languages succeed in the modern world. Cover Page: https://www.sil.org/orthography/fonts-and-technical-issues ; PDF: https://www.sil.org/sites/default/files/tone_and_unicode_issues.pdf Accessed: 17 June 2018. ↩
37 Bolli, Margert. 1978. Writing tone with punctuation marks. SIL Notes on Literacy. 23: 16-18. ↩
38 Bolli, Margert. 1978. Writing tone with punctuation marks. SIL Notes on Literacy. 23: 16-18. ↩
39 Hosken, Martin. 2003. Creating an Orthography Description. http://scripts.sil.org/cms/scripts/page.php?cat_id=EncodingPrinciples ↩
40 Constable, Peter G. 2002. Toward a Model for Language Identification Defining an ontology of language-related categories. SIL Electronic Working Papers 2002-003. Dallas, Tx: SIL International. Online: https://www.sil.org/resources/publications/entry/7853 ↩
Copyright claims are unclear. If the authors of the content were employed by SIL, SIL International would be the copyright owner. (This is only relevant because the works themselves do not have copyright claims or licenses attached, but do reference SIL's address.) Otherwise copyright belongs to the authors, or their employer. The authors do not appear to be attributed in the corpus, but they might be in the orthography.
Only copyright owners can license materials. Therefore this content bears no license, as Hugh makes no claims on the content of the corpus, and did not receive the content under license. Use under the fair use doctrine is assumed.
The README.md, which is Hugh Paterson III's contribution, is copyright Hugh Paterson III 2018, and licensed under the Creative Commons Attribution 4.0 License.
The generate-corpus.bash script is also Hugh's contribution and is licensed under the MIT version provided.
Other content, such as the content contained under the folder /SILPUA, is licensed as originally offered (MIT).