homepage of jouni filip maho
[ maho@brevet.nu ]
BANTU LANGUAGES & DESCRIPTIONAL DENSITY
Revised 2008-07-05

I have recently been having some fun with bibliographical data. Specifically, I have tried to determine a simple way to calculate the "descriptional density" for various African languages, especially with regard to grammar descriptions.

Descriptional density (a concept I've invented myself, I think) aims to determine how well-described any given language is in terms of existing grammar books and dictionaries. For instance, if a given language has only one grammar book written about it, and another language has fourteen grammar books written about it, then obviously the latter language is better described than the former. In other words, its descriptional density is higher.

Clearly there are several factors that are relevant to consider when trying to calculate the descriptional density for a language:

  1. number of publications or titles, since two grammars are better than one;
  2. size of description, the rationale being that a 400-page grammar is better than a 40-page sketch, all else being equal;
  3. number of authors involved, since two independently written grammars are better than two books written by the same author;
  4. number of varieties described, since two grammars of two separate dialects cover more than two grammars of the same dialect;
  5. number of years during which the grammars have been published or produced, since two books from two separate time periods are essentially equal to descriptions of two separate varieties, and thus cover more, all else being equal;
  6. availability of the grammar(s), since, in general, a published grammar has a wider distribution than an unpublished manuscript sitting in an archive.

There are no doubt other factors that could or should be taken into account, but the above are the most relevant ones, as far as I can see. However, most of the factors are difficult to operationalize in simple ways. For instance, the size of a grammar book (factor 2) is not by any necessity related to its inherent usefulness, quality or even comprehensiveness. The number of authors (factor 3) is problematic to calculate when there are several anonymous works. Keeping track of whichever dialectal variety a grammar book deals with (factor 4) is well nigh impossible, as many grammars either don't give such specifications or focus on standardized (artificial) varieties. The availability of an item (factor 6) is difficult to determine easily, since many published books, especially older ones, can be just as difficult to access as some old manuscripts, while some manuscripts (and many old books) can easily be accessed via the internet.

There are seemingly only two factors that can be used in calculations without stumbling onto major difficulties, namely the number of titles or works (factor 1) and the time span over which grammars of a given language have been published (factor 5). For these two factors, I will use the symbols W and T, respectively.

Briefly, if we add a value corresponding to the number of titles (a W value) to another value corresponding to the time span (a T value), we should get a sensible value representing the descriptional density of a given language. (Even though this means discarding several seemingly relevant factors, a calculation based on only W and T should still give us a reasonably informative result.)

However, simply adding the number of titles to the number of years will not do, for various reasons. Instead we need to recalculate these figures into something more useful. The mathematical formula itself looks like this (and is explained more fully below):

   \[ DD = W_1 + \tfrac{1}{3} W_2 + \sqrt{T} \]

In general, one grammar book equals a W value of 1. However, many grammar books appear in second, third, fourth and later editions. It seems unintuitive to give a second edition the same weight as a first edition. After all, it's still essentially the same book, albeit with some minor or major revisions. Hence it's convenient to distinguish primary works (W1) from secondary works (W2). While primary works are given a full value of 1, secondary works are given only 1/3 (one third), a value chosen arbitrarily. Indeed, it could be almost any other number or fraction.

Secondary works include more than just second editions. For instance, Hawkinson published three separate titles on Tanzanian Swahili in 1979, namely, Tanzanian Swahili: grammar handbook, Tanzanian Swahili: special skills handbook and Tanzanian Swahili: communication and culture handbook. They all belong together. The most intuitive way to handle these three titles is to count only one of them as primary and the others as secondary. (Alternatively, they could all be counted as part of a single title.)

Similarly, when one and the same author (or author constellation) has published several grammars of one and the same language, I have regarded only one of them as primary, and the others as secondary. In effect, this means that W is not only an indication of the number of works (factor 1) but partly also of the number of authors involved (factor 3).

Moreover, I have counted translations as separate primary works, the rationale being that one English and one French grammar are better than two English grammars. If for no other reason, at least they widen the intended audience. (I'm not convinced this is a good idea, but I can't think of a better way to deal with translations at present.)
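The counting rules for primary and secondary works can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Work` record, its field names and the sample entries are hypothetical and not taken from the bibliography, and "first work per author constellation and language of description is primary" is just one way of reading the rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    authors: frozenset  # the author constellation (hypothetical field)
    year: int           # year of publication
    medium: str         # language the grammar is written in, e.g. "en", "fr"

def count_works(works):
    """Return (W1, W2) for one described language.

    The earliest work per (authors, medium) pair counts as primary; later
    editions and companion volumes by the same authors count as secondary.
    A translation has a different medium and therefore counts as primary.
    """
    seen = set()
    primary = secondary = 0
    for work in sorted(works, key=lambda w: w.year):
        key = (work.authors, work.medium)
        if key in seen:
            secondary += 1
        else:
            seen.add(key)
            primary += 1
    return primary, secondary

# Invented sample data, for illustration only.
works = [
    Work(frozenset({"Author A"}), 1950, "en"),
    Work(frozenset({"Author A"}), 1962, "en"),  # second edition: secondary
    Work(frozenset({"Author A"}), 1970, "fr"),  # translation: primary
    Work(frozenset({"Author B", "Author C"}), 1985, "en"),
]

w1, w2 = count_works(works)
w_value = w1 + w2 / 3  # primary works weigh 1, secondary works 1/3
print(w1, w2, round(w_value, 2))
```

With the invented data this gives three primary works and one secondary work, i.e. a W value of 3.33.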

T, or time span, represents the number of years between the publication of the first and the latest grammar. For instance, my bibliography includes 132 primary works (grammar descriptions) for Swahili. The earliest of these was published in 1850, and the latest in 2006. This gives a time span of 156 years. In order for this number not to inflate the calculations unnecessarily, it needs to be whittled down a bit. Hence I use the square root of the actual time span in the formula.
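As a small worked example, the time span calculation could look like this in Python. The function names are my own, and the list of years contains only the two end points mentioned above, not the full publication record.

```python
import math

def time_span(years):
    """T: number of years between the earliest and the latest publication."""
    return max(years) - min(years)

def t_value(years):
    """Square root of T, the damped figure that enters the calculation."""
    return math.sqrt(time_span(years))

swahili_years = [1850, 2006]  # only the two end points given in the text
print(time_span(swahili_years), round(t_value(swahili_years), 2))  # 156 12.49

# A language with a single publication has no time span at all.
print(t_value([1979]))  # 0.0
```

The square root keeps long traditions from swamping the result: 156 years contribute about as much as twelve and a half primary works, not 156.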

By adding the total number of primary works (i.e. 1 index value for every W1), a third of the total number of secondary works (i.e. 1/3 index value for every W2), and the square root of the time span, I get a total index value representing the descriptional density (DD) for that particular language. The lowest DD value any described language can have is 1: one grammar book by one author (W1=1, W2=0) and a zero time span, since a single publication spans no time (T=0).
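Put together, the whole calculation fits in one short function. The sketch below is a direct transcription of the description above; the function and parameter names are my own, and the weight of 1/3 is left as a parameter since it was chosen arbitrarily.

```python
import math

def descriptional_density(w1, w2, t, secondary_weight=1 / 3):
    """DD = W1 + secondary_weight * W2 + sqrt(T).

    w1: number of primary works (at least 1 for a described language)
    w2: number of secondary works
    t:  time span in years between the first and the latest work
    """
    if w1 < 1:
        raise ValueError("a described language needs at least one primary work")
    if w2 < 0 or t < 0:
        raise ValueError("w2 and t cannot be negative")
    return w1 + secondary_weight * w2 + math.sqrt(t)

# Lowest possible value: one grammar, no secondary works, no time span.
print(descriptional_density(1, 0, 0))  # 1.0

# Swahili, using the figures from the table: W1=132, W2=81, T=156.
print(round(descriptional_density(132, 81, 156), 2))  # 171.49
```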

The table below shows a list of twenty-two Bantu languages ranked according to their DD values. (Note that some of the figures are slightly revised since I first posted this on my blog. The bibliographical data itself can be viewed/downloaded here.)

    LANGUAGE         CODE  DD VALUE  =  TOTAL W  (  W1 / W2 )  +  SQRT T  (   T )
    Swahili          G40     171.49  =   159.00  ( 132 / 81 )  +   12.49  ( 156 )
    Zulu             S42      70.53  =    58.00  (  42 / 48 )  +   12.53  ( 157 )
    Kikongo          H16      67.30  =    48.67  (  45 / 11 )  +   18.63  ( 347 )
    Chewa/Nyanja     N31      51.12  =    39.67  (  31 / 26 )  +   11.45  ( 131 )
    Xhosa            S41      42.15  =    29.00  (  20 / 27 )  +   13.15  ( 173 )
    Shona            S10      41.63  =    31.00  (  26 / 15 )  +   10.63  ( 113 )
    Umbundu          R11      39.61  =    22.00  (  19 /  9 )  +   17.61  ( 310 )
    Setswana         S31      39.08  =    26.00  (  20 / 18 )  +   13.08  ( 171 )
    Lingala          C30b     37.30  =    26.67  (  23 / 11 )  +   10.63  ( 113 )
    Luganda          JE15     31.58  =    20.67  (  17 / 11 )  +   10.91  ( 119 )
    Sesotho          S33      31.45  =    19.00  (  16 /  9 )  +   12.45  ( 155 )
    Bemba            M42      29.18  =    19.33  (  15 / 13 )  +    9.85  (  97 )
    Herero           R31      27.04  =    14.67  (  13 /  5 )  +   12.37  ( 153 )
    North Sotho      S32      25.91  =    15.00  (  14 /  3 )  +   10.91  ( 119 )
    Luba-Kasai       L31      25.82  =    15.33  (  13 /  7 )  +   10.49  ( 110 )
    Kirundi          JD62     24.23  =    14.33  (  12 /  7 )  +    9.90  (  98 )
    Kinyarwanda      JD61     21.54  =    12.00  (   9 /  9 )  +    9.54  (  91 )
    Sukuma           F21      20.87  =    11.33  (  11 /  1 )  +    9.54  (  91 )
    Changana/Tsonga  S53      18.96  =     8.33  (   8 /  1 )  +   10.63  ( 113 )
    Makhuwa          P31      18.35  =     7.67  (   7 /  2 )  +   10.68  ( 114 )
    Kikuyu           E51      17.31  =     7.67  (   7 /  2 )  +    9.64  (  93 )
    Haya             JE22     14.66  =     6.00  (   6 /  0 )  +    8.66  (  75 )
    TOTAL W is W1 plus a third of W2; SQRT T is the square root of the time span T (in years). The codes come from a referential coding system of the Bantu languages.
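The figures in the table can be recomputed from W1, W2 and T. The sketch below does this for five of the rows (copied from the table; the function names are my own). The printed DD values appear to be sums of the already rounded W and square-root components, so a full-precision recalculation can differ by 0.01, as it does for Kikongo and Chewa/Nyanja in this sample.

```python
import math

# (language, W1, W2, T, DD as printed) copied from the table
rows = [
    ("Swahili", 132, 81, 156, 171.49),
    ("Zulu", 42, 48, 157, 70.53),
    ("Kikongo", 45, 11, 347, 67.30),
    ("Chewa/Nyanja", 31, 26, 131, 51.12),
    ("Haya", 6, 0, 75, 14.66),
]

def dd_exact(w1, w2, t):
    """DD computed at full precision."""
    return w1 + w2 / 3 + math.sqrt(t)

def dd_from_rounded_parts(w1, w2, t):
    """DD as the sum of the two components, each rounded to two decimals."""
    return round(w1 + w2 / 3, 2) + round(math.sqrt(t), 2)

for name, w1, w2, t, printed in rows:
    print(f"{name:<13} printed={printed:7.2f} "
          f"rounded-parts={dd_from_rounded_parts(w1, w2, t):7.2f} "
          f"exact={dd_exact(w1, w2, t):7.2f}")
```

The differences are far too small to affect the ranking.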

Notice how the ranking only roughly corresponds to the actual number of grammar descriptions (whether we look at primary works only or primary and secondary works jointly). The most important figure is that determined by W1, i.e. number of primary works, which intuitively feels right. Furthermore, by combining the number of works with the time span involved, we get a more sophisticated picture of how well-described any given language is than had we looked only at number of publications. (The sophistication is actually two-fold, as the formula also distinguishes between primary and secondary works.)
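The difference between the two rankings is easy to see in a small sample. The sketch below takes five adjacent rows from the table (W1 and DD copied as printed) and sorts them once by DD and once by number of primary works alone.

```python
# (language, W1, DD) copied from five adjacent rows of the table
rows = [
    ("Xhosa", 20, 42.15),
    ("Shona", 26, 41.63),
    ("Umbundu", 19, 39.61),
    ("Setswana", 20, 39.08),
    ("Lingala", 23, 37.30),
]

# sorted() is stable, so languages with equal W1 keep their table order.
by_dd = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]
by_w1 = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]

print("by DD:", by_dd)
print("by W1:", by_w1)
```

Umbundu has the fewest primary works of the five but still ranks third by DD, because its publications span 310 years; Lingala has more primary works than three of the others but ends up last.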

The data can also be visualised in a diagram:

DD Diagram

As mentioned, I have only looked at grammar descriptions. For a more comprehensive picture, I also need to look at dictionaries, but that's a project for another sleepless night.

