homepage of jouni filip maho
« BACK TO FRONT | Papers & Stuff | Blog à la Maho | Music | Foto Galleries | Assorted Links

Revised 2008-09-30

I have recently had some fun with bibliographical data. Specifically, I have tried to determine a simple way to calculate the "descriptional density" for various African languages, especially with regard to grammar descriptions.

Descriptional density (a concept I've invented myself, I think) aims to determine how well-described any given language is in terms of existing grammar books and dictionaries. For instance, if a given language has only one grammar book written about it, and another language has fourteen grammar books written about it, then obviously the latter language is better described than the former. In other words, its descriptional density is higher.

Clearly there are several factors that would be relevant to consider when trying to calculate the descriptional density for any given language:

  1. Number of publications or titles.
    Two grammars are better than one.
  2. Size of description.
    In general, a 400-page grammar is better than a 40-page sketch, all else being equal.
  3. Number of authors involved.
    Two independently written grammars offer a "wider latitude" (i.e. more points of view) than two books written by the same author.
  4. Number of varieties described.
    Two grammars of two separate dialects cover more than two grammars of the same dialect.
  5. Number of years during which several descriptions have been produced.
    Two descriptions from two separate time periods are essentially equal to descriptions of two separate varieties, and thus they cover more structural data, all else being equal.
  6. Impact (distribution and availability) of the description(s).
    In general, a published grammar reaches a wider audience than an unpublished manuscript. By reaching a wider audience, a description is read and evaluated by more people, which establishes its credibility. (Admittedly this is only indirectly relevant, but relevant nonetheless.)

There are no doubt other factors that could or should be taken into account, but the above are the most relevant ones, as far as I can see. However, most of the factors listed above are difficult to operationalize in simple ways. For instance, the size of a grammar book (factor 2) is not necessarily related to its inherent usefulness, quality or even comprehensiveness. The number of authors (factor 3) is problematic to calculate for anonymous works, including those credited to so-called corporate authors, and it is unclear whether main informants (assuming they are mentioned at all) should count as co-authors. Keeping track of which dialectal variety a grammar book deals with (factor 4) is well-nigh impossible, since many grammars either give no such specification or focus on standardized (artificial) varieties. The availability of an item (factor 6) is hard to determine, since many published books, especially older ones, can be just as difficult to access as some old manuscripts, while some manuscripts (and many old books) can easily be accessed via the internet.

There are seemingly only two factors that can be used in calculations of descriptional density without stumbling into major difficulties, namely the number of titles or works (factor 1) and the length of time over which descriptions of a given language have been published, or time span (factor 5). For these two factors, I will use the symbols W and T, respectively.

Briefly, if we add a value corresponding to the number of titles (a W value) to another value corresponding to the time span (a T value), we should get a sensible value representing the descriptional density of a given language. (Even though this means discarding several seemingly relevant factors, a calculation based on only W and T should still give us a reasonably informative result.)

However, simply adding the number of titles to the number of years will not do, for various reasons. Instead we need to recalculate these figures into something useful. The mathematical formula itself looks like this (and is explained more fully below):

$$DD = W_1 + \frac{W_2}{3} + \sqrt{T}$$

In general, one grammar book equals a W value of 1. However, many grammar books appear in second, third, fourth and later editions. It would seem unintuitive to give a second edition the same weight as a first edition. After all, it's still essentially the same book, albeit with some minor or major revisions. Hence we need to distinguish primary works (W1) from secondary works (W2). While primary works (e.g. first editions) are given a full value of 1, secondary works (e.g. non-first editions) are given only 1/3 (one third). This latter value was chosen arbitrarily; indeed, it could have been almost any other number or fraction.

Secondary works (or the W2 stuff) include more than just second editions. For instance, Hawkinson published three separate titles on Tanzanian Swahili in 1979, namely, Tanzanian Swahili: grammar handbook, Tanzanian Swahili: special skills handbook and Tanzanian Swahili: communication and culture handbook. They all belong together. The most intuitive way to handle these three titles is to count only one of them as primary and the others as secondary. (Alternatively, they could all be counted as part of a single title.)

Similarly, when one and the same author (or author constellation) has published several grammars of one and the same language, I have regarded only one of them as primary, and the others as secondary. In effect, this means that W is not only an indication of the total number of works (factor 1) but partly also of the number of authors involved (factor 3).

Moreover, I have counted translations as separate primary works, the rationale being that one English and one French grammar are better than two English grammars. If for no other reason, at least they widen the intended audience; cf. factor 6 above. (I'm not convinced this is a good idea, but I can't think of a better way to deal with translations at present.)
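The counting rules above can be sketched in a few lines of Python. This is only a rough illustration, not the procedure actually used for the bibliography: the `Work` record, its field names and the `count_works` function are all made up for the example, and the entries by "Author A" are hypothetical. The three Hawkinson titles are the ones mentioned above. The sketch assumes that a work is primary if it is a first edition and the same author has no earlier primary work written in the same language; everything else is secondary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    author: str
    title: str
    year: int
    edition: int = 1         # 1 = first edition
    written_in: str = "en"   # language the grammar is written in (assumed field)


def count_works(works):
    """Return (W1, W2) under the assumptions stated in the text above."""
    has_primary = set()  # (author, written_in) pairs that already have a primary work
    w1 = w2 = 0
    for work in sorted(works, key=lambda w: (w.year, w.edition)):
        key = (work.author, work.written_in)
        if work.edition == 1 and key not in has_primary:
            has_primary.add(key)
            w1 += 1          # first grammar by this author in this language
        else:
            w2 += 1          # later edition or further title by the same author
    return w1, w2


works = [
    # Titles from the text: only one of the three counts as primary.
    Work("Hawkinson", "Tanzanian Swahili: grammar handbook", 1979),
    Work("Hawkinson", "Tanzanian Swahili: special skills handbook", 1979),
    Work("Hawkinson", "Tanzanian Swahili: communication and culture handbook", 1979),
    # Hypothetical entries: a first edition, a second edition, a French translation.
    Work("Author A", "A grammar", 1900),
    Work("Author A", "A grammar", 1910, edition=2),
    Work("Author A", "Une grammaire", 1920, written_in="fr"),
]

w1, w2 = count_works(works)
print(w1, w2)
```

With these six entries the sketch counts three primary and three secondary works. Grouping by the language a grammar is written in is what lets the translation count as primary.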

T, or time span, represents the number of years spanning between the publication of the first and the latest grammar. For instance, my bibliography includes 132 primary works (grammar descriptions) for Swahili. The earliest of these was published in 1850, and the latest in 2006. This gives a time span of 156 years. In order for this number not to inflate the calculations unnecessarily, it needs to be whittled down a bit. Hence I use the square root of the actual time span in the formula.

By adding the total number of primary works (i.e. an index value of 1 for every primary work), a third of the total number of secondary works (i.e. an index value of 1/3 for every secondary work), and the square root of the time span, we get a total index value representing the descriptional density (DD) for that particular language. The lowest DD value any language can have is 1. That is, one grammar book by one author (i.e. W1=1), no second edition (W2=0), plus a zero time span, as there is no time span involved in a single publication (T=0).
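For those who prefer code to prose, the calculation described in the paragraph above can be written as a small Python function. The function and parameter names are my own choices for the illustration; the Swahili figures are the ones quoted earlier (132 primary works, 81 secondary works, first grammar 1850, latest 2006).

```python
import math


def descriptional_density(w1, w2, first_year, last_year):
    """DD = W1 + W2/3 + sqrt(T), where T = last_year - first_year."""
    if w1 < 1:
        raise ValueError("a described language has at least one primary work")
    time_span = last_year - first_year          # T, in years
    return w1 + w2 / 3 + math.sqrt(time_span)


# Swahili: 132 + 81/3 + sqrt(156)
print(round(descriptional_density(132, 81, 1850, 2006), 2))  # 171.49

# The minimum case: one grammar, no secondary works, zero time span
print(descriptional_density(1, 0, 1900, 1900))               # 1.0
```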

The table below shows a list of twenty-two Bantu languages ranked according to their DD values. (Note that some of the figures are slightly revised since I first posted this on my blog. The bibliographical data itself can be viewed/downloaded here.)

(1)  LANGUAGE         CODE  DD VALUE = TOTAL W  ( W1 / W2 )  + SQRT(T)  ( T )
     Swahili          G40     171.49 =  159.00  ( 132 / 81 ) +   12.49  ( 156 )
     Zulu             S42      70.53 =   58.00  ( 42 / 48 )  +   12.53  ( 157 )
     Kikongo          H16      67.30 =   48.67  ( 45 / 11 )  +   18.63  ( 347 )
     Chewa/Nyanja     N31      51.12 =   39.67  ( 31 / 26 )  +   11.45  ( 131 )
     Xhosa            S41      42.15 =   29.00  ( 20 / 27 )  +   13.15  ( 173 )
     Shona            S10      41.63 =   31.00  ( 26 / 15 )  +   10.63  ( 113 )
     Umbundu          R11      39.61 =   22.00  ( 19 / 9 )   +   17.61  ( 310 )
     Setswana         S31      39.08 =   26.00  ( 20 / 18 )  +   13.08  ( 171 )
     Lingala          C30b     37.30 =   26.67  ( 23 / 11 )  +   10.63  ( 113 )
     Luganda          JE15     31.58 =   20.67  ( 17 / 11 )  +   10.91  ( 119 )
     Sesotho          S33      31.45 =   19.00  ( 16 / 9 )   +   12.45  ( 155 )
     Bemba            M42      29.18 =   19.33  ( 15 / 13 )  +    9.85  ( 97 )
     Herero           R31      27.04 =   14.67  ( 13 / 5 )   +   12.37  ( 153 )
     North Sotho      S32      25.91 =   15.00  ( 14 / 3 )   +   10.91  ( 119 )
     Luba-Kasai       L31      25.82 =   15.33  ( 13 / 7 )   +   10.49  ( 110 )
     Kirundi          JD62     24.23 =   14.33  ( 12 / 7 )   +    9.90  ( 98 )
     Kinyarwanda      JD61     21.54 =   12.00  ( 9 / 9 )    +    9.54  ( 91 )
     Sukuma           F21      20.87 =   11.33  ( 11 / 1 )   +    9.54  ( 91 )
     Changana/Tsonga  S53      18.96 =    8.33  ( 8 / 1 )    +   10.63  ( 113 )
     Makhuwa          P31      18.35 =    7.67  ( 7 / 2 )    +   10.68  ( 114 )
     Kikuyu           E51      17.31 =    7.67  ( 7 / 2 )    +    9.64  ( 93 )
     Haya             JE22     14.66 =    6.00  ( 6 / 0 )    +    8.66  ( 75 )

     The codes by the language names come from a referential coding system of the Bantu languages.

The most important figure is that determined by W1, i.e. the number of primary works. Still, notice how the ranking only roughly corresponds to the actual number of grammar descriptions (whether we look at primary works only or primary and secondary works jointly). By combining the number of works with the time span involved, we get a more sophisticated picture of how well-described any given language is than if we had looked only at the number of publications.
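To see how the two rankings drift apart, here is a small Python sketch that ranks a handful of the languages from table (1) first by W1 alone and then by DD. The W1, W2 and T figures are copied from the table; the variable and function names are just my own choices for the example.

```python
import math

# (W1, W2, T) copied from table (1) for a subset of the languages
DATA = {
    "Kikongo": (45, 11, 347),
    "Zulu": (42, 48, 157),
    "Chewa/Nyanja": (31, 26, 131),
    "Shona": (26, 15, 113),
    "Lingala": (23, 11, 113),
    "Xhosa": (20, 27, 173),
    "Umbundu": (19, 9, 310),
}


def dd(w1, w2, t):
    """Descriptional density: W1 + W2/3 + sqrt(T)."""
    return w1 + w2 / 3 + math.sqrt(t)


by_w1 = sorted(DATA, key=lambda name: DATA[name][0], reverse=True)
by_dd = sorted(DATA, key=lambda name: dd(*DATA[name]), reverse=True)

print(f"{'rank':<5}{'by W1':<14}by DD")
for rank, (a, b) in enumerate(zip(by_w1, by_dd), start=1):
    print(f"{rank:<5}{a:<14}{b}")
```

In this subset Kikongo leads on primary works alone, but Zulu overtakes it on DD thanks to its many secondary works, while Xhosa and Umbundu move up past Lingala because of their longer time spans.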

The DD data can also be visualised as follows:

[Figure: DD Graph]

There are various interpretations possible of the DD data, esp. as it appears in the graph above. For instance, most languages in the sample form a lump at the bottom of the graph (which is where most Bantu languages would end up, if we were to increase the number of sampled languages). It is a crude visualisation of the need for more descriptive works.

The visualisation can also be taken to indicate that some languages (e.g. Nyanja N31, Zulu S42, and especially Swahili G40) display a seemingly healthy growth of descriptions in comparison to other Bantu languages. That is, the intellectual attention paid to the grammar of these languages seems vibrant, while the attention paid to languages such as Herero R31, Sesotho S33, Tswana S31, Xhosa S41, and especially Umbundu R11 and Kikongo H16 is more or less clearly lagging behind. These languages have been described for a longer period of time (they are further to the right in the graph) but have fewer works/titles devoted to them (and therefore appear lower vertically).
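One rough way to put numbers on "further to the right but lower" is to divide the total W value by the time span. Note that this ratio is not part of the DD formula; it is only a hypothetical illustration, sketched in Python, and the name `works_per_decade` is made up for the purpose. The figures are the total W values and T values from table (1).

```python
# (total W value, T) copied from table (1)
POINTS = {
    "Swahili": (159.00, 156),
    "Zulu": (58.00, 157),
    "Chewa/Nyanja": (39.67, 131),
    "Xhosa": (29.00, 173),
    "Setswana": (26.00, 171),
    "Sesotho": (19.00, 155),
    "Herero": (14.67, 153),
    "Kikongo": (48.67, 347),
    "Umbundu": (22.00, 310),
}


def works_per_decade(total_w, t):
    """Illustrative ratio only: weighted works per ten years of time span."""
    return 10 * total_w / t


ranked = sorted(POINTS, key=lambda n: works_per_decade(*POINTS[n]), reverse=True)
for name in ranked:
    total_w, t = POINTS[name]
    rate = works_per_decade(total_w, t)
    print(f"{name:<13} T={t:>3}  W={total_w:>6.2f}  per decade={rate:.2f}")
```

On this crude measure Swahili, Zulu and Nyanja come out on top, and Umbundu ends up last, which is at least in line with the impression the graph gives.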

The situation is particularly severe in the case of Umbundu R11 and Kikongo H16, the two oldest recorded Bantu languages. Their first descriptions appeared more than 300 years ago. Hence in descriptive terms, they are roughly twice as old as any other Bantu language. Theoretically, they should both top the ranked list, but they don't. Kikongo is at 3, Umbundu at 7.

[Figure: DD Graph II]

The obvious implication here is that we should ensure that as few languages as possible are led down the Kikongo Path, and that they are instead helped to climb the Swahili Ladder. (Apologies for the crude poetry, but I couldn't resist.) The visualised DD ranking can potentially aid us in that quest. For instance, while the DD numbers indicate that Herero R31, Sesotho S33, Tswana S31, and Xhosa S41 are relatively well-described languages, the graph identifies them as potential risk-languages.

As mentioned, I have only looked at grammar descriptions. For a more comprehensive look, we would need to add data also about dictionaries, but that's a project for another sleepless night.
