BANTU LANGUAGES & DESCRIPTIONAL DENSITY
Revised 2008-09-30
I have recently had some fun with bibliographical data. Specifically, I have tried to determine a simple way to calculate the "descriptional density" for various African languages, especially with regard to grammar descriptions.
Descriptional density (a concept I've invented myself, I think) aims to determine how well-described any given language is in terms of existing grammar books and dictionaries. For instance, if one language has only one grammar book written about it, and another language has fourteen, then obviously the latter language is better described than the former. In other words, its descriptional density is higher.
Clearly there are several factors that would be relevant to consider when trying to calculate the descriptional density for any given language:
1. Number of publications or titles. Two grammars are better than one.
2. Size of description. In general, a 400-page grammar is better than a 40-page sketch, all else being equal.
3. Number of authors involved. Two independently written grammars offer a "wider latitude" (i.e. more points of view) than two books written by the same author.
4. Number of varieties described. Two grammars of two separate dialects cover more than two grammars of the same dialect.
5. Number of years during which descriptions have been produced. Two descriptions from two separate time periods are essentially equal to descriptions of two separate varieties, and thus cover more structural data, all else being equal.
6. Impact (distribution and availability) of the description(s). In general, a published grammar reaches a wider audience than an unpublished manuscript. By reaching a wider audience, a description is read and evaluated by more people, which establishes its credibility. (Admittedly this is only indirectly relevant, but relevant nonetheless.)
There are no doubt other factors that could or should be taken into account, but the above are the most relevant ones, as far as I can see. However, most of the factors listed above are difficult to operationalize in simple ways. For instance, the size of a grammar book (factor 2) is not necessarily related to its inherent usefulness, quality or even comprehensiveness. The number of authors (factor 3) is problematic to calculate in the case of anonymous works, incl. those credited to so-called corporate authors, and it is unclear whether main informants (assuming they are mentioned at all) should be considered co-authors or not. Keeping track of whichever dialectal variety a grammar book deals with (factor 4) is well-nigh impossible, since many grammars either don't give such specifications or focus on standardized (artificial) varieties. The availability of an item (factor 6) is difficult to determine easily, since many published books, especially older ones, can be just as difficult to access as some old manuscripts, while some manuscripts (and many old books) can easily be accessed via the internet.
There are seemingly only two factors that can be used in calculations of descriptional density without stumbling onto major difficulties, namely, the number of titles or works (factor 1) and the amount of time over which a given language has been published about, or time span (factor 5). For these two factors, I will use the symbols W and T, respectively.
Briefly, if we add a value corresponding to the number of titles (a W value) to another value corresponding to the time span (a T value), we should get a sensible value representing the descriptional density of a given language. (Even though this means discarding several seemingly relevant factors, a calculation based on only W and T should still give us a reasonably informative result.)
However, simply adding the number of titles to the number of years will not do, for various reasons. Instead we need to recalculate these figures into something useful. The mathematical formula itself looks like this (and is explained more fully below):
DD = W1 + (W2 / 3) + √T
In general, one grammar book equals a W value of 1. However, many grammar books appear in second, third, fourth, and further editions. It would seem unintuitive to give a second edition the same weight as a first edition. After all, it's still essentially the same book, albeit with some minor or major revisions. Hence we need to distinguish primary works (W1) from secondary works (W2). While primary works (e.g. first editions) are given a full value of 1, secondary works (e.g. non-first editions) are given only 1/3 (one third). This latter value was chosen arbitrarily; indeed, it could have been almost any other number or fraction.
Secondary works (or the W2 stuff) include more than just second editions. For instance, Hawkinson published three separate titles on Tanzanian Swahili in 1979, namely, Tanzanian Swahili: grammar handbook, Tanzanian Swahili: special skills handbook and Tanzanian Swahili: communication and culture handbook. They all belong together. The most intuitive way to handle these three titles is to count only one of them as primary and the others as secondary. (Alternatively, they could all be counted as part of a single title.)
Similarly, when one and the same author (or author constellation) has published several grammars of one and the same language, I have regarded only one of them as primary, and the others as secondary. In effect, this means that W is not only an indication of the total number of works (factor 1) but partly also of the number of authors involved (factor 3).
Moreover, I have counted translations as separate primary works, the rationale being that one English and one French grammar are better than two English grammars. If for no other reason, at least they widen the intended audience; cf. factor 6 above. (I'm not convinced this is a good idea, but I can't think of a better way to deal with translations at present.)
T, or time span, represents the number of years spanning between the publication of the first and the latest grammar. For instance, my bibliography includes 132 primary works (grammar descriptions) for Swahili. The earliest of these was published in 1850, and the latest in 2006. This gives a time span of 156 years. In order for this number not to inflate the calculations unnecessarily, it needs to be whittled down a bit. Hence I use the square root of the actual time span in the formula.
By adding the total number of primary works (or, 1 index value for every primary work), with a third of the total number of secondary works (i.e. 1/3 index value for every W2, W3, etc.), and the square root of the time span, we get a total index value representing the descriptional density (DD) for that particular language. The lowest DD value any language can have is 1. That is, one grammar book by one author (i.e. W1=1), no second edition (W2=0), plus a zero time span as there is no time span involved in a single publication (T=0).
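The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `descriptional_density` and its parameter names are my own choices, not part of the original method description. The Swahili figures are the ones quoted in the text.

```python
import math


def descriptional_density(w1: int, w2: int, t: int) -> float:
    """Return DD = W1 + W2/3 + sqrt(T).

    w1: number of primary works (at least 1)
    w2: number of secondary works (later editions, companion volumes, ...)
    t:  years between the first and the latest publication
    """
    if w1 < 1 or w2 < 0 or t < 0:
        raise ValueError("need w1 >= 1, w2 >= 0 and t >= 0")
    return w1 + w2 / 3 + math.sqrt(t)


# Swahili: 132 primary works, 81 secondary works, time span 1850-2006.
swahili = descriptional_density(132, 81, 2006 - 1850)
print(f"Swahili DD: {swahili:.2f}")

# The lowest possible value: one grammar, no later editions, no time span.
minimum = descriptional_density(1, 0, 0)
print(f"Minimum DD: {minimum:.2f}")
```

The 1/3 weight is hard-coded here because that is the value used in the text; since it was chosen arbitrarily, it could just as well be exposed as a parameter.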
The table below shows a list of twenty-two Bantu languages ranked according to their DD values. (Note that some of the figures are slightly revised since I first posted this on my blog. The bibliographical data itself can be viewed/downloaded here.)
(1)

| Language | DD value | = Total W value | (W1 / W2) | + √T value | (T, in years) |
|---|---|---|---|---|---|
| Swahili G40 | 171.49 | 159.00 | (132 / 81) | 12.49 | (156) |
| Zulu S42 | 70.53 | 58.00 | (42 / 48) | 12.53 | (157) |
| Kikongo H16 | 67.30 | 48.67 | (45 / 11) | 18.63 | (347) |
| Chewa/Nyanja N31 | 51.12 | 39.67 | (31 / 26) | 11.45 | (131) |
| Xhosa S41 | 42.15 | 29.00 | (20 / 27) | 13.15 | (173) |
| Shona S10 | 41.63 | 31.00 | (26 / 15) | 10.63 | (113) |
| Umbundu R11 | 39.61 | 22.00 | (19 / 9) | 17.61 | (310) |
| Setswana S31 | 39.08 | 26.00 | (20 / 18) | 13.08 | (171) |
| Lingala C30b | 37.30 | 26.67 | (23 / 11) | 10.63 | (113) |
| Luganda JE15 | 31.58 | 20.67 | (17 / 11) | 10.91 | (119) |
| Sesotho S33 | 31.45 | 19.00 | (16 / 9) | 12.45 | (155) |
| Bemba M42 | 29.18 | 19.33 | (15 / 13) | 9.85 | (97) |
| Herero R31 | 27.04 | 14.67 | (13 / 5) | 12.37 | (153) |
| North Sotho S32 | 25.91 | 15.00 | (14 / 3) | 10.91 | (119) |
| Luba-Kasai L31 | 25.82 | 15.33 | (13 / 7) | 10.49 | (110) |
| Kirundi JD62 | 24.23 | 14.33 | (12 / 7) | 9.90 | (98) |
| Kinyarwanda JD61 | 21.54 | 12.00 | (9 / 9) | 9.54 | (91) |
| Sukuma F21 | 20.87 | 11.33 | (11 / 1) | 9.54 | (91) |
| Changana/Tsonga S53 | 18.96 | 8.33 | (8 / 1) | 10.63 | (113) |
| Makhuwa P31 | 18.35 | 7.67 | (7 / 2) | 10.68 | (114) |
| Kikuyu E51 | 17.31 | 7.67 | (7 / 2) | 9.64 | (93) |
| Haya JE22 | 14.66 | 6.00 | (6 / 0) | 8.66 | (75) |

The codes by the language names come from a referential coding system of the Bantu languages. 
The most important figure is that determined by W1, i.e. the number of primary works. Still, notice how the ranking only roughly corresponds to the actual number of grammar descriptions (whether we look at primary works only or at primary and secondary works jointly). By combining the number of works with the time span involved, we get a more sophisticated picture of how well-described any given language is than if we had looked only at the number of publications.
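To see how the two rankings diverge, here is a small sketch that sorts the six top-ranked languages once by W1 alone and once by DD. The figures are copied from table (1); the helper name `dd` and the variable names are merely illustrative.

```python
import math

# (language, W1, W2, T) for the six top-ranked languages in table (1).
rows = [
    ("Swahili", 132, 81, 156),
    ("Zulu", 42, 48, 157),
    ("Kikongo", 45, 11, 347),
    ("Chewa/Nyanja", 31, 26, 131),
    ("Xhosa", 20, 27, 173),
    ("Shona", 26, 15, 113),
]


def dd(w1, w2, t):
    """DD = W1 + W2/3 + sqrt(T), as defined in the text."""
    return w1 + w2 / 3 + math.sqrt(t)


# Ranking by number of primary works only.
by_w1 = [row[0] for row in sorted(rows, key=lambda r: r[1], reverse=True)]

# Ranking by descriptional density.
by_dd = [row[0] for row in sorted(rows, key=lambda r: dd(*r[1:]), reverse=True)]

print("by W1:", by_w1)
print("by DD:", by_dd)
```

In this subset, Kikongo has more primary works than Zulu, and Shona more than Xhosa, yet in both pairs the order is reversed in the DD ranking, because secondary works and time span also contribute.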
The DD data can also be visualised as follows:
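(2) [Graph: the sampled languages plotted with time span on the horizontal axis and number of works on the vertical axis.]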

There are various interpretations possible of the DD data, esp. as it appears in the graph above. For instance, most languages in the sample form a lump at the bottom of the graph (which is where most Bantu languages would end up, if we were to increase the number of sampled languages). It is a crude visualisation of the need for more descriptive works.
The visualisation can also be taken to indicate that some languages (e.g. Nyanja N31, Zulu S42, and especially Swahili G42) display a seemingly healthy growth of descriptions in comparison to other Bantu languages. That is, the intellectual attention paid to the grammar of these languages seems vibrant, while the attention paid to languages such as Herero R31, Sesotho S33, Tswana S31, Xhosa S41, and especially Umbundu R11 and Kikongo H16 is more or less clearly lagging behind. These languages have been described for a longer period of time (they are further to the right in the graph) but have fewer works/titles devoted to them (and therefore appear lower vertically).
The situation is particularly severe in the case of Umbundu R11 and Kikongo H16, the two oldest recorded Bantu languages. Their first descriptions appeared more than 300 years ago. Hence in descriptive terms, they are roughly twice as old as any other Bantu language. Theoretically, they should both top the ranked list, but they don't. Kikongo is at 3, Umbundu at 7.
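One rough way to put numbers on this contrast is to divide each language's total W value by its time span. Note that this "works per decade" rate is my own illustrative measure, not part of the DD formula, and the function name `works_per_decade` is hypothetical. The figures are taken from table (1) for the languages named above.

```python
# (language, total W value, T in years), copied from table (1).
rows = [
    ("Swahili", 159.00, 156),
    ("Zulu", 58.00, 157),
    ("Chewa/Nyanja", 39.67, 131),
    ("Xhosa", 29.00, 173),
    ("Setswana", 26.00, 171),
    ("Sesotho", 19.00, 155),
    ("Herero", 14.67, 153),
    ("Kikongo", 48.67, 347),
    ("Umbundu", 22.00, 310),
]


def works_per_decade(w_total, t_years):
    """Illustrative rate: weighted works per ten years of time span."""
    return 10 * w_total / t_years


rates = {name: works_per_decade(w, t) for name, w, t in rows}

# List the languages from the highest rate to the lowest.
ranked = sorted(rates, key=rates.get, reverse=True)
for name in ranked:
    print(f"{name:<14}{rates[name]:6.2f}")
```

On this crude measure, Swahili, Zulu and Nyanja come out on top of the subset and Umbundu at the bottom, which matches the impression given by the graph. It says nothing about when within the time span the works appeared, so it should be read as a sketch only.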
(3) 

The obvious implication here is that we should ensure that as few languages as possible are led down the Kikongo Path, and that they are instead helped to climb the Swahili Ladder. (Apologies for the crude poetry, but I couldn't resist.) The visualised DD ranking can potentially aid us in that quest. For instance, while the DD numbers indicate that Herero R31, Sesotho S33, Tswana S31, and Xhosa S41 are relatively well-described languages, the graph identifies them as potential risk languages.