First Coda Analysis


The largest coda analysis I could find was on the much-studied Group of Seven off Dominica. Unlike among the sperm whales of the South Pacific, single-unit pods are both common and persistent here, so a cleaner sample of coda can be taken, one less influenced by the individual characteristics of each animal. I found what I wanted in the doctoral thesis of Mike van der Schaar. Here is how the results were presented.

Coda categorise best if they are normalised in their timing. Normalisation may sound ad hoc, but it has a solid basis. In almost all animal vocalisation, including human language, some individuals consistently transmit faster than others (there are fast talkers and slow talkers). This methodology is validated by its consistent success at grouping sperm whale coda types into clusters.
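As a minimal sketch of what timing normalisation involves (my own illustration; the thesis's actual procedure may differ in detail), the inter-click intervals of a coda are divided by the coda's total duration, so that a fast talker and a slow talker with the same rhythm map to the same point:

```python
def normalise_coda(click_times):
    """Scale a coda's inter-click intervals (ICIs) to sum to 1."""
    icis = [b - a for a, b in zip(click_times, click_times[1:])]
    total = sum(icis)
    return [ici / total for ici in icis]

# Two hypothetical whales producing the same 5-click rhythm at
# different speeds (click times in seconds):
fast = [0.0, 0.10, 0.20, 0.30, 0.50]
slow = [0.0, 0.15, 0.30, 0.45, 0.75]

# After normalisation both reduce to (roughly) [0.2, 0.2, 0.2, 0.4]:
# the shared rhythm survives and the speed difference is gone.
print(normalise_coda(fast))
print(normalise_coda(slow))
```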

Mike van der Schaar used a modified form of one of the standard methods of classifying normalised sperm whale coda. He then looked at all coda of fewer than ten clicks, which he found divided into 36 types that he labelled as distinct. He tabulated these on pp. 38-39. Only three of these coda types were represented by a sample of more than a hundred. I will use a slightly different form of his notation and call them LS, LR2SR, and 2R, listed in order of their frequency.

Thirty-six distinct types of coda is slightly more than is usually given, but twenty of them had a sample size of less than ten. That would make the vocabulary of coda about typical of the size reported in peer-reviewed articles on the topic, but the author had a problem. When analysed by the Duda-Hart criterion, the splitting condition kept being met, such that these 36 coda types were splitting into 300-odd.

Mike had done his homework. He knew that non-normal clusters can artificially trigger the D-H splitting criterion, and that this may break a cluster in two, but you would not expect it to give this result. The only way I know to create this degree of splitting (always assuming the clusters are not naturally poly-modal) is if each n-dimensional cluster is highly elongated. It is no surprise, then, that Mike gave this as the likely explanation. The two clusters that were splitting the most were LS and 2R. These were breaking down into about 180 distinct variants by Duda-Hart analysis at the 95% confidence level. Even at the 99% confidence level (which Mike made clear he could find no logical justification to employ), these two were still resolving into 85 variants. The splitting was confidently dismissed by the author as artifactual. His reasoning? Visual observations.
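For readers unfamiliar with the test, here is a sketch of the Duda-Hart split rule in the textbook form given in standard references (the thesis used a modified version, which may differ in detail). Je(1) is the within-cluster sum of squared errors before a two-way split, Je(2) the same after it; under the null hypothesis that the cluster is a single spherical Gaussian in d dimensions, the ratio Je(2)/Je(1) should not fall below a critical value:

```python
import math

def duda_hart_split(je1, je2, n, d, z=1.645):
    """Return True if the Duda-Hart criterion says to split.

    je1, je2 : within-cluster SSE before / after the 2-way split
    n        : number of points in the cluster
    d        : dimensionality of the data
    z        : one-sided normal quantile (1.645 ~ 95% confidence,
               2.326 ~ 99%)
    """
    crit = (1 - 2 / (math.pi * d)
            - z * math.sqrt(2 * (1 - 8 / (math.pi ** 2 * d)) / (n * d)))
    return je2 / je1 < crit
```

Note how demanding the rule is in one dimension: with n = 100 and d = 1 the critical ratio is about 0.26, so a split has to eliminate roughly three quarters of the within-cluster error before the criterion fires.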

Naturally I assumed that "visual observation" referred to an inspection of a graphical display of the data points, yet, on reading further, I found it meant a visual inspection of the pod. To him, seven whales couldn't be producing two words in 182 distinct and separate ways – that would require 182 different individual whales! There was only one small problem with all this: LS and 2R were represented in only one dimension!

I should stop for a moment. This was the point at which I was going to delve into pained detail as to why the above applies, but then I realised that would risk losing the forest for the trees.

I think I need no explanation, even for the non-mathematically minded, of why symmetric unimodal clusters can't be non-spherical in one dimension. Furthermore, as we reduce the number of dimensions, things become simpler, such that the standard metrics of such analyses have fewer ways in which they can fail. Duda-Hart analysis doesn't just assume spherical clusters; it also assumes a normal distribution. A spherical but non-normal distribution can also break up clusters. We would not expect it to break them up with this degree of efficiency, but it is usually hard to rule the possibility out without much work. Once more we have a situation where, as the number of dimensions gets higher, there are more ways a strange distribution can affect our results.
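The one-dimensional claim is easy to check numerically. The sketch below (my own, using the textbook form of the criterion, not the thesis's code) takes a clean unimodal one-dimensional cluster, finds the best possible two-way split, and asks whether the rule fires:

```python
import math
from statistics import NormalDist

def best_split_ratio(xs):
    """Je(2)/Je(1) for the best two-way split of 1D data."""
    xs = sorted(xs)

    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)

    je1 = sse(xs)
    je2 = min(sse(xs[:k]) + sse(xs[k:]) for k in range(1, len(xs)))
    return je2 / je1

# A deterministic stand-in for a clean unimodal 1D cluster:
# 200 evenly spaced quantiles of a standard normal distribution.
n = 200
xs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
ratio = best_split_ratio(xs)

# Duda-Hart critical value at d = 1, 95% confidence (z = 1.645):
crit = 1 - 2 / math.pi - 1.645 * math.sqrt(2 * (1 - 8 / math.pi ** 2) / n)

# Even the best split leaves the ratio well above the threshold,
# so a genuine 1D unimodal cluster stays in one piece.
print(ratio, crit)
```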

But when we get all the way down to 1D, there are only three ways that data can trigger this degree of D-H splitting: the coda recording device has a systematic fault or embedded truncation error, the analytic software is faulty, or … they are real.
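The first of those possibilities is worth making concrete. A purely hypothetical sketch (not a claim about the actual equipment): if a recorder timestamps clicks on a coarse internal clock, truncation turns a smooth one-dimensional cluster of intervals into a comb of discrete spikes, exactly the kind of structure a splitting criterion will happily carve into "variants":

```python
import random

random.seed(0)  # reproducible illustration

# 500 inter-click intervals from one smooth cluster:
# mean 0.30 s, standard deviation 10 ms.
smooth = [random.gauss(0.30, 0.01) for _ in range(500)]

# Hypothetical recorder that truncates timestamps to a 5 ms clock:
step = 0.005
truncated = [step * int(x / step) for x in smooth]

# The smooth cluster collapses onto a handful of discrete values,
# each of which can look like its own tight sub-cluster.
print(len(set(smooth)), "->", len(set(truncated)))
```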

To me, this is the true beginning of the investigation.