I’m currently transliterating/translating an index to persons mentioned in Yāqūt al-Ḥamawī’s famous Muʿjam al-Buldān. It’s wonderful working with encyclopedic texts; I have had the opportunity to explore everything from Parthian (Arsacid) rulers and ancient Arab battles over horses to love poets and hadith transmitters. But today I came across a very curious entry in the index:
البخاري (محمد بن إسماعيل بخت نصر): (1) 208، 257، 310، 376، 384، 479 (2) 91، 329، 331 (3) 272، 281، 400 (4) 78 (5) 140، 410، 453
which translates to:
al-Bukhārī (Muḥammad b. Ismāʿīl Bukht Naṣar): I: 208, 257, 310, 376, 384, 479; II: 91, 329, 331; III: 272, 281, 400; IV: 78; V: 140, 410, 453
The oddity is that “al-Bukhārī” (d. 870 CE) is perhaps the most famous transmitter of sayings ascribed to Muhammad, while “Bukht Naṣar” is the Arabic spelling of Nebuchadnezzar (6th C BCE). What’s going on here?
First, some back story: this is part of my work developing the Historical Index of the Medieval Middle East (HIMME). I have selected a small number of fabulously informative texts in various languages as a starting package, and I rely heavily on indices already compiled in order to create a “union index” across language boundaries. The very generous Open Islamicate Texts Initiative (OpenITI) has made available thousands of OCR’ed Arabic texts, and in the case of Yāqūt al-Ḥamawī’s Muʿjam al-Buldān, the 1995 edition which they made available has an anonymous index! Yāqūt has always been my go-to reference for place-names, but inspecting the person index revealed great wealth on that side as well. So indexing this one Arabic text in HIMME will contribute primary source references to over 15,000 places and over 12,000 persons.
But one of the goals of HIMME is to make the riches available in Middle Eastern literatures visible to people who do not yet know the languages, so I needed to itemize the index so that I could unite it with other indices, and to Romanize it so that an Anglophone audience could use it. Within a few minutes I had a spreadsheet with columns for a unique identifier for each entry, the Arabic entry, the transliterated entry, the page numbers, and the always necessary notes field. Supplied with a list of Arabic strings in the index sorted by descending frequency, I set about transliterating, starting from the most common Arabic names (Ibn, Abū, Muḥammad, etc.). It is very satisfying to “replace all” of 3,799 instances of Muḥammad (after, of course, dealing with super-strings, that is, longer words that contain the same letters, in case any of you were worried). In this manner I transliterated every Arabic term that occurred in the index three or more times (1,953 words), and now I’m working through the terms that occur only once or twice (5,310 words). Almost 70% of the person index is completely transliterated.
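For readers who want to try something similar, here is a minimal sketch of the frequency list and the “replace all” step. It is not the spreadsheet workflow I actually used: the function names and the tiny mapping table are hypothetical, and it assumes that index entries can be split into words on whitespace. Replacing whole words only is one way to sidestep the super-string problem.

```python
from collections import Counter

# Hypothetical mapping from Arabic words to Romanized forms (illustrative only).
TRANSLIT = {
    "محمد": "Muḥammad",
    "بن": "b.",
    "إسماعيل": "Ismāʿīl",
}

def token_frequencies(entries):
    """Return (word, count) pairs sorted by descending frequency."""
    counts = Counter(tok for entry in entries for tok in entry.split())
    return counts.most_common()

def transliterate(entry, table):
    """Replace whole whitespace-delimited words only.

    Words missing from the table are left in Arabic, so a longer word
    that merely contains a known name is never partially replaced.
    """
    return " ".join(table.get(tok, tok) for tok in entry.split())

sample = ["محمد بن إسماعيل", "محمد بن محمد"]
most_common_word = token_frequencies(sample)[0]
romanized = transliterate(sample[0], TRANSLIT)
```

By contrast, a plain `str.replace` on the whole entry would also rewrite the first four letters of a longer word such as محمدي, leaving a half-Romanized hybrid.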
As with the product of any human process, the index contains some irregularities, and I’ve noticed a few glitches. Sometimes the volume or page numbers are incorrect (usually off by one), and sometimes a dot has been misplaced. More seriously, in a number of cases two entries have been run together, with no page numbers given for the first entry. For example, an entry might read (in transliteration) “Muḥammad b. Muḥammad b. ʿAbd Allāh al-Tamīmī Muḥammad b. Muḥammad b. ʿAlī al-Marwazī (3) 190.” This is actually two entries for two different Muḥammads, and the page reference belongs only to the second, so if I wish to supply any pages for the first, I need to search for them. (Fortunately, searching the text is not too slow, even in 11M text files.) Regrettably, I do not have access to the precise edition from which the scan was made, so it is impossible to determine exactly what was printed, although I have PDFs of other editions of the text to use. And some things are not glitches, just standard elements of a print index which are not needed in a searchable index, such as cross-references. An Arabic name is complex, so someone might look up al-Tamīmī without knowing what his ism (roughly “first name”) was, and if the index is arranged by ism, the reader needs a cross-reference to find the list of pages relevant to the entry that is sought.
So in this case, the original index as given in the text file I downloaded from OpenITI reads:
البخاري اسمه محمد بن إسماعيل بخت نصر (1) 208، 257، 310، 376، 384 (2) 91،
~~329، 331، (3) 272، 281، 400 (5) 140، 410، 453
(The two tildes at the start of the second line are the line division mark in these texts.) Notice some difficulties with RTL display, but the character sequence is correct. My few minutes of regular expressions that transformed the text file into a spreadsheet evidently failed to note that this is actually two entries, the first a cross-reference (“al-Bukhārī, his name is Muḥammad b. Ismāʿīl”), and the second, one with page numbers (“Bukht Naṣar…”). At some point early in my processing, before I became aware of the cross-reference format, I must have noticed the “his name is” but presumed that the page numbers went with al-Bukhārī, and in the absence of separating punctuation, I replaced “his name is” with parentheses around the rest of the Arabic. This mistake on my part created the nonsense that Nebuchadnezzar was one of the most famous transmitters of quotations of Muḥammad.
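In hindsight, a few more lines of code could have flagged this entry for a human to look at. The following is a minimal sketch rather than the regular expressions I actually ran: the function names are hypothetical, and it assumes that volume numbers always appear in parentheses, that a continuation line begins with two tildes, and that a genuine cross-reference containing اسمه (“his name is”) carries no page numbers of its own.

```python
import re

CONTINUATION = "~~"   # line division mark in the text file
XREF_MARKER = "اسمه"  # "his name is", used in cross-references

# One volume group: "(3)" followed by everything up to the next parenthesis.
VOLUME_GROUP = re.compile(r"\((\d+)\)([^()]*)")

def join_lines(lines):
    """Attach each continuation line to the entry that precedes it."""
    entries = []
    for line in lines:
        if line.startswith(CONTINUATION) and entries:
            entries[-1] += " " + line[len(CONTINUATION):]
        else:
            entries.append(line)
    return entries

def parse_entry(entry):
    """Split an entry into its name, {volume: [pages]}, and a warning flag."""
    head, paren, tail = entry.partition("(")
    name = head.strip()
    refs = {}
    for vol, pages in VOLUME_GROUP.findall(paren + tail):
        refs[int(vol)] = [int(p) for p in re.findall(r"\d+", pages)]
    # A cross-reference followed by page numbers may be two entries run together.
    suspicious = XREF_MARKER in name.split() and bool(refs)
    return name, refs, suspicious

raw = [
    "البخاري اسمه محمد بن إسماعيل بخت نصر (1) 208، 257، 310، 376، 384 (2) 91،",
    "~~329، 331، (3) 272، 281، 400 (5) 140، 410، 453",
]
entries = join_lines(raw)
name, refs, suspicious = parse_entry(entries[0])
```

On the two lines quoted above, `suspicious` comes out true, which is only a prompt to go and read the entry. The code cannot tell where the cross-reference ends and Bukht Naṣar begins; that still takes someone who recognizes the names.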
As I was working through the rare words, I had not yet transliterated “Bukht,” so I looked at this entry, and I thought Bukht Naṣar an odd nickname for the father of the famous al-Bukhārī. That made me suspicious enough to look more closely, and I remembered that this was the Arabic spelling of Nebuchadnezzar, which had not occurred to me the first time. Then I noticed that Bukht Naṣar would in fact fit alphabetically between al-Bukhārī and the subsequent entry, so it could be a case of entry conflation. I then stepped through all the instances of Bukht Naṣar in the file in order to confirm that this long line of pages all referred to Nebuchadnezzar rather than to al-Bukhārī. (In the process, I learned that III:272 should be II:272, and II:91 should be II:90, and I found a few other references missed by the original indexer!) Then I checked whether al-Bukhārī had a separate entry (with a list of pages) under his ism Muḥammad b. Ismāʿīl, and was relieved to find that he did, so that I did not need to recreate that entry entirely. (According to the index, al-Bukhārī is cited on 51 different pages.) Finally, I did what I should have done first, which was to consult the original version of the index rather than the version I had processed, and that is what clarified that the fault in this case was my own rather than a typographer’s.
So the lessons for today are (1) all data are messy and need to be cleaned; (2) computers are not magic, and they do exactly what you tell them to, even when that is nonsensical; (3) most errors are human-generated, and computers will not fix them; (4) when transforming data, you need to understand it so as not to modify it in silly ways; (5) when transforming data, you need to be alert to possible errors that need to be corrected; and (6) always do version control! Otherwise you might inadvertently assert that a Neo-Babylonian king compiled a particularly well-known collection of quotations allegedly from someone who lived twelve hundred years after he did!