Monday, July 20, 2020

Exploring historical research with modern translation tools

by Mary Harrsch © 2020

As an education technologist, I used computer-assisted translation for a couple of decades, with mixed success, whenever I was faced with research reports I needed to analyze in a language other than English.  Over the last ten years, however, I found the new tools introduced by Google and by DeepL GmbH, a German company based in Cologne, to be increasingly accurate.  This advancement was of particular interest to me when I decided to focus on Roman archaeology and the early excavations in Pompeii during the 18th and 19th centuries in my "second act" after retiring from the university.

First, a short history of machine translation

Machine translation dates as far back as the 9th century CE to an Arabic cryptographer named Al-Kindi who developed techniques for systemic language translation, including cryptanalysis, frequency analysis, and probability and statistics, which are used in modern machine translation.  Other ideas for machine translation were proposed during the Renaissance by such thinkers as René Descartes, who, in 1629, proposed a universal language, with equivalent ideas in different tongues sharing one symbol.
The first patents for translating machines using an automatic bilingual dictionary and paper tape were applied for by Georges Artsrouni in the mid-1930s.  But progress was limited until the late 1950s when some of the first computers were developed.  In the meantime, successes in code breaking during World War II and theories about the universal principles underlying natural language, coupled with computer development, prompted new proposals based on evolving information theory.

A replica of Alan Turing's "bombe" machine used to crack the code used on German Enigma encoding devices during World War II.  Image courtesy of Wikimedia Commons.

Despite modest successes, though, the problem of semantic ambiguity, where words or phrases have more than one meaning, plagued the development of high quality machine translation.  Then, in 1966, the U.S.-commissioned Automatic Language Processing Advisory Committee published the ALPAC report, stating that machine translation was more expensive, less accurate and slower than human translation, and that despite research expenditures, machine translation was not likely to reach the quality of a human translator in the near future.  The report decimated funding, and machine translation was virtually abandoned in the U.S., and to a lesser extent in the Soviet Union and the U.K., for more than ten years. In Canada, France and Germany, however, research continued, while in the U.S. two companies were eventually founded to provide automated translation services for the Department of Defense.

As computers became more powerful, however, machine translation capabilities increased dramatically.  Models based on statistical and example-based machine translation were adapted for online use to facilitate global communication and to provide rapid translation of even technical documents and texts through such free tools as Google Translate and DeepL.

The Häuser in Pompeji Project


During this time I became increasingly interested in learning more about the archaeological remains of a 1st century Roman structure, dubbed the House of the Prince of Naples, in Pompeii. But I quickly discovered that a comprehensive analysis of the remains published in 1984 was available only in German.  I also searched in vain for a digitized version and found that the German hardcopy text, "Häuser in Pompeji (Volume 1): Casa del Principe di Napoli", published by the German Archaeological Institute, was now out of print and that used copies were difficult to locate or prohibitively expensive. I finally found a copy at the University of Washington and requested it through interlibrary loan.  I decided I would attempt to scan the hard copy and use the latest, much improved, online translation tools to produce an English version of the book.
Atrium of the House of the Prince of Naples in Pompeii
courtesy of Carole Raddato

When I was notified the volume had arrived and went to the university library to retrieve it, I was aghast at its size!  Although it is only 52 pages, the Germans apparently wanted to preserve the precise physical scale of the drawings in the book, so it was produced at a size of 19.6 x 13.6 x 1.2 inches. This made it very unwieldy to handle and  too large to scan with my personal oversize scanner or even the ones available at the university.  Not to be deterred, however, I tried reversing the head of my digital camera tripod so I could use it as a copy stand and photograph each page. 

Camera tripod with head reversed to create
a copy stand for a large book

I knew I could then OCR the resulting images and extract the text. My first tests with this approach were successful. To speed up the process, I centered the first right-hand page as precisely as I could under the camera, then photographed all the right-hand pages without having to readjust the book's position between pages.  Then I centered the first left-hand page under the camera and photographed all the left-hand pages.
I initially tried to use a sheet of glass to flatten the page and hold it in place, but the overhead lights created too much glare on the glass. Instead, I pulled the page taut with one hand and used my other hand to trigger the shutter of my Olympus camera remotely, using its Bluetooth connection to my iPhone.

After I photographed all of the pages, I opened the JPEG for each page in Adobe Photoshop.  To produce as much contrast between text and background as possible (the book was printed on cream-colored paper), I converted the image to black and white with Topaz Labs' BW Effects plug-in filter and adjusted the histogram for optimum black and white values using Photoshop's built-in Adobe Camera Raw filter.  I also straightened the image if necessary, then saved it first as a .JPG, to serve as a reference when checking OCR accuracy, and then exported the images as an Adobe PDF. I then opened the PDF in Microsoft Word 2016 (the latest version I have) and allowed its built-in OCR feature to extract the text and create a Word (DOCX) file.  After proofreading several pages, I noticed that the OCR was very accurate except in the footnotes, where the small font size caused errors (especially with the letters o, e, i and j, and with rn in place of m). So, to reduce my eye strain, I decided to forgo proofing each page word for word, except for the footnotes. I thought I would discover any OCR mistakes in the body of the text anyway when I translated a page and the translator could not make out a word, and that proved to be the case.

Translation finally begins


Now, with the extracted text, I was finally able to begin my translation. I started out using Google Translate, then compared its results with those of the free German translator, DeepL, since I was translating mostly German except for the artifact find summaries that were left in their original 19th century Italian. As I worked, I noticed DeepL seemed to have a slight edge over Google Translate with the German text (understandable), although they were pretty comparable when translating Italian.  DeepL also offers a Windows 10 add-in tool that can be activated by highlighting the passage you wish to translate, then pressing Ctrl+C twice.  This helped to speed things up. The other really nice thing about DeepL is the ability to click on a word that seems odd to you in the English pane and get a popup list of other words it could mean in that particular context.  I was able to make some sentences sound more natural using this feature.  So I began using DeepL as my default translator and used Google Translate to verify particularly awkward passages.  Of course, each time I reached a finds summary in Italian instead of German, I had to translate that portion separately because neither of the translators could switch from one language to another within the same selection.

When translating the Italian find summaries from the late 19th century, however, I also encountered words that were either unique to Italian archaeologists or were no longer in use in modern Italian.  Fortunately, one of my Facebook friends majored in Italian at university and he was quite helpful in teasing out the meanings of some of the terms used.  He told me that part of the problem was that the Italian used in the excavation report appeared to be a regional dialect, and not the modern Tuscan version presently taught in universities.  Furthermore, the Italian archaeologists used special words, such as procoe, lagena, oleare, odorino, cocciopesto and punteggiato regolare, that had no definition in current bilingual dictionaries.

The German archaeologists also used architectural terms that I had not encountered before, such as Lesbian cyma, dentil cornice (that kept being translated as "tooth-cut" cornice) and socle (that kept being translated as "base", which is descriptive of the architectural feature but not the proper term).  So, as I worked, I began compiling a glossary of these terms and researched their definitions, which I later added as an addendum to my English version since the original text did not include a glossary.

The original text also included examples of both Latin and Greek "tituli picti", inscriptions on ancient amphorae and other artifacts, using ancient letters not in my font collection.  For these references, I ended up just photographing them separately and embedding them in the text as images rather than text.
Example of "tituli picti" on an amphora in
The House of the Prince of Naples 

Footnote numbers also interfered with the translation.  At first I removed them, but I wouldn't do that again. Because the placement of adjectives and adverbs is different in German (and Italian) than in English, restoring the footnotes afterwards was difficult: I couldn't just count sentences.

The translators also don't always put words in natural English order, so I would have to read through the snippet I was translating and move adverbs and adjectives to where an English reader would expect them.
To get around the problem with the superscripted footnote numbers that the translators could not understand, I added a space before the footnote number in the German pane of the translator so the translator would not become confused when translating the preceding word and would pass the number intact through to the English pane.  This gave me the indicator I needed to insert a footnote with Word after the translation was completed.
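For anyone who would rather script this step than add the spaces by hand, here is a minimal Python sketch of the same idea. It rests on an assumption of mine: that after OCR a superscripted footnote number shows up as one to three plain digits glued to the end of the preceding word. The function name `space_footnote_numbers` is hypothetical, and the pattern is a rough heuristic, not something I tested against the whole book.

```python
import re

# Assumption: a footnote marker is 1-3 digits stuck directly onto a letter
# and followed by whitespace, closing punctuation, or the end of the text.
FOOTNOTE = re.compile(r"(?<=[^\W\d_])(\d{1,3})(?=\s|[.,;:)]|$)")


def space_footnote_numbers(text):
    """Insert a space between a word and a footnote number glued to it."""
    return FOOTNOTE.sub(r" \1", text)


if __name__ == "__main__":
    print(space_footnote_numbers("Die Wand12 ist bemalt."))
```

Numbers that already stand alone, such as years or room numbers, are left untouched because they are not attached to a letter.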

Capitalization was also an issue, since German capitalizes all nouns, not just proper nouns, and many of these capitals were passed through the translator.  I ended up using Word's search and replace feature after I completed the translation to correct a number of incorrect capitalizations.
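The same cleanup could be sketched in Python instead of Word's search and replace. This is only an illustration under my own assumptions: you supply the list of common nouns to lowercase, and a word counts as starting a sentence if it opens the text or follows a period, exclamation point or question mark. The function name `lowercase_common_nouns` is hypothetical.

```python
import re


def lowercase_common_nouns(text, nouns):
    """Lowercase the listed nouns unless they begin a sentence.

    `nouns` is a set of lowercase words chosen by the user.
    """
    def fix(match):
        word = match.group(0)
        # Inspect the text before the word to see whether it opens a sentence.
        before = text[:match.start()].rstrip()
        at_sentence_start = before == "" or before[-1] in ".!?"
        if word.lower() in nouns and not at_sentence_start:
            return word.lower()
        return word

    # Only consider words with an initial capital followed by lowercase letters.
    return re.sub(r"\b[A-Z][a-z]+\b", fix, text)


if __name__ == "__main__":
    sample = "The Wall of the Atrium has a Cornice. Wall paintings survive."
    print(lowercase_common_nouns(sample, {"wall", "cornice"}))
```

Keeping an explicit noun list, rather than lowercasing every capitalized word, protects proper names such as Pompeii or Strocka.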

The original text was lavishly illustrated and I wanted the English version illustrated as well.  So, I photographed each image separately without its related caption, then translated the German caption and inserted the caption in English with Word.  Some of the images were graphics with German labels that required me to "paint out" the German references in Photoshop and replace them with English using the Text tool.

I decided not to place all the images at the end of the text like Professor Strocka did. Instead, I mapped each image to the appropriate place in the narrative, then allowed Word to assign a figure number and simply referred to the original plate number in the image caption.  I also added some Creative Commons-licensed images that were not available at the time the book was originally published to illustrate some of the finds and painting comparisons made with other artwork elsewhere in Pompeii.
 
I also did not translate or include the original index because a digital version can be searched at any point in the text.  I also did not bother to include the table of grayscale to color matrix for the grisailles produced for the interior paintings. Whenever you photograph a black and white image with a digital camera, the camera is calibrated to produce the image with an algorithm to create neutral gray for the midpoint of the visual scale.  Therefore, the precise gray scale values of my digital images would have been altered and no longer match the matrix included in the original text. 

Professor Strocka spent a great deal of effort describing the painted decorations in each room but I found his verbal descriptions rather difficult to follow. Although I did translate these descriptions and include them in the English version, I think a detailed map of images accompanied by a discussion of possible motivations for mythological content or style classification would be more easily understood.  In the book's comparisons of the decor of the house with other documented iconography elsewhere in Pompeii, I think visual comparisons of actual images would have been more effective as well. That is why I hunted for at least some of those that still exist to augment that section of the text.

With this much effort, did I actually learn anything particularly significant? Definitely!  When I translated the artifact find summaries, I discovered three surgical instruments and a mortar and pestle were found in the original excavations. I also learned of the intriguing find of human skeletal remains in the cubiculum (bedroom) flanking the main entrance designated as entrance 8.  Further research revealed the existence of a list of houses in Pompeii where surgical instruments were found and, researchers suspect, may have been homes of physicians.  The House of the Prince of Naples was not on the list and appears to have been overlooked. 

Professor Strocka's team clearly focused on the construction aspects of the house and on full documentation of any surviving decorations.  Household inventory, however, was not evaluated, but simply included from original late 19th century excavation records.  I discovered through additional research that small finds, especially those of a non-luxury nature, were viewed with little interest by archaeologists during the late 1800s.  Furthermore, the House of the Prince of Naples was far from an undisturbed site. Three so-called "robber" holes were found in the cubiculum containing the skeleton when it was eventually excavated in 1896-1898.  Artifacts could have been carried away by salvaging owners shortly after the eruption, or by looters, either in antiquity or in modern times, who could have pilfered more valuable objects. The other potential "contamination" of the finds was the staged "excavation" by the Prince and Princess of Naples in 1898. We are led to believe the original excavators merely suspected the presence of finds in certain rooms of the structure and left them in situ for discovery by the royals.  But we must recognize the obvious connections between wealthy patrons and the archaeologists who desired to continue site exploration. If the finds were, in fact, "planted" for the royals to find, it is of no consequence if they were originally found in the structure by the original excavation team, although it would reduce the value of any analysis of find assemblages and room function.  If, however, the finds were supplemented from the substantial inventory of finds recovered previously from other structures in Pompeii and had no connection with this structure, future comparisons of this structure with others possibly occupied by residents engaged in a similar occupation or in a similar social position would be tainted.

At the very least, the project turned up apparently overlooked information that could seed further research.  The results of this project can be reviewed here in .pdf form:


The Challenge Continues


Recently, I have begun another project, an article about the excavation of the House of Sallust (originally the House of Acteon) conducted between 1805 and 1809.  The excavation report was published in Italian as Pompeianarum Antiquitatum Historia. Leaping back in time roughly another 90 years from the excavation reports of my first translation, though, I have found this excavation journal even more challenging.  Some of the words used in it are unknown to modern translators.  On top of this, of course, are the special names given to the recovered objects by the early archaeologists that have no modern equivalent.  I also discovered spelling differences that cause translation problems.  Some words in the early 1800s are spelled with a "j" instead of an "i", such as operaj, which is now operai (workers), or caldaja, which is now spelled caldaia (boiler).  Fortunately, when the translation fails, I can usually isolate the word that is problematic, and sometimes the Italian-English dictionary will find a word that is the closest match so I can tease out the meaning from it.
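The "j" for "i" substitution is regular enough in my two examples that it could be tried automatically before handing a passage to the translator. The Python sketch below is a guess at a rule based only on operaj and caldaja: it treats a lowercase "j" as an old "i" when it follows a vowel and either ends the word or comes before another vowel. I have not checked this rule against the rest of the excavation journal, and the function name `modernize_j` is hypothetical.

```python
import re

# Heuristic drawn from two examples only (operaj, caldaja): a lowercase "j"
# after a vowel, at the end of a word or before another vowel, becomes "i".
OLD_J = re.compile(r"(?<=[aeiou])j(?=[aeiou]|\b)")


def modernize_j(text):
    """Replace the older 'j' spelling with 'i' where the heuristic applies."""
    return OLD_J.sub("i", text)


if __name__ == "__main__":
    for word in ("operaj", "caldaja"):
        print(word, "->", modernize_j(word))
```

Any word the rule changes is still worth checking in the Italian-English dictionary, since a two-example pattern is bound to misfire somewhere.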

At least with this project, I did not have to photograph and OCR anything.  I found a copy on Google Play and can highlight a section of the text, select COPY from the popup menu, then paste it into DeepL or Google Translate.  The only problem this causes is that words hyphenated at the end of a line are output with a space instead of a hyphen, so after I paste the section into DeepL I have to go back and check for line breaks in the original text, find them in the DeepL copy and remove the space so the translator identifies the complete word.  Google Play also lets you select TRANSLATE from the popup menu, but the result is less than optimal.  It will give you the gist of what is being said but is not as accurate as I need, especially for the lists of small finds that often include specialized words.

The antiquated font used in the original text also causes problems with translation, especially the number 1 (the translators think it's a 4) and fractions, which I have to correct manually.
In the early 19th century, Italy had not yet converted to the metric system either, so measurements are given with the abbreviations on. (oncia, which equals 0.73 in.), pal. (palmo, which I assume is the palmo of Naples at 10.381 in. and not the palmo of the Papal States at the time, which ranges from 8.347 to 8.79 in.) and occasionally min. (minuti = 0.146 in., since 5 min. = 1 on.).
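To avoid redoing this arithmetic for every find, the conversion can be sketched in a few lines of Python. The inch values are simply the ones quoted above, including my assumption that pal. means the palmo of Naples. The names `INCHES_PER_UNIT`, `to_inches` and `to_centimeters` are hypothetical.

```python
# Inch equivalents as quoted in this article; "pal." is assumed to be the
# palmo of Naples rather than the palmo of the Papal States.
INCHES_PER_UNIT = {"pal.": 10.381, "on.": 0.73, "min.": 0.146}
CM_PER_INCH = 2.54


def to_inches(measure):
    """Convert a measurement such as {"pal.": 2, "on.": 3} to inches."""
    return sum(INCHES_PER_UNIT[unit] * qty for unit, qty in measure.items())


def to_centimeters(measure):
    """Convert the same kind of measurement to centimeters."""
    return to_inches(measure) * CM_PER_INCH


if __name__ == "__main__":
    find = {"pal.": 2, "on.": 3}  # an object recorded as 2 pal. 3 on.
    print(round(to_inches(find), 2), "in.")
    print(round(to_centimeters(find), 1), "cm")
```

For the sample measurement of 2 pal. 3 on., this works out to 2 x 10.381 + 3 x 0.73 = 22.952 in.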

As I am primarily focused on the years 1804-1809, I will translate that portion of the text and make it available when I am finished.

Tools needed for a translation project using an original hardcopy:

Adobe Photoshop or image editor with the ability to Export as .PDF

Microsoft Word 2016 or newer (built-in OCR and PDF export capability)

DeepL Translator (free) & DeepL Windows 10 add-in translator (free)

Google Translate (free)

Pre-metric list of Italian units of measurement:

German to English online dictionary (free)

Italian to English online dictionary (free)

Latin to English online dictionary (free)

and some great bi-lingual friends on Facebook!


Machine Translation History References:


DuPont, Quinn (January 2018). "The Cryptological Origins of Machine Translation: From al-Kindi to Weaver". Amodern (8).

Knowlson, James (1975). Universal Language Schemes in England and France 1600-1800. ISBN 978-4-87502-214-5.

White, John S. (31 July 2003). Envisioning Machine Translation in the Information Future: 4th Conference of the Association for Machine Translation in the Americas, AMTA 2000, Cuernavaca, Mexico, October 10-14, 2000 Proceedings. Springer. ISBN 9783540399650.

"Google Switches to Its Own Translation System". 22 October 2007. Retrieved 12 February 2018.