
HTR and LLMs in manuscript analysis: why the black box is not enough

Between automatic text recognition, large language models and source-bound analysis

HTR and LLMs can help make historical manuscripts accessible. For scholarly manuscript analysis, however, smooth textual output is not enough. The decisive question is whether a reading can be bound back to visible evidence and to the concrete model input.

Machine access to historical manuscripts may sound like a straightforward technical problem: an image enters a system, and the system returns text.

This is the basic principle behind many classical HTR workflows.

HTR stands for Handwritten Text Recognition, the automatic recognition of handwritten texts. In many projects the basic scheme is:

Image → text recognition → transcription

This can work impressively well, especially when a system has been trained on a particular script, collection or scribe. For large archival holdings, serial sources, registers or relatively uniform hands, HTR can be a powerful tool.

For historical manuscripts, however, a methodological problem remains:

A good output is not automatically a justified output.

The black-box problem

Many AI and HTR systems ultimately deliver a text. What happens between image input and textual output often remains difficult for the user to verify.

The system sees an image, processes it internally and outputs a reading. That reading may be correct. It may be plausible. It may also have been smoothed, supplemented or stated with false certainty at a critical place.

This is where the black-box problem begins. For scholarly work, the relevant question is not only:

What text does the system output?

The more important question is:

Why does the system arrive at this reading?

For historical manuscripts this question is decisive. A source is not only made of words. It consists of visible findings: line structure, letter forms, minims, abbreviation marks, ligatures, damaged places, erasures, marginal notes, later additions, stains, material traces and uncertain sign forms.

If a system processes these findings internally but only outputs a smooth text, an essential part of the scholarly work disappears. The uncertainty has not gone away. It has merely become invisible.

HTR: strong, but training-intensive

Classical HTR systems have a clear advantage: they can become very powerful for specific handwriting types.

In most cases, however, they require training data. A human first transcribes and corrects pages as reference material. The system learns from these examples. The more uniform the script, the better the image quality and the larger the amount of training data, the more stable the result can become.

This is useful when many similar pages are available. It becomes problematic when only a few pages exist, several scribal hands appear, the script varies strongly, abbreviations are frequent, the material is damaged, or individual glyphs remain ambiguous.

HTR can still be helpful in such cases, but the output has to be controlled. Especially in medieval manuscripts, the question is not only whether a word is approximately right. The decisive question is which visible form actually occurs in the source.

LLMs: flexible, but strong in plausibility

Large language models, or LLMs, work differently from classical HTR systems. They can describe images, propose text, explain structures, formulate variants and make complex historical material more readable.

This flexibility is a major advantage. It also introduces a specific risk: LLMs are very good at turning incomplete information into plausible language.

For many tasks this is useful. For source-bound analysis of manuscripts it is risky. An LLM can turn an uncertain sign into a fluent reading, complete a damaged place, or expand an abbreviation because the context makes it likely.

The problem is not that the model is weak. The problem is that it is too good at transforming uncertainty into language.

Reading or justification?

Historical manuscript work is not only about producing text. It is about justifying a reading.

A scholarly reading should be able to show the image area to which it refers, the visible sign forms on which it rests, which places are secure, which remain uncertain, which alternatives are possible, and whether a damaged place was really read or reconstructed.

Equally important is whether the coordinate refers to the original image, to a segment or to a crop. Did the model actually see the image version to which the finding refers?

For simple text recognition, such questions are often secondary. For source-bound manuscript analysis, they are central.
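The coordinate question can be made concrete with a small sketch. It assumes the simplest possible case: the model input was produced by one rectangular crop followed by one uniform resize. The class name `CropAndScale` and its fields are hypothetical, and real pipelines may add rotation, padding or tiling that this does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropAndScale:
    """Hypothetical description of how a model input was derived from the original."""
    crop_left: int   # left edge of the crop, in original pixels
    crop_top: int    # top edge of the crop, in original pixels
    scale: float     # model-input pixels per original pixel (0.5 = halved)

    def to_original(self, x: float, y: float) -> tuple[float, float]:
        """Map a point from model-input coordinates back to the original image."""
        return (self.crop_left + x / self.scale, self.crop_top + y / self.scale)

    def to_model_input(self, x: float, y: float) -> tuple[float, float]:
        """Map a point from original coordinates into the model input."""
        return ((x - self.crop_left) * self.scale, (y - self.crop_top) * self.scale)


# Illustrative values: a crop starting at (1200, 800), downscaled by half.
transform = CropAndScale(crop_left=1200, crop_top=800, scale=0.5)
print(transform.to_original(100, 50))       # (1400.0, 900.0)
print(transform.to_model_input(1400, 900))  # (100.0, 50.0)
```

A finding reported at (100, 50) only points at the right place in the source if the transformation is recorded alongside it. Without that record, the same pair of numbers could refer to a different place in the original.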

The model input is not automatically the source

Another point is often underestimated: the image analyzed by an AI model is not necessarily identical with the original image file.

Between the original image and the model input there may be upload, scaling, compression, format conversion, cropping, segmentation, internal preprocessing or resolution reduction.

An AI statement about an image is therefore first a statement about exactly the image version that was actually passed to the model. It is not automatically a statement about the original.

Serious manuscript analysis must document which image was analyzed: original file, image dimensions, hash value, segment, crop, model input, coordinate space and transformations.

See also: When the model input is not the source.
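A minimal sketch of such a documentation record could look as follows. Every name in it (`ModelInputRecord`, `build_record`, the field names, the file name `folio_12r.tif`) is an assumption made for illustration, and the byte strings are placeholders for real image files. Only `hashlib.sha256` is taken from the Python standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelInputRecord:
    """Hypothetical provenance record for one model call."""
    original_file: str
    original_sha256: str
    original_size: tuple[int, int]        # (width, height) in pixels
    crop_box: tuple[int, int, int, int]   # (left, top, right, bottom), original pixels
    model_input_sha256: str
    model_input_size: tuple[int, int]     # (width, height) actually sent to the model
    coordinate_space: str                 # space in which findings are reported
    transformations: list[str] = field(default_factory=list)


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_record(original_file, original_bytes, original_size, crop_box,
                 model_input_bytes, model_input_size, transformations):
    """Hash both image versions and bundle them with the transformation log."""
    return ModelInputRecord(
        original_file=original_file,
        original_sha256=sha256_hex(original_bytes),
        original_size=original_size,
        crop_box=crop_box,
        model_input_sha256=sha256_hex(model_input_bytes),
        model_input_size=model_input_size,
        coordinate_space="model_input",
        transformations=list(transformations),
    )


# Placeholder bytes stand in for the real original and the real model input.
record = build_record(
    "folio_12r.tif", b"original image bytes", (4000, 6000),
    (1200, 800, 2200, 1300),
    b"downscaled crop bytes", (500, 250),
    ["crop", "resize 0.5", "convert to JPEG"],
)
print(json.dumps(asdict(record), indent=2))
```

Two separate hashes are kept on purpose: one identifies the original file, the other identifies what the model actually received. If they are the same, no transformation took place. If they differ, the transformation list explains why.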

Why “image → text” is not enough

The classical objective of many systems is simple: image should become text. For many applications this is sufficient. For scholarly manuscript analysis it is not.

In historical sources, text is not simply present in the image like machine-readable code. It has to be inferred from visible traces.

Image → finding → structure → segment → glyph → minim cluster → abbreviation → candidate → reading → quality control

Each level can introduce errors. If a system skips these levels and directly outputs text, it creates apparent clarity. The output looks clean, but the path to it remains unclear.

This is especially problematic with minim clusters, abbreviations, damaged letters, ligatures, marginal signs, superscript or subscript forms, later corrections and poorly preserved areas.

What HTR and LLMs can contribute

HTR and LLMs are not useless. On the contrary, both can play important roles.

HTR can help process larger quantities of similar handwriting. LLMs can help structure findings, formulate variants, express uncertainty clearly and make complex analysis processes more intelligible.

But neither system should be treated as the final authority. They are tools, not the source.

The goal: transparent transcription

Transparent manuscript analysis should not only say: this is what stands there. It should show why a reading is proposed. It should also show where the reading remains uncertain.

This requires systems that store not only final text but analysis artifacts: image segments, coordinates, layout data, glyph findings, minim clusters, variants, uncertainty markers, quality checks and model-input records.

In this way, mere text recognition becomes verifiable analysis.
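As a rough illustration, an analysis artifact for a single place in the source might be stored like this. The schema is an assumption, not an established standard: `ReadingRecord`, `Candidate`, the status values and the example coordinates are all hypothetical. The example uses a cluster of three minims, which can be read as several different letters or letter sequences.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

# Hypothetical status vocabulary for this sketch.
ALLOWED_STATUS = {"secure", "uncertain", "reconstructed"}


@dataclass
class Candidate:
    text: str    # proposed reading
    basis: str   # visible sign forms the proposal rests on


@dataclass
class ReadingRecord:
    region: tuple[int, int, int, int]   # (left, top, right, bottom)
    coordinate_space: str               # e.g. "original" or "model_input"
    status: str
    candidates: list[Candidate] = field(default_factory=list)
    note: str = ""

    def __post_init__(self) -> None:
        if self.status not in ALLOWED_STATUS:
            raise ValueError(f"unknown status: {self.status!r}")
        if self.status == "secure" and len(self.candidates) != 1:
            raise ValueError("a secure reading needs exactly one candidate")

    def final_text(self) -> Optional[str]:
        """Return a text only when the reading is secure, otherwise None."""
        return self.candidates[0].text if self.status == "secure" else None


record = ReadingRecord(
    region=(1410, 905, 1452, 940),
    coordinate_space="original",
    status="uncertain",
    candidates=[
        Candidate("m", "three minims, read as one letter"),
        Candidate("in", "three minims, faint stroke above the first"),
    ],
    note="No candidate is promoted to final text.",
)
print(json.dumps(asdict(record), indent=2))
print(record.final_text())  # None, because the place is marked uncertain
```

The design choice here is that an uncertain place never yields a final text on its own. A downstream step has to decide explicitly what to do with the candidates, and that decision can itself be recorded.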

Avoiding the black box does not mean avoiding AI

The critical issue is not whether AI is used. The critical issue is the role assigned to AI.

If AI is used as a black box, it produces text while the user cannot see whether the reading rests on visible evidence or on plausibility.

If AI is embedded in a traceable pipeline, it can be useful. The order is then not:

Image → AI → finished text

but:

Image → documented finding → technical analysis → AI-assisted evaluation → variant check → uncertainty status → justified reading

This changes the role of AI fundamentally. It is not the judge of the source, but a tool within a controlled analysis process.

Training effort is not the only problem

The discussion is often reduced to a practical question: do we need training data or not?

HTR often needs project-specific training. LLMs usually do not require this kind of user-level training because they are broadly pretrained. But that does not automatically solve the scholarly problem.

Even an LLM without project-specific training must show what its statement rests on. The decisive question is not only how much training a system needs, but how well it can bind its reading back to visible evidence.

Uncertainty is not a defect

In many automatic systems, uncertainty appears as a defect. The system is expected to be certain, smooth and complete.

For historical sources this is problematic. An uncertain place is not automatically a system error. It may be a real state of the source.

Perhaps the letter is damaged. Perhaps the abbreviation is ambiguous. Perhaps a minim cluster cannot be resolved. Perhaps the image quality is insufficient. Perhaps several plausible readings remain possible.

In that case the correct output is not a smooth text. The correct output is marked uncertainty.
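What marked uncertainty could look like in an output is sketched below. The bracket notation is an ad hoc assumption for this example, not an established editorial convention, and `Token` and `render` are hypothetical names. A real project would more likely map these states onto its own edition guidelines or markup scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    status: str = "secure"   # "secure", "uncertain", "reconstructed" or "illegible"
    alternatives: list[str] = field(default_factory=list)


def render(tokens: list[Token]) -> str:
    """Render tokens as text while keeping every non-secure place visible."""
    parts = []
    for token in tokens:
        if token.status == "secure":
            parts.append(token.text)
        elif token.status == "uncertain":
            # Show the preferred reading together with its alternatives.
            options = "|".join([token.text, *token.alternatives])
            parts.append(f"[{options}?]")
        elif token.status == "reconstructed":
            # Text supplied from context, not read from the image.
            parts.append(f"<{token.text}>")
        elif token.status == "illegible":
            parts.append("[...]")
        else:
            raise ValueError(f"unknown status: {token.status!r}")
    return " ".join(parts)


line = [
    Token("in"),
    Token("nomine"),
    Token("domini", "uncertain", ["domino"]),
    Token("", "illegible"),
    Token("amen", "reconstructed"),
]
print(render(line))  # in nomine [domini|domino?] [...] <amen>
```

A smooth output would have printed five confident words. This version prints two secure words, one open alternative, one gap and one reconstruction, which is closer to the actual state of the source.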

Conclusion: not faster to text, but cleaner toward reading

HTR and LLMs are changing work with historical manuscripts. They can make access easier, process large amounts of material and generate initial readings.

But they do not automatically solve the basic problem of source criticism. In historical manuscripts, text is not only an output. It is a hypothesis about visible traces.

A black-box output is therefore not enough. Scholarly users need answers not only to the question of what stands there, but also: where can it be seen, how certain is it, which alternatives remain possible, which image version was analyzed, and which uncertainty was documented?

The future of digital manuscript analysis does not lie only in better models. It lies in better systems of evidence.

AI does not create the truth. The source remains the standard.