What AI Gets Right, And What It Quietly Changes In Korean And Japanese Financial Documents

This exploratory test looks at how two AI models handle Korean and Japanese financial documents. It was designed to see whether AI outputs change when the source material moves between English and Korean or Japanese, and whether AI introduces its own layer of interpretation by making choices about names, structure, and framing that the source material didn't make.

At a Glance

Both models handled the myriad-based counting systems in Korean and Japanese reliably across every run, parsing 만/억/조 and 万/億/兆 conversions accurately.
When models read foreign-language documents and produce English outputs, they may not preserve names exactly as they appear in the source.
In our single exploratory runs, prompting in the source language avoided the naming problems, but it produced output that was structured differently.

When a Korean appraisal, Japanese loan agreement, or local-market research report lands in a cross-border deal room, most investment teams work from translated summaries rather than the original documents. Earlier this year, ToltIQ's "Lost in Translation: Why Japanese Documents Are Private Equity’s Blind Spot" piece argued that this matters because source documents often carry legal, financial, and organizational distinctions that do not map cleanly into English, while translations are usually treated as reference materials rather than the controlling text. That raised the question we wanted to test: when AI helps bridge that language gap, where does it preserve the source faithfully, and where does it introduce its own layer of interpretation?

As a Korean-English bilingual analyst on ToltIQ’s research team, I helped design and review this test. The goal was not to run a controlled benchmark but to see where AI outputs changed when the source material moved between English and non-English documents. We tested with Korean and Japanese for two reasons: both use non-Roman scripts and are structurally distant from English, and both express large numbers in counting units of ten thousand rather than one thousand. English introduces a new unit name every three zeros (thousand, million, billion), while Korean and Japanese introduce a new marker every four zeros: 만/万 for 10,000 (10^4), 억/億 for 100 million (10^8), and 조/兆 for one trillion (10^12).
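To make the four-zero grouping concrete, here is a minimal Python sketch of how a myriad-based amount can be converted to a plain integer. It is an illustration only, not the method either model uses. The function name parse_myriad and the sample amounts are hypothetical and are not taken from the CBRE reports, and the sketch assumes coefficients are written in Arabic digits, as is common in financial tables.

```python
import re
from decimal import Decimal

# Myriad-based unit markers. Korean and Japanese markers are listed together
# because each pair denotes the same magnitude.
UNITS = {
    "만": 10**4, "万": 10**4,    # ten thousand
    "억": 10**8, "億": 10**8,    # one hundred million
    "조": 10**12, "兆": 10**12,  # one trillion
}

# A coefficient in Arabic digits (commas and one decimal point allowed),
# followed by an optional unit marker.
TOKEN = re.compile(r"([0-9][0-9,]*(?:\.[0-9]+)?)\s*([만억조万億兆]?)")

def parse_myriad(text: str) -> int:
    """Hypothetical helper: sum every coefficient multiplied by its unit.

    Assumes `text` holds a single amount; any other digits in the string
    (for example a year) would be added to the total as well.
    """
    total = Decimal(0)
    for number, unit in TOKEN.findall(text):
        coefficient = Decimal(number.replace(",", ""))
        total += coefficient * UNITS.get(unit, 1)
    return int(total)

# Illustrative amounts only (not figures from the reports).
korean_amount = parse_myriad("3조 2,500억원")
japanese_amount = parse_myriad("1兆2,000億円")
```

Decimal is used instead of float so that coefficients such as 1.5 are multiplied exactly. Amounts written with native numerals (for example 三億) are outside the scope of this sketch.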

For the testing, we used CBRE market research reports for Seoul and Tokyo, which are institutional-quality documents available in both English and their native languages. They're not deal documents specifically, but they contain several language-processing challenges that also matter in deal work: numeric units, entity names, and market terminology. Within each paired report, the English and native-language versions provided a practical way to compare how outputs changed when the source language changed.

We tested two ToltIQ-integrated models, Claude Opus 4.7 and GPT 5.5, in two stages. First, we checked whether they could parse Korean and Japanese numeric systems. Then we ran a four-prompt analyst workflow: extracting key market data, listing recent transactions, building a comp table, and synthesizing a market outlook. Each task was designed to test a different part of foreign-language financial document processing.

The Numeric System Test: Where the Models Delivered

We expected numeric extraction to be the first point of failure. Because Korean and Japanese large-number notation follows a different counting structure from English, it seemed like a meaningful stress test. In private equity diligence, that kind of error is not a rounding issue: if a model confuses 억/億 (100 million) with 조/兆 (one trillion), the output is off by four orders of magnitude.

In these tests, both models handled this correctly across three separate runs per test. They parsed Korean and Japanese unit conversions accurately, including compound expressions and table headers. Our initial assumptions about numeric conversion turned out to be wrong, which shifted the question: if the models could extract numbers correctly, where could errors appear when the task moved from extraction to interpretation?

Over-Translation: When Models Add Meaning That Isn't in the Source

After the numeric system test, we ran both models through a four-prompt analyst workflow on each report. Each model ran the sequence on the English version and the native-language version of the same document using the same prompts (in English), with the only change being the document language.

Extraction was largely accurate. However, when processing the foreign-language documents, both models did something we didn't expect. For example, the Korean report references a building called “Rene Square,” which is exactly what the English edition of the same report calls it. Neither language version of the report ever uses the word "Renaissance."

Yet both models, when reading the Korean document, output the building's name as "Renaissance Square." They appear to have inferred that "Rene" must be short for "Renaissance" rather than recognizing it as a transliterated proper name. This is not a hallucination in the usual sense: the building is real and the context is correct. It is an addition of meaning that the source material didn’t support. The models appeared to decide what the building name should be in English rather than reading what the source actually says. By comparison, when the same models read the English version of the same report, they correctly output "Rene Square."

This suggests document language as a likely driver, although model randomness, source-version differences, and entity-name conventions cannot be ruled out (see Appendix 1). This behavior is also consistent with research showing that multilingual models trained predominantly on English carry English-centric patterns into other languages and “inevitably bring traces of English habits into other languages when transferring their notions” (Guo).

A similar pattern appeared in the Japanese report. The English version references “CO-MO-RE YOTSUYA.” When the models read the Japanese source, one returned “Comoré Yotsuya,” adding an accent mark, while the other returned “Comore Yotsuya,” dropping the stylized formatting. Those are small changes, but in a comp table or transaction database, small name changes can create matching problems (see Appendix 2 in PDF).
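One way to see why these small changes matter for matching, and how a database might absorb them, is a normalized matching key. The Python sketch below is a hedged illustration, not a ToltIQ feature. The function name match_key is hypothetical, and the choice to discard accents, punctuation, spacing, and case is an assumption that suits matching only; the stored display name should remain the source's own spelling.

```python
import unicodedata

def match_key(name: str) -> str:
    """Hypothetical matching key: drop accents, punctuation, spacing, and case."""
    # NFKD decomposition splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks, hyphens, and spaces are not alphanumeric, so they drop out.
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()

# The three spellings discussed above.
variants = ["CO-MO-RE YOTSUYA", "Comoré Yotsuya", "Comore Yotsuya"]
keys = {match_key(v) for v in variants}
```

All three spellings collapse to a single key, so an exact match on the key would treat them as one asset. A key like this only handles surface differences. It would not reconcile "Rene Square" with "Renaissance Square," which differ in actual letters.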

In another prompt, a related but slightly different issue appeared. The English version of the Japan report lists a buyer as "Toyota Fudosan Co., Ltd.," which is the company's established English trade name. The Japanese report uses "トヨタ不動産株式会社" (Toyota Fudōsan Kabushiki Kaisha). GPT 5.5, reading the Japanese document, correctly recognized and used "Toyota Fudosan Co., Ltd." However, Opus 4.7, reading the same Japanese document, translated the name and output "Toyota Real Estate." Both outputs are linguistically defensible, but only one preserves the established English entity name. In a deal context, legal entity name consistency matters (see Appendix 3).

The practical consequence of these naming discrepancies is that if the AI outputs the same asset or counterparty differently depending on which language version it reads, a bilingual analyst has to resolve the mismatch. To be fair, a bilingual analyst would likely catch both kinds of discrepancy quickly during review.

The larger concern is that the output does not flag the change. Across a larger data room, a slight name change can make the same entity look like several different entities across AI outputs, creating duplicate records, missed matches, or inconsistent comp tables.
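Because the output itself carries no flag, one option is to surface likely duplicates for human review after the fact. The Python sketch below uses the standard library's difflib for that purpose. The function name flag_near_duplicates and the 0.7 similarity threshold are assumptions chosen for illustration and would need tuning on real data. The sketch only proposes candidate pairs; it does not decide which spelling is correct.

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_near_duplicates(names, threshold=0.7):
    """Hypothetical helper: return name pairs similar enough to need review."""
    flagged = []
    # Compare every distinct pair of names once.
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if score >= threshold:  # threshold is an illustrative assumption
            flagged.append((a, b, round(score, 2)))
    return flagged

# Names as they might appear across outputs from different language versions.
names = ["Rene Square", "Renaissance Square", "Toyota Fudosan Co., Ltd."]
pairs = flag_near_duplicates(names)
```

With these inputs, only the "Rene Square" and "Renaissance Square" pair crosses the threshold. Character similarity would not link a translated name such as "Toyota Real Estate" to "Toyota Fudosan Co., Ltd.," so translated entity names still need a bilingual reviewer or a maintained alias list.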

Existing research on crosslingual consistency points the same way: LLMs frequently produce inconsistent outputs across languages, particularly when those languages differ in script or linguistic structure, because the model fails to align an entity to a single shared representation (Liu).

Removing the Language Bridge

A follow-up question was whether the over-translation issue can be fixed by keeping everything in the source language: prompting in Korean on the Korean document and in Japanese on the Japanese document. We tested this by translating the same four analyst prompts into Korean and Japanese and running them against the native-language documents.

In those single exploratory runs, the naming issues did not reappear. Both models preserved the names as they appear in the source, and the analytical conclusions pointed to the same directional recommendations regardless of prompt language.

We did notice one difference in output structure. When we prompted the model in English to diagnose the decline in Seoul CRE investment using the Korean-language report, the model produced a market data table with 10 rows, each metric broken out individually. However, when we ran the same prompt in Korean on the same document, the model returned a 5-row table that consolidated multiple metrics into single rows. The data was the same, but it was organized differently (see Appendix 4).

We would not read this as evidence that English prompts produce “more detailed” outputs. It may reflect model formatting behavior, prompt-language effects, the compact nature of Korean and Japanese expression, or simply the non-deterministic nature of model outputs.

The more practical constraint is that source-language output stays in the source language. Source-language prompting may reduce over-translation, but if the analysis needs to feed into an English IC memo, someone still has to translate it, and the entity-matching issues may resurface at that stage.

What This Test Suggests

Three patterns came out of this testing.

The first pattern is that numeric extraction worked in these tests. Both models handled the myriad-based counting systems in Korean and Japanese reliably across every run, parsing 만/억/조 and 万/億/兆 conversions accurately.
The second is over-translation. When models read foreign-language documents and produce English outputs, they may not preserve names exactly as they appear in the source. Instead, they may translate or reshape those names into what seems like more natural English. The risk isn't that the models get the facts wrong, but that they silently change names in ways that could create entity-matching issues across a document set. A bilingual reviewer would catch these quickly, but the pattern is worth being aware of because it isn't flagged in the output.
The third is a trade-off in prompt language. In our single exploratory runs, prompting in the source language avoided the naming problems, but it produced output that was structured differently. The more practical consideration is that source-language output stays in the source language, which means someone still has to bridge back to English if the analysis feeds into an English-language deliverable.

None of these patterns are definitive, as this research was an exploratory test with two models and two document sets. But it suggests that the interaction between AI and foreign-language financial documents is worth paying attention to, especially as these tools become more embedded in analytical workflows.

Implications for Multilingual Diligence

For analysts reviewing AI outputs from multilingual VDRs, this suggests that fluency alone may not be the right measure of reliability. It may be just as important to check whether the model preserved the source document’s choices, especially in areas like:

Scale: Confirm large-number units before relying on financial figures.
Entity-Matching: Check whether company, asset, and counterparty names are preserved exactly rather than translated or normalized.
Consistency: Reconcile repeated names across documents and languages.
Framing: Review tables, metrics, and local terms to see whether the model has reorganized or Anglicized them.

ToltIQ's earlier piece on Japanese documents argued that most deal teams are reviewing interpretations of documents, not the documents themselves. What this test adds is that AI introduces its own layer of interpretation by making choices about names, structure, and framing that the source material didn't make. For ToltIQ, the broader goal is to build AI workflows that make these cross-language choices visible rather than leaving them buried in fluent English output. Understanding where those choices happen is the first step toward building workflows that account for them.

The next step is to test whether targeted prompting can reduce issues like over-translation, and whether the same patterns become more pronounced in messier, real-world deal documents. The specific model behaviors may change over time, but the underlying diligence question remains: when analysis crosses languages, what gets preserved or changed, and how do teams know the difference?

Appendices available in the downloadable PDF.