AI · Document Intelligence · Canon

Canon: a sovereign AI document classification engine that outperforms the hyperscaler

A custom model that beats Azure AI Document Intelligence on two of Canon's three datasets, with zero dependency on third-party AI providers

94.7%
accuracy vs Azure's 84.2%

Challenge

Canon needed to automate the classification of business documents at scale with a proprietary AI, free of any dependency on third-party providers such as OpenAI. Technological sovereignty was central: the solution had to be GDPR and EU AI Act compliant, run on a private cloud, and integrate natively with the existing document management ecosystem, all while matching or beating off-the-shelf hyperscaler tooling.

Approach

We built a custom classification pipeline: Tesseract OCR extracts the text, which is turned into features (Word2Vec, TF-IDF, n-grams) and passed to an SVM classifier, selected after benchmarking several algorithms. The model is fully open-source and adapts to the client's own corpus. It is deployed on a private Azure cloud and exposed to the DMS through a secure REST API, packaged as two lightweight Python scripts: one to train and export the model, one for fast inference.
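To make the train/export and inference split concrete, here is a minimal sketch, not the production code. It assumes scikit-learn as the library, takes text that OCR has already extracted, and uses only TF-IDF over word unigrams and bigrams (Word2Vec features are left out). The toy corpus, the labels, and the function names `train_and_export` and `classify` are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for OCR output; this is not Canon data.
TRAIN_TEXTS = [
    "invoice number 1042 total amount due within 30 days",
    "invoice for services rendered, amount payable on receipt",
    "tax invoice: subtotal, VAT and total amount due",
    "this agreement is entered into by the parties below",
    "the parties agree to the terms of this contract",
    "contract term, termination clause and governing law",
]
TRAIN_LABELS = ["invoice", "invoice", "invoice", "contract", "contract", "contract"]


def train_and_export(texts, labels):
    """Training script: fit TF-IDF (1- and 2-grams) + linear SVM, return serialized model."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LinearSVC(),
    )
    model.fit(texts, labels)
    # Serialized to bytes here; a real script would write these bytes to a file.
    return pickle.dumps(model)


def classify(model_bytes, texts):
    """Inference script: load the exported model and predict one label per document."""
    model = pickle.loads(model_bytes)
    return list(model.predict(texts))


exported = train_and_export(TRAIN_TEXTS, TRAIN_LABELS)
predictions = classify(
    exported,
    [
        "total amount due on this invoice",
        "governing law of the agreement between the parties",
    ],
)
print(predictions)
```

Keeping the vectorizer and the classifier in a single pipeline object means the inference script loads one artifact and applies exactly the same feature extraction that was used in training.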

Outcomes

  • Outperformed Azure AI Document Intelligence on two of Canon's three datasets (94.73% vs 84.21% and 87.5% vs 85.71%), with both reaching 100% on the third
  • Fully open-source model, zero dependency on external AI providers
  • Private-cloud deployment aligned with GDPR and the EU AI Act
  • Reference architecture for sovereign document AI in regulated sectors
