A custom model that beats Azure AI on Canon's own datasets, with no third-party dependencies
Canon needed to automate the classification of business documents at scale using a proprietary AI, with no dependency on third-party providers such as OpenAI. Technological sovereignty was central: the solution had to comply with the GDPR and the EU AI Act, run on a private cloud, and integrate natively with the existing document management ecosystem, all while matching or beating off-the-shelf hyperscaler tooling.
We built a custom classification pipeline: Tesseract OCR extracts the text, which is converted into features (word2vec embeddings, TF-IDF, n-grams) and passed to an SVM classifier, the best performer among the several algorithms we benchmarked. The model is fully open-source and adapts to the client's own corpus. It is deployed on a private Azure cloud and exposed to the document management system (DMS) through a secure REST API. The code is packaged as two lightweight Python scripts: one to train and export the model, and one for fast inference.
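To make the train/export and inference split concrete, here is a minimal sketch of the idea, not the production code. It assumes scikit-learn and joblib, and it assumes the text has already been extracted by OCR. It covers only the TF-IDF and n-gram features with a linear SVM; the word2vec features and the REST layer are left out. The toy corpus, the labels, the file name, and the function names are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for OCR output; labels are illustrative.
TRAIN_TEXTS = [
    "invoice number total amount due payment terms vat",
    "invoice billing address amount due net 30 days",
    "invoice total vat payment due date",
    "employment contract between employer and employee salary",
    "contract agreement parties hereby agree termination clause",
    "contract signed by both parties governing law clause",
]
TRAIN_LABELS = ["invoice", "invoice", "invoice", "contract", "contract", "contract"]


def train_and_export(texts, labels, model_path):
    """Fit TF-IDF (unigrams and bigrams) plus a linear SVM, then save to disk."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LinearSVC(),
    )
    model.fit(texts, labels)
    joblib.dump(model, model_path)  # one file holds vectorizer and classifier
    return model_path


def classify(model_path, texts):
    """Load the exported pipeline and return one predicted label per text."""
    model = joblib.load(model_path)
    return [str(label) for label in model.predict(texts)]


def demo():
    """Train, export to a temporary file, then classify two unseen snippets."""
    with tempfile.TemporaryDirectory() as tmp:
        path = train_and_export(
            TRAIN_TEXTS, TRAIN_LABELS, Path(tmp) / "classifier.joblib"
        )
        return classify(
            path,
            ["amount due on this invoice", "termination clause of the contract"],
        )


if __name__ == "__main__":
    print(demo())
```

Exporting the vectorizer and the classifier as a single pipeline object means the inference script only has to load one file and call `predict`, which keeps the two-script split simple.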