AI Workbench¶

AI Workbench is the platform's model catalog and development environment. It holds every AI model onboarded as an "AI Service of type model" — LLMs, CNNs, custom algorithms, ASR engines, and more.

1. Model Benchmarking¶

Before choosing a model for a use case, AI engineers run benchmarking to identify the best performer:

Generic benchmarking: Use publicly available or domain-specific ground truth datasets. Compare multiple models (e.g., Claude Sonnet, Gemini, GPT-4o) on metrics like accuracy, latency, and cost.
Use-case specific benchmarking: Create a custom ground truth dataset in Data Hub specific to your use case (e.g., legal covenant extraction samples). Run a new benchmark to see which model performs best on your actual data.
Results produce comparison reports showing metric scores across all selected models.

Best practice

Use generic benchmarking to eliminate clearly underperforming models, then run use-case specific benchmarking with your curated ground truth to make the final selection.

2. Model Fine-Tuning¶

When a model's off-the-shelf accuracy is insufficient:

Select the model to fine-tune (e.g., Whisper Large for ASR)
Choose the compute platform and cluster (RunPod, Denvr, Azure)
Select fine-tuning aspects:
- Core performance — general accuracy improvement
- Robustness — performance across acoustic noise levels, environments
- Ethical alignment — bias mitigation, demographic fairness
Upload datasets per aspect to Data Hub; specify train/test split (e.g., 70/30)
Run the fine-tuning job; view iterative reports with metrics like Word Error Rate (WER), Character Error Rate (CER), BLEU scores
Compare multiple fine-tuning run results side-by-side including infrastructure consumption

3. AI Governance¶

Before deploying a model, generate governance reports against organizational policies:

Upload governance policy documents (e.g., Singapore AI regulations, MUFG corporate AI policy)
Select frameworks: Atlas, NIST AI RMF (more to be added)
Create one or more policies (country-level, industry-level, corporate-level)
Run an assessment — generates a scorecard showing model compliance with each policy
All governance reports are stored centrally per model version

The AI Workbench Workflow¶

Benchmark → Fine-Tune → Evaluate → Governance → Baseline & Promote

Benchmark — Identify the best candidate model for your domain
Fine-Tune — Improve accuracy on your specific use case data
Evaluate — Compare metrics across fine-tuning iterations
Governance — Run policy assessments and generate compliance scorecards
Baseline & Promote — Lock the validated model version for inferencing/production