A lot of our effort goes into creating benchmarks to objectively evaluate our model’s performance, and we’ve open-sourced them here:
https://github.com/medaks/medask-benchmarks
We’ve developed both an OSCE-style diagnostic benchmark and a medical triage classification task.
The 12% improvement over o3 comes from our triage benchmark, which evaluates models on clinical vignettes across emergency, non-emergency, and self-care classifications.
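To make the triage task concrete, here is a minimal sketch of how such a benchmark can be scored: each clinical vignette carries a gold label from the three triage classes, and a model is measured by its classification accuracy. All names, the sample vignettes, and the baseline below are illustrative assumptions, not taken from the actual repo.

```python
# Hypothetical triage-benchmark scorer (illustrative, not the repo's code).
TRIAGE_CLASSES = {"emergency", "non-emergency", "self-care"}

def triage_accuracy(vignettes, predict):
    """Score predict(text) -> class over (text, gold_label) pairs."""
    correct = 0
    for text, gold in vignettes:
        assert gold in TRIAGE_CLASSES, f"unknown label: {gold}"
        if predict(text) == gold:
            correct += 1
    return correct / len(vignettes)

# Tiny made-up vignettes with gold triage labels.
sample = [
    ("Crushing chest pain radiating to the left arm", "emergency"),
    ("Mild seasonal allergies, no fever", "self-care"),
    ("Persistent cough for two weeks, no distress", "non-emergency"),
]

def keyword_baseline(text):
    # Trivial keyword heuristic, just to exercise the scorer.
    lowered = text.lower()
    if "chest pain" in lowered:
        return "emergency"
    if "allergies" in lowered:
        return "self-care"
    return "non-emergency"

print(triage_accuracy(sample, keyword_baseline))  # 1.0 on this toy sample
```

In practice the `predict` function would wrap a call to the model under test, and accuracy can be broken down per class, since over-triaging to "emergency" and under-triaging to "self-care" carry very different clinical costs.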
You may want to change the outcome wording so it doesn't assert "You have XXX": partly for the obvious reason, but also because you present several possible diagnoses, not a single combined one.