2 comments

  • klemenvod 3 days ago
    A lot of our effort goes into creating benchmarks to objectively evaluate our model’s performance, and we’ve open-sourced them here: https://github.com/medaks/medask-benchmarks

    We’ve developed both an OSCE-style diagnostic benchmark and a medical triage classification task. The 12% improvement over o3 comes from our triage benchmark, which evaluates models on clinical vignettes across emergency, non-emergency, and self-care classifications.
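
    For anyone curious how a three-way triage task like this might be scored, here is a rough sketch. It is not taken from the medask-benchmarks repo: the function name `triage_accuracy`, the label strings, and the sample data are all made up for illustration, and plain accuracy is only one possible metric.

```python
from collections import Counter

# Hypothetical label strings, based on the three classes named above.
LABELS = ("emergency", "non-emergency", "self-care")


def triage_accuracy(gold, predicted):
    """Return (overall accuracy, per-class accuracy) for triage labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no vignettes to score")

    # Count how many vignettes carry each gold label, and how many of
    # those the model classified correctly.
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)

    overall = sum(correct.values()) / len(gold)
    per_class = {
        label: correct[label] / total[label]
        for label in LABELS
        if total[label]  # skip classes absent from the gold labels
    }
    return overall, per_class


# Made-up example data, not benchmark results.
gold = ["emergency", "self-care", "non-emergency", "emergency"]
pred = ["emergency", "non-emergency", "non-emergency", "emergency"]
overall, per_class = triage_accuracy(gold, pred)
```

    Reporting per-class accuracy next to the overall number is a choice made here because a model that under-triages emergencies can still look fine on the aggregate.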

  • aimiao 3 days ago
    It's pretty nice, from what I see!

    You may want to reword the outcome so that it doesn't assert "You have XXX", for the obvious reason, but also because you list several possible diagnoses, not all of them at once.