benchmark 0 submissions · 0 areas
simulation
  tested: open gpt-4o, anth claude-3-7-sonnet-20250219
  simulator
  turns
  rounds: 3
evaluation
  judge: open gpt-4o
  rounds: 3