benchmark
0 submissions · 0 areas
simulation
tested
open
gpt-4o
anth
claude-3-7-sonnet-20250219
openai
anthropic
deepinfra
simulator
openai
anthropic
deepinfra
turns
rounds
3
evaluation
judge
open
gpt-4o
openai
anthropic
deepinfra
rounds
3
run