Tormenting a frontier model to fail 100% of the time

  • 600 Chain-of-Thought Math Problems
  • Math Subdomains
  • 2-Week Turnaround

A leading foundation model company leveraged iMerit Scholars and the Ango Deep Reasoning Lab to create original math problems capable of stumping its current development model.

Challenge

A leading foundation model team needed a corpus of 600 original mathematical chain-of-thought prompts and multi-turn human-model interactions, including turn-by-turn evaluation of model output, step-wise corrections, and reasoning traces, to construct a test set for its latest reasoning models. The delivery timeline was 2 weeks.

The key challenge was to construct problems that would make the model fail. Additionally, the development build of the model had to be integrated into Ango Deep Reasoner so that annotators could interact with it and correct its missteps. Finally, a JSON trace of the entire session, including failed attempts by the model, had to be generated and delivered as the output.
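The client's actual trace schema is not public, but a session trace of this general shape can be sketched as follows. All field names here (`problem_id`, `turns`, `evaluation`, and so on) are illustrative assumptions, not the delivered format:

```python
import json

# Hypothetical multi-turn session trace: a prompt, a failed model attempt with
# its evaluation, an annotator correction, and the corrected model answer.
trace = {
    "problem_id": "algebra-042",                      # assumed identifier scheme
    "prompt": "Find all real x such that x^2 - 5x + 6 < 0.",
    "turns": [
        {
            "role": "model",
            "content": "x < 2 or x > 3",              # failed attempt
            "evaluation": {"correct": False, "error_step": 2},
        },
        {
            "role": "annotator",
            "content": "The parabola opens upward, so the inequality "
                       "holds between the roots: 2 < x < 3.",
        },
        {
            "role": "model",
            "content": "2 < x < 3",
            "evaluation": {"correct": True},
        },
    ],
}

# Serialize the whole session, failed attempts included, as deliverable JSON.
print(json.dumps(trace, indent=2))
```

Keeping the failed attempts in the trace is what makes the data useful for training: the correction turns show the model where and how its reasoning went wrong.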

Solution

  • 60+ math MA and PhD Scholars recruited
  • 66% rejected during onboarding assessments and selection
  • Real-time model integration
  • Continuous prompt/response evaluation, correction, and training

To improve model performance, iMerit sourced over 60 math MA and PhD holders for the Scholars program. Each candidate was assessed beyond their credentials on a set of sample problems, evaluating not only their ability to prompt the model into giving incorrect answers, but also their ability to teach the model the correct steps to solve those problems.

Over 40 mathematicians (66%) were rejected due to a lack of problem-setting skills or the mindset needed to "teach" the model how to solve the problems it got wrong.

As part of the solution, iMerit integrated the Ango Deep Reasoning Lab's workflow manager with a Bring-Your-Own-Model setup for the customer's development model and built an interface for step-by-step evaluation and correction.
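The step-by-step evaluation and correction workflow can be sketched as a simple multi-turn loop. This is a minimal sketch under stated assumptions: `model`, `grade`, and `correct_step` are hypothetical stand-ins for the customer's dev-model endpoint and the annotators' judgments, none of which are public:

```python
from typing import Callable

def run_session(problem: str,
                model: Callable[[list[dict]], str],
                grade: Callable[[str], bool],
                correct_step: Callable[[str], str],
                max_turns: int = 3) -> list[dict]:
    """Multi-turn loop: query the model, grade each answer, and feed back
    a step-wise correction until it succeeds or turns run out."""
    history = [{"role": "user", "content": problem}]
    for _ in range(max_turns):
        answer = model(history)
        history.append({"role": "model", "content": answer})
        if grade(answer):
            break
        history.append({"role": "annotator", "content": correct_step(answer)})
    return history

# Toy stand-ins so the sketch runs end-to-end: the "model" answers wrong
# once, receives a correction, then answers correctly.
answers = iter(["x = 4", "x = 3"])
history = run_session(
    "Solve x + 2 = 5.",
    model=lambda h: next(answers),
    grade=lambda a: a == "x = 3",
    correct_step=lambda a: "Subtract 2 from both sides.",
)
print(len(history))  # prompt + wrong answer + correction + right answer → 4
```

The resulting `history` list is exactly the kind of turn-by-turn record that gets serialized into the delivered JSON traces.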

To ensure high quality and consistency, iMerit created a custom, automation-assisted tool that enforced schema compliance, supported mathematical notation, and verified computational accuracy. Semantic deduplication was also integrated to improve the overall training data.
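Semantic deduplication of this kind is typically done with embedding similarity; since the actual model and threshold are not disclosed, the sketch below uses token-overlap (Jaccard) similarity as a deliberately simple stand-in to show the greedy keep-or-reject idea:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two problem statements."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def dedupe(problems: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a problem only if it is not too similar to any already-kept one.
    The 0.8 threshold is an illustrative assumption, not the project's value."""
    kept: list[str] = []
    for p in problems:
        if all(jaccard(p, q) < threshold for q in kept):
            kept.append(p)
    return kept

batch = [
    "Prove that the sum of two odd integers is even.",
    "Prove that the sum of two odd integers is even!",   # near-duplicate
    "Find the number of primes below 100.",
]
print(dedupe(batch))  # the near-duplicate is dropped
```

In production, swapping `jaccard` for cosine similarity over sentence embeddings catches paraphrased duplicates that share few exact tokens.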

Result

The rigorous pre-qualification and vetting process ensured that only qualified experts handled prompt-response generation.

The Ango platform and its advanced tools provided a highly customized interface for presenting tasks to users, displaying them in a way that maximized clarity and engagement and made them easier to understand and complete.

Over 600 original problems with a 100% model reasoning failure rate were delivered to the client within 2 weeks.

Beyond the RLHF training data, the client learned which types of problems its model performed poorly on and could plan future projects for improvement.