This conversational AI company needed help creating training data to improve the safety and reliability of their large language model relative to specific regions, languages, and cultures.
“It’s difficult to create a text dataset that is unbiased and of sufficiently high quality for our model to perform well after pre-training.” – Head of Machine Learning
Problem
The client was developing a first-to-market Large Language Model (LLM) service for India and needed an R&D partner to develop an approach to reducing harm while optimizing helpfulness in model outputs. The partner needed to understand the nuances of the Indian social context and provide high-quality content and nuanced evaluations.
Solution
Due to the unprecedented nature of the project, iMerit began by establishing a set of best practices for hiring and sourcing talent. To ensure data quality and diversity, iMerit teams systematically scoped, researched, and applied information on a myriad of possible cultural, political, and social domains.
Results
The project generated 60,000 prompts that ensured ethical data diversity across 10 languages, and the client included this data in their model's training data. The large language model was released to widespread acclaim, resulting in a growing user base and media coverage.
BOTTOM LINE IMPACT
60K
Prompts Generated
10
Different Languages
Widespread
Media Coverage