This conversational AI company needed help creating training data to improve the safety and reliability of its large language model across specific regions, languages, and cultures.

“It’s difficult to create a text dataset that is unbiased and of sufficiently high quality for our model to perform well after pre-training.”

– Head of Machine Learning


Problem 

The client was developing a first-to-market Large Language Model (LLM) service for India and needed an R&D partner to develop an approach to reducing harm while optimizing helpfulness in model outputs. This partner needed to understand the nuances of the Indian social context and provide high-quality content and nuanced evaluations.

Solution

Given the unprecedented nature of the project, iMerit began by establishing a set of best practices for hiring and sourcing talent. To ensure data quality and diversity, iMerit teams systematically scoped, researched, and applied knowledge spanning a wide range of cultural, political, and social domains.

Results

iMerit generated 60,000 prompts that ensured ethical data diversity across 10 languages, which the client incorporated into the model's training data. The large language model was released to widespread acclaim, resulting in a growing user base and extensive media coverage.


BOTTOM LINE IMPACT

60K Prompts Generated

10 Different Languages

Widespread Media Coverage