In the fast-moving field of artificial intelligence, Patronus AI is positioning itself to tackle two of the thorniest problems facing deployed models: hallucinations and copyright infringement. With a recent infusion of $17 million in funding, the startup is gearing up to confront these critical issues head-on while preparing its automated evaluation platform for broader enterprise adoption. Let’s explore the solutions and technology that Patronus AI brings to the forefront.
Patronus AI: A New Frontier in AI Oversight
As businesses strive to implement generative AI, concerns surrounding the accuracy and security of language models loom large, potentially hindering widespread adoption. Stepping into this arena is Patronus AI, a San Francisco-based startup that recently secured $17 million in Series A funding to automatically identify costly – and potentially risky – mistakes in language models on a large scale.
This round, which brings Patronus AI’s total funding to $20 million, was led by Glenn Solomon at Notable Capital, with participation from Lightspeed Venture Partners, former DoorDash executive Gokul Rajaram, Factorial Capital, Datadog, and several undisclosed tech executives. Founded by former Meta machine learning (ML) experts Anand Kannappan and Rebecca Qian, Patronus AI has developed an automated evaluation platform that detects errors such as hallucinations, copyright violations, and security breaches in language model outputs. Using proprietary AI technology, the platform assesses model performance, stress-tests models with adversarial examples, and enables detailed benchmarking, all without the manual effort typically required by enterprises today.
Revealing the Dark Side of Generative AI: Hallucinations, Copyright Violations, and Security Risks
“There’s a range of issues that our product excels at identifying in terms of errors,” explained Kannappan, CEO of Patronus AI, in an interview with VentureBeat. “This includes issues like hallucinations, copyright and security-related risks, as well as various industry-specific considerations related to the style and tone of brand content.”
The rise of powerful language models like OpenAI’s GPT-4o and Meta’s Llama 3 has triggered a race in Silicon Valley to leverage the technology’s generative capabilities. However, as excitement mounts, so do high-profile model failures, from tech news outlet CNET publishing error-laden AI-generated articles to biotech startups retracting research papers based on hallucinated molecules by language models.
These public blunders only scratch the surface of broader challenges inherent in today’s language models, according to Patronus AI. The company’s recent research, including the “CopyrightCatcher” API released three months ago and the “FinanceBench” benchmark unveiled six months ago, exposes significant deficiencies in leading models’ ability to accurately respond to real-world queries.
FinanceBench and CopyrightCatcher: Uncovering LLM Deficiencies
In its “FinanceBench” benchmark, Patronus tasked models like GPT-4 with answering financial questions based on public SEC filings. Surprisingly, the top-performing model accurately answered only 19% of questions after analyzing an entire annual report. In a separate experiment using Patronus’ new “CopyrightCatcher” API, open-source language models replicated copyrighted text verbatim in 44% of outputs.
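The kind of verbatim replication that CopyrightCatcher measures can be illustrated with a simple check. The sketch below is hypothetical and is not how the actual API works (its internals are not public): it flags a model output when the output shares a sufficiently long exact run of words with any document in a reference corpus of protected text.

```python
# Hypothetical sketch of verbatim-copy detection, in the spirit of a service
# like CopyrightCatcher (whose real implementation is not public). A model
# output is flagged if it reproduces a long enough exact word-run from any
# document in a reference corpus.

def longest_shared_run(output: str, reference: str) -> int:
    """Length of the longest run of consecutive words appearing in both texts."""
    out_words, ref_words = output.split(), reference.split()
    # Classic dynamic-programming longest-common-substring, over word tokens.
    prev = [0] * (len(ref_words) + 1)
    best = 0
    for ow in out_words:
        curr = [0] * (len(ref_words) + 1)
        for j, rw in enumerate(ref_words, start=1):
            if ow == rw:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def flags_verbatim_copy(output: str, corpus: list[str], threshold: int = 8) -> bool:
    """True if the output reproduces `threshold`+ consecutive words from any source."""
    return any(longest_shared_run(output, doc) >= threshold for doc in corpus)
```

A word-run threshold is only a crude proxy; a production system would also need normalization (case, punctuation) and efficient indexing to scale beyond a toy corpus.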
“Even state-of-the-art models have been hallucinating and only achieved around 80% accuracy in finance settings,” noted Qian, the company’s CTO. “Our research has revealed that open-source models produced over 20% incorrect responses in critical areas of concern. Copyright infringement poses a significant threat – media organizations, publishers, or any entity utilizing language models must be vigilant.”
While several startups like Credo AI, Weights & Biases, and Robust Intelligence are developing tools for language model evaluation, Patronus believes its research-driven approach, leveraging the founders’ extensive expertise, sets it apart. The core methodology involves training dedicated evaluation models that pinpoint potential failure points within a given language model.
“No other company currently possesses the level of in-depth research and technology that we do,” Kannappan emphasized. “Our unique approach, centered around research, encompasses training evaluation models, developing new alignment strategies, and publishing research papers.”
This strategy has already gained traction: numerous Fortune 500 companies across diverse sectors, including automotive, education, finance, and software, use Patronus AI to safely deploy language models within their organizations. With the additional funding, Patronus plans to expand its research, engineering, and sales teams while introducing more industry benchmarks.
If Patronus realizes its vision, comprehensive automated evaluation of language models could become a standard requirement for enterprises looking to implement this technology, akin to security audits paving the way for cloud adoption. Qian envisions a future where testing models with Patronus becomes as routine as unit testing code.
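Qian’s unit-testing analogy can be made concrete: model checks written as ordinary assertions that run in any test runner alongside regular tests. The sketch below is illustrative only, with `call_model` as a hypothetical stub standing in for a real LLM endpoint.

```python
# Sketch of the "unit tests for models" idea: plain assertions against model
# outputs, runnable by any test runner. `call_model` is a stub standing in
# for a real LLM API call.

def call_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM endpoint here.
    canned = {
        "What year did the first moon landing occur?":
            "The first crewed moon landing was in 1969.",
    }
    return canned.get(prompt, "I don't know.")

def test_no_hallucinated_year():
    answer = call_model("What year did the first moon landing occur?")
    assert "1969" in answer      # factual grounding
    assert "1972" not in answer  # guard against a plausible-sounding wrong year

def test_declines_unknown():
    # The model should admit uncertainty rather than invent an answer.
    assert "don't know" in call_model("Unanswerable question")
```

Like unit tests, such checks catch regressions when a model, prompt, or provider changes, which is exactly the routine, repeatable discipline the analogy suggests.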
“Our platform is versatile, allowing our evaluation technology to be applied across various domains, whether that’s legal, healthcare, or others,” she stated. “We aim to empower enterprises in every sector to leverage the power of language models while ensuring that the models align with their specific use case requirements.”
Nevertheless, due to the black-box nature of base models and the limitless range of potential outputs, definitively validating a language model’s performance remains a challenge. By advancing the state-of-the-art in AI evaluation, Patronus aims to expedite the journey towards responsible real-world deployment.
“Measuring language model performance in an automated manner is inherently complex due to the wide spectrum of behaviors, given the generative nature of these models,” acknowledged Kannappan. “However, through a research-driven approach, we can identify errors in a reliable and scalable manner that manual testing fundamentally cannot achieve.”