OpenAI’s Latest Misstep Illuminates Challenges for Chinese AI Models
The recent launch of the GPT-4o AI model by OpenAI was intended to be a groundbreaking moment, but it quickly turned into a crisis for the company. From the resignation of key safety personnel to allegations from Scarlett Johansson regarding unauthorized use of her voice, OpenAI is now in damage control mode. However, a major issue that OpenAI overlooked with GPT-4o is the tainted data used to train its tokenizer, which left the Chinese token library filled with spam phrases related to pornography and gambling. This oversight poses significant risks for the model, including hallucinations, poor performance, and potential misuse.
Upon closer inspection by researchers and industry insiders, it was discovered that over 90% of the longest Chinese tokens in GPT-4o are sourced from spam websites. These tokens include phrases like “free Chinese porn video to watch,” “Beijing betting car,” and “China welfare lottery daily.” This raises concerns about the quality and integrity of the training data used by OpenAI for its AI models, prompting questions about the company’s data sourcing practices.
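The kind of inspection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual method: the function name `longest_chinese_tokens`, the helper `is_cjk`, and the toy vocabulary are all hypothetical, and the real token list would have to be exported from the tokenizer itself, a step not shown here. The sketch keeps only tokens made up entirely of Chinese characters and ranks them by length, so the longest entries can then be read and judged by hand.

```python
def is_cjk(ch: str) -> bool:
    """True if ch falls in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab, top_n=100):
    """Return the top_n longest tokens consisting only of CJK ideographs.

    vocab is any iterable of raw token byte strings (hypothetical input format).
    """
    chinese = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip tokens that are partial multi-byte sequences
        stripped = text.strip()  # ignore leading/trailing whitespace in the token
        if stripped and all(is_cjk(ch) for ch in stripped):
            chinese.append(stripped)
    # Longest first; ties are broken by code point order for stable output.
    return sorted(set(chinese), key=lambda t: (-len(t), t))[:top_n]


# Toy vocabulary, invented for illustration only (not real GPT-4o tokens).
toy_vocab = [
    b"hello",
    " 中国".encode("utf-8"),
    "天天中彩票".encode("utf-8"),
    b"\xe4\xb8",  # incomplete UTF-8 sequence, filtered out
    "你好".encode("utf-8"),
]
print(longest_chinese_tokens(toy_vocab, top_n=2))
```

Ranking by length is useful because a tokenizer only merges long character sequences into a single token when they recur very often in its training text, so the longest tokens hint at which sources dominated that text.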
The presence of such explicit and irrelevant tokens in GPT-4o’s Chinese language library reflects a broader issue with the availability of quality Chinese text datasets for training large language models (LLMs). Unlike Western countries where data sharing is more common, China’s internet landscape is dominated by major tech companies like Tencent and ByteDance, which restrict access to their data for competitive reasons. As a result, AI companies face challenges in obtaining diverse and reliable training data for Chinese language models.
The lack of suitable training data not only impacts the performance of AI models like GPT-4o but also hinders the development of AI products and services for Chinese users. It is crucial for companies like OpenAI to invest in creating and curating their own datasets to ensure the accuracy and reliability of their AI models. Without access to high-quality training data, AI companies will continue to face limitations in delivering effective and culturally relevant products to Chinese-speaking populations.
The recent controversy surrounding OpenAI’s GPT-4o highlights the importance of ethical data sourcing and data cleaning practices in AI development. Addressing the challenges of acquiring quality training data for Chinese language models is essential for advancing the field of artificial intelligence and delivering innovative solutions to diverse global audiences.
Now read the rest of China Document:
China’s National Health Commission is exploring a relaxation of its strict regulations around human genetic data to support the biotech industry. A law enacted in 1998 required research involving genetic data to undergo approval, with additional scrutiny for projects involving foreign institutions. The government is now considering revising the law to streamline the approval process for smaller-scale research and international entities, aiming to facilitate the growth of biotech research in China.
Did you know that Beijing Capital International Airport has been using birds of prey since 2019 to deter other birds that pose a threat to aircraft? These raptors, including Eurasian hobbies and goshawks, are trained to scare off migratory birds and ensure aviation safety at the airport.