AI News – 2025-08-09 (Morning Edition)

Keywords: GPT-5, AI Self-Improvement, Embodied Intelligence, Multimodal Models, Large Language Models, Reinforcement Learning, AI Agents, GPT-5 Performance Improvements, Genie Envisioner Robot Platform, LLM Hiring Assessment Bias, Qwen3 Long Context, CompassVerifier Answer Verification

🔥 Focus

GPT-5 Release: Productization and Performance Enhancement : OpenAI has officially released GPT-5, the latest iteration of its flagship model. The release focuses on user experience: a real-time router automatically sends each request to either a fast base model or a deeper reasoning model, balancing speed and intelligence. GPT-5 reportedly reduces hallucinations, follows instructions more reliably, and programs better, setting new records on multiple benchmarks. Sam Altman likens it to a “Retina display,” emphasizing its practicality as a “Ph.D.-level” AI rather than a breakthrough in the limits of intelligence. Although it is not AGI, its faster inference and lower running costs are expected to drive wider AI adoption. (Source: MIT Technology Review)

GPT-5 is here. Now what?

Progress in AI Self-Improvement Research : Meta CEO Mark Zuckerberg stated that the company is committed to building AI systems capable of self-improvement. AI already shows self-improvement in several forms, such as optimizing its own performance through automatic data augmentation, model architecture search, and reinforcement learning. The trend suggests that future AI systems may learn autonomously and exceed human-set performance limits, which proponents see as a key path toward more capable AI. (Source: MIT Technology Review)

The Download: how AI is improving itself, and hidden greenhouse gases

Genie Envisioner: Unified Robot Manipulation World Model Platform : Researchers have introduced Genie Envisioner (GE), a unified world foundation platform for robot manipulation. GE-Base is an instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real robot interactions. GE-Act maps latent representations to executable action trajectories, enabling precise and generalizable policy inference. GE-Sim, an action-conditioned neural simulator, supports closed-loop policy development. This platform is expected to provide a scalable and practical foundation for instruction-driven general embodied intelligence. (Source: HuggingFace Daily Papers)

ISEval: An Evaluation Framework for Large Multimodal Models’ Ability to Identify Erroneous Inputs : To test whether Large Multimodal Models (LMMs) can proactively identify erroneous inputs, researchers proposed the ISEval evaluation framework, which covers seven categories of flawed premises and three evaluation metrics. The study found that most LMMs struggle to detect flawed text without explicit guidance and that performance varies by error type: the models are proficient at identifying logical fallacies but perform poorly on superficial language errors and certain conditional flaws. This highlights the need for LMMs to actively check the validity of their inputs. (Source: HuggingFace Daily Papers)

Research on Language Bias in LLM Hiring Assessments : A study introduced a benchmark to evaluate how Large Language Models (LLMs) respond to linguistically discriminatory markers in hiring assessments. Using carefully designed interview simulations, the study found that LLMs systematically penalize certain language patterns, especially hedging language, even when content quality is identical. This reveals demographic biases in automated evaluation systems and provides a foundational framework for detecting and measuring language discrimination in AI systems, with broad implications for the fairness of automated decision-making. (Source: HuggingFace Daily Papers)

Qwen3 Series Models Support Million-Token Ultra-Long Context : Alibaba Cloud’s Qwen3-30B-A3B-2507 and Qwen3-235B-A22B-2507 models now support an ultra-long context of up to 1 million tokens. This is achieved through Dual Chunk Attention (DCA) and MInference sparse attention, which improve generation quality and make inference on near-million-token sequences up to three times faster. The change expands the potential of LLMs for complex tasks such as long documents and codebases, and the models are compatible with vLLM and SGLang for efficient deployment. (Source: Alibaba_Qwen)

Anthropic Claude Opus 4.1 and Sonnet 4 Upgrades : Anthropic has released Claude Opus 4.1 and Sonnet 4, with a focus on agentic tasks, real-world coding, and reasoning. The new models feature a “deep thought” function that switches flexibly between instant responses and deep reasoning, compressing complex tasks that would take hours into minutes. This strengthens Claude’s position in multi-model collaboration scenarios, particularly in complex code review and advanced reasoning tasks. (Source: dl_weekly)

Microsoft Launches Copilot 3D Feature : Microsoft has launched the free Copilot 3D feature, capable of converting 2D images into GLB format 3D models, compatible with various 3D viewers, design tools, and game engines. While currently less effective for animal and human images, this feature provides users with convenient 2D to 3D conversion capabilities, expected to play a role in product design, virtual reality, and other fields, further lowering the barrier to 3D content creation. (Source: The Verge)

HuggingFace Accelerate Releases Multi-GPU Training Guide : HuggingFace, in collaboration with Axolotl, has released the Accelerate ND-Parallel guide, which aims to simplify combining parallelism strategies in multi-GPU training. The guide details Data Parallel (DP), Fully Sharded Data Parallel (FSDP), Tensor Parallel (TP), and Context Parallel (CP), and provides examples of mixed parallel configurations, helping developers optimize memory usage and throughput when training large models and manage communication overhead in multi-node training. (Source: HuggingFace Blog)
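
One constraint behind any mixed configuration is that the per-strategy degrees must multiply to the total number of GPUs. The sketch below only illustrates that arithmetic; the function name and the keys `dp`, `fsdp`, `tp`, and `cp` are hypothetical and are not the Accelerate or Axolotl API.

```python
from math import prod


def check_parallel_layout(world_size: int, degrees: dict[str, int]) -> dict[str, int]:
    """Check that the parallelism degrees multiply to the GPU count.

    `degrees` maps a strategy label (hypothetical names, not a library API)
    to the number of ways that strategy splits the work.
    """
    for name, size in degrees.items():
        if size < 1:
            raise ValueError(f"{name} degree must be >= 1, got {size}")
    total = prod(degrees.values())
    if total != world_size:
        raise ValueError(
            f"degrees multiply to {total}, but world_size is {world_size}"
        )
    return degrees


# Assumed example: 2 nodes x 8 GPUs = 16 GPUs, split 2 x 2 x 2 x 2.
layout = check_parallel_layout(16, {"dp": 2, "fsdp": 2, "tp": 2, "cp": 2})
```

A layout whose degrees multiply to anything other than the GPU count raises `ValueError`, which is the kind of mistake a mixed configuration makes easy.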

🧰 Tools

OpenAI Codex CLI: Local Coding Agent in Terminal : OpenAI has released Codex CLI, a lightweight coding agent that runs locally in the terminal. Users can install it via npm install -g @openai/codex or brew install codex. Signing in with a ChatGPT Plus/Pro/Team account gives access to the latest models, such as GPT-5, at no extra charge; alternatively, an API key enables pay-as-you-go use. Codex CLI offers several sandbox modes, including read/write and read-only, and supports custom configuration, aiming to provide developers with efficient and secure local programming assistance. (Source: openai/codex – GitHub Trending)

HuggingFace AI Sheets: No-Code Dataset Tool : HuggingFace has launched AI Sheets, an open-source, no-code tool for building, enriching, and transforming datasets using AI models. The tool features a spreadsheet-like interface and supports local deployment or running on the Hugging Face Hub. Users can leverage thousands of open models (including gpt-oss) for model comparison, prompt optimization, data cleaning, classification, analysis, and synthetic data generation. It allows iterative improvement of AI-generated results through manual editing and thumbs-up feedback, and can be exported to the Hub. (Source: HuggingFace Blog)

Google Agent Development Kit (ADK) and Examples : Google has released the Agent Development Kit (ADK), an open-source, code-first Python toolkit for building, evaluating, and deploying complex AI Agents. ADK supports a rich tool ecosystem, modular multi-agent systems, and flexible deployment. Its sample library adk-samples provides various Agent examples, from conversational bots to multi-agent workflows, aiming to accelerate the Agent development process and integrate with the A2A protocol for remote Agent communication. (Source: google/adk-python – GitHub Trending & google/adk-samples – GitHub Trending)

Qwen Code CLI: Free Code Execution Tool : Alibaba Cloud’s Qwen Code CLI offers 2000 free code runs daily, easily launched via the npx @qwen-code/qwen-code@latest command. This tool supports Qwen OAuth and aims to provide developers with a convenient and efficient code writing and testing experience. The Qwen team stated they will continue to optimize this CLI tool and the Qwen-Coder model, striving to achieve Claude Code’s performance level while remaining open source. (Source: Alibaba_Qwen)

📚 Learning

OpenAI Python Library Update : The official OpenAI Python library provides convenient access to the OpenAI REST API, supporting Python 3.8+. The library includes type definitions for all request parameters and response fields, and offers both synchronous and asynchronous clients. Recent updates include beta support for the Realtime API, used for building low-latency, multi-modal conversational experiences, as well as detailed explanations of webhook validation, error handling, request IDs, and retry mechanisms, enhancing development efficiency and robustness. (Source: openai/openai-python – GitHub Trending)

AI Agent Curated List : e2b-dev/awesome-ai-agents is a GitHub repository that collects numerous examples and resources for AI autonomous Agents. This list aims to provide developers with a centralized resource library to understand and learn about different types of AI Agents, covering various application scenarios from simple to complex, serving as important learning material for exploring and building AI Agents. (Source: e2b-dev/awesome-ai-agents – GitHub Trending)

MeanFlow: A New Paradigm for One-Step Generative Diffusion Models : A Scientific Space article presents MeanFlow, a method it describes as poised to become the standard for accelerating generation in diffusion models. The method targets one-step generation by modeling “average velocity” instead of “instantaneous velocity,” overcoming the slow sampling of traditional diffusion models. MeanFlow rests on clear mathematical principles, can be trained from scratch with a single objective, and its single-step generation performance approaches SOTA, offering a new theoretical and practical direction for accelerating generative AI models. (Source: WeChat)
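
The relationship between the two velocities can be written out as a short derivation. The notation here is an assumption for illustration and is not taken from the article: $z_t$ is the state at time $t$, $v$ the instantaneous velocity, and $u$ the average velocity over the interval $[r, t]$.

```latex
% Average velocity over [r, t], defined from the instantaneous velocity v
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Multiply by (t - r) and differentiate both sides with respect to t
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)

% Rearranged: a training target for u that needs no integral
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% The displacement over [r, t] is (t - r) u, so a single step gives
z_r = z_t - (t - r)\, u(z_t, r, t)
```

Under these assumptions, a network that predicts $u$ can jump from $t$ to $r$ in one evaluation, whereas a network that predicts $v$ must integrate over many small steps.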

Full Lifecycle Optimization of Long-Context KV Cache : Microsoft Research Asia shared its KV Cache full lifecycle optimization practices, aiming to address latency and storage challenges in long-context Large Language Model inference. Through the SCBench benchmark and methods like MInference and RetrievalAttention, it significantly reduces Prefilling phase latency and alleviates KV Cache memory pressure. The research emphasizes system-level cross-request optimization and Prefix Caching reuse, providing optimization solutions for the scalability and cost-effectiveness of long-context LLM inference. (Source: WeChat)
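
The idea behind Prefix Caching reuse can be shown with a toy sketch: when a new request shares a token prefix with an earlier one, only the remaining tokens need prefilling. The class below is a hypothetical illustration, not the method from the research. It stores a placeholder string instead of real KV tensors and keeps every prefix as a key, which a real system would replace with block-level hashing.

```python
class PrefixCache:
    """Toy prefix cache mapping token prefixes to a stand-in for KV state."""

    def __init__(self) -> None:
        self._store: dict[tuple[int, ...], str] = {}

    def longest_prefix(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                return end
        return 0

    def prefill(self, tokens: list[int]) -> tuple[int, int]:
        """Simulate prefill; return (tokens reused, tokens computed)."""
        reused = self.longest_prefix(tokens)
        # Only the suffix after the cached prefix needs fresh computation.
        for end in range(reused + 1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = f"kv[:{end}]"
        return reused, len(tokens) - reused


cache = PrefixCache()
first = cache.prefill([1, 2, 3, 4])      # nothing cached yet
second = cache.prefill([1, 2, 3, 9, 9])  # shares the prefix [1, 2, 3]
```

Here the second request reuses three tokens and computes two, which is the latency saving that cross-request reuse aims for.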

Reinforcement Learning Framework FR3E Enhances LLM Exploration Capability : ByteDance, MAP, and the University of Manchester jointly proposed FR3E (First Return, Entropy-Eliciting Explore), a new structured exploration framework designed to address the insufficient exploration problem of LLMs in reinforcement learning. FR3E identifies high-uncertainty tokens in reasoning trajectories to guide diverse rollouts, systematically reconstructing the LLM exploration mechanism to achieve a dynamic balance between exploitation and exploration, significantly outperforming existing methods on multiple mathematical reasoning benchmarks. (Source: WeChat)
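
One plausible way to locate “high-uncertainty tokens” is to rank positions in a trajectory by the Shannon entropy of the model’s next-token distribution. The sketch below assumes that criterion and uses hypothetical function names; FR3E’s actual selection rule may differ.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_positions(step_distributions: list[list[float]], top_k: int) -> list[int]:
    """Return the indices of the `top_k` most uncertain steps, in order."""
    entropies = [token_entropy(p) for p in step_distributions]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:top_k])


# Assumed toy trajectory: four steps over a four-token vocabulary.
trajectory = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.10, 0.00, 0.00],  # fairly confident
    [0.40, 0.30, 0.20, 0.10],  # uncertain
]
branch_points = high_entropy_positions(trajectory, top_k=2)
```

The selected positions would then serve as branch points from which additional rollouts are sampled.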

Research on the Correlation between Maximal Values in Self-Attention Mechanism and Contextual Understanding : A new study from ICML 2025 reveals that highly concentrated maximal values exist in the query (Q) and key (K) representations of Large Language Models’ self-attention mechanism, and that these values are crucial for contextual knowledge understanding. The study found that the phenomenon is common in models using Rotary Position Embedding (RoPE) and appears in early layers. Disrupting these maximal values sharply degrades model performance on tasks requiring contextual understanding, which suggests new directions for LLM design, optimization, and quantization. (Source: WeChat)

C3 Benchmark: Chinese-English Bilingual Speech Dialogue Model Test Benchmark : Peking University and Tencent jointly released C3 Benchmark, the first comprehensive evaluation benchmark for spoken dialogue models that examines complex phenomena such as pauses, polyphonic characters, homophones, stress, syntactic ambiguity, and polysemy in both Chinese and English. The benchmark includes 1079 real-world scenarios and 1586 audio-text pairs, aiming to directly address the fatal weaknesses of current speech dialogue models and promote their progress in understanding human daily conversations. (Source: WeChat)

Chemma: Large Language Model for Organic Chemical Synthesis : Shanghai Jiao Tong University’s AI for Science team released Chemma (Baiyulan Chemical Synthesis Large Model), described as the first chemical large language model to accelerate the entire organic synthesis process. Chemma surpasses the best existing methods in single-step and multi-step retrosynthesis, yield and selectivity prediction, and reaction optimization, relying solely on chemical knowledge understanding and reasoning, without quantum chemistry calculations. Its “Co-Chemist” human-computer collaborative active learning framework has been validated in real reactions, providing a new paradigm for chemical discovery. (Source: WeChat)

Intern-Robotics: Shanghai AI Lab’s Embodied Full-Stack Engine : Shanghai AI Lab released its embodied full-stack engine, Intern-Robotics, aiming to bring about the “ChatGPT moment” for embodied intelligence. The engine is an open, shared infrastructure focused on generalization across robot bodies, scenes, and tasks, with an emphasis on task success rates approaching 100%. The team is working to solve data scarcity and to gradually achieve zero-shot generalization through a “Real to Sim to Real” technical route and real-world reinforcement learning, accelerating the deployment of embodied intelligence. (Source: WeChat)

SQLM: A Self-Questioning Framework for Evolving AI Reasoning : A team from Carnegie Mellon University proposed SQLM, a framework in which a model improves its reasoning by posing and answering its own questions, without external data. The framework has two roles, a proposer and a solver, both trained via reinforcement learning to maximize expected rewards. SQLM significantly improves model accuracy on arithmetic, algebra, and programming tasks, giving Large Language Models a scalable, self-sustaining way to improve when high-quality human-annotated data is unavailable. (Source: WeChat)

CompassVerifier: AI Answer Verification Model and Evaluation Dataset : Shanghai AI Lab and the University of Macau jointly released CompassVerifier, a general answer verification model, and VerifierBench, an evaluation dataset, aiming to address the issue of large models’ rapidly advancing training capabilities but lagging answer verification abilities. CompassVerifier is a lightweight yet powerful multi-domain general verifier, optimized based on the Qwen series models, capable of achieving verification accuracy surpassing general large models in mathematics, knowledge, and scientific reasoning. It can also serve as a reinforcement learning reward model, providing precise feedback for LLM iterative optimization. (Source: WeChat)

CoAct-1: Computer-Using Agents with Coding as Actions : Researchers proposed CoAct-1, a multi-agent system that adds coding to the set of available actions, aiming to solve the efficiency and reliability problems of GUI-operating agents in complex tasks. CoAct-1’s Orchestrator dynamically delegates subtasks to either a GUI Operator or a Programmer agent (which can write and execute Python/Bash scripts), thereby bypassing inefficient GUI operations. The method achieved SOTA success rates on the OSWorld benchmark and significantly improved efficiency, offering a more capable path toward general computer automation. (Source: HuggingFace Daily Papers)

ReMoMask: A New Method for High-Quality Game 3D Motion Generation : Peking University proposed ReMoMask, a retrieval-augmented Text-to-Motion framework designed to generate fluid, realistic 3D motions from a single instruction. ReMoMask integrates a momentum bidirectional text-motion model, a semantic spatio-temporal attention mechanism, and RAG-classifier-free guidance to efficiently generate temporally coherent motions. The method set new SOTA results on standard benchmarks such as HumanML3D and KIT-ML and could reshape game and animation production workflows. (Source: WeChat)

WebAgents Survey: Large Models Empowering Web Automation : Researchers from The Hong Kong Polytechnic University published the first comprehensive survey on WebAgents, summarizing the research progress of large models empowering AI Agents to achieve next-generation Web automation. The survey categorizes representative WebAgents methods from perspectives such as architecture (perception, planning and reasoning, execution), training (data, policy), and trustworthiness (safety, privacy, generalization), and discusses future research directions like fairness, interpretability, datasets, and personalized WebAgents, providing guidance for building smarter and safer Web automation systems. (Source: WeChat)

InfiAlign: An Alignment Framework for LLM Reasoning Capabilities : InfiAlign is a scalable and sample-efficient post-training framework that aligns LLMs by combining SFT and DPO to enhance reasoning capabilities. At its core is a powerful data selection pipeline that automatically filters high-quality alignment data from open-source reasoning datasets. InfiAlign achieved performance comparable to DeepSeek-R1-Distill-Qwen-7B on the Qwen2.5-Math-7B-Base model, but using only about 12% of the training data, and significantly improved performance on mathematical reasoning tasks, offering a practical solution for aligning large reasoning models. (Source: HuggingFace Daily Papers)
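
For readers unfamiliar with the DPO half of that recipe, the standard DPO loss for a single preference pair can be computed as below. This is a sketch of textbook DPO, not InfiAlign’s exact objective, which the summary does not specify. The function name, argument names, and the default `beta` are assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does.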

💼 Business

OpenAI Employee Stock Option Monetization Plan to Prevent Talent Poaching : To counter talent outflow, OpenAI has launched a new employee stock option monetization plan, cashing out at a $500 billion valuation, aiming to retain talent with substantial financial incentives. This move is expected to push OpenAI’s valuation to new highs. Concurrently, ChatGPT’s weekly active users have reached 700 million, paid enterprise users have grown to 5 million, and annual recurring revenue is projected to exceed $20 billion, indicating strong development in OpenAI’s product and commercialization efforts. (Source: QbitAI)

AWS Builds the Largest AI Model Aggregation Platform : Amazon Web Services (AWS) announced that OpenAI’s gpt-oss model is now accessible via Amazon Bedrock and Amazon SageMaker for the first time, further enriching its model ecosystem under the “Choice Matters” strategy. AWS now offers over 400 mainstream commercial and open-source large models, aiming to enable enterprises to choose the most suitable model based on performance, cost, and task requirements, rather than pursuing a single “strongest” model, thereby promoting multi-model synergy. (Source: QbitAI)

Ant Group Invests in Embodied Intelligence Dexterous Hand Company : Ant Group led a multi-hundred-million yuan angel round investment in Lingxin Qiaoshou, an embodied intelligence company. Lingxin Qiaoshou is the world’s only company to achieve mass production of thousands of high-DOF dexterous hands, holding an 80% market share. Its Linker Hand series dexterous hands boast high degrees of freedom, multi-sensor systems, and cost advantages, already deployed in industrial, medical, and other scenarios. This financing will be used for technology reserves and the construction of data collection facilities, accelerating the deployment of dexterous hands in practical applications. (Source: QbitAI)

🌟 Community

GPT-5 User Experience Polarized : User feedback on GPT-5 has been mixed since its release. Some users praised significant improvements in programming and complex reasoning, finding code generation cleaner and more precise and long-context handling strong. Others were disappointed by a decline in personality, creative writing, and emotional support, calling the model “boring” and “soulless.” They also reported that the model routing mechanism makes the experience unstable, and some canceled their subscriptions. (Source: Reddit r/ChatGPT & Reddit r/LocalLLaMA)

AI’s Application and Controversy in Parenting : Working parents are increasingly using AI tools like ChatGPT as “co-parents,” relying on them to plan meals, optimize bedtime routines, and even provide emotional support. The non-judgmental space AI offers eases parents’ psychological burdens. The practice is also controversial: concerns include inaccurate advice, privacy risks (such as the ChatGPT data leak incident), interpersonal isolation from over-reliance on AI, and environmental impact. (Source: 36Kr)

Airbnb User Compensation Incident Due to AI-Forged Images : In an incident at Airbnb, a host used AI-forged images to claim compensation from a guest, highlighting the risks of AI in customer service. The AI customer service failed to identify the AI-generated images, and the guest was wrongly found liable for compensation. Although OpenAI previously launched an image detector, AI’s ability to detect AI-generated content remains limited, especially against “partial forgery” techniques. The incident raises concerns about the reliability of AI content detection tools and about how well C2C platforms can cope with deepfake content. (Source: 36Kr)

Silicon Valley AI Bigwigs Building Doomsday Bunkers Sparks Heated Discussion : Silicon Valley AI leaders such as Mark Zuckerberg and Sam Altman are reportedly building or already own doomsday shelters, sparking public concern about the future development and potential risks of AI. They deny any AI-related motive, and the shelters are generally interpreted as preparation for emergencies such as pandemics, cyber warfare, and climate disasters. Even so, community discussions speculate about whether those who understand AI best have seen signs unknown to the general public, and whether AI development has brought unpredictable risks. (Source: QbitAI)

Kaggle AI Chess Championship: o3 Crowned Champion : In the final of the inaugural Google Kaggle AI Chess Championship, OpenAI’s o3 swept Elon Musk’s Grok 4 with a 4-0 victory, claiming the championship. This match was seen as a “proxy war” between OpenAI and xAI, aiming to test large models’ critical thinking, strategic planning, and on-the-spot adaptability. Although Grok 4 had strong momentum previously, it made frequent errors in the final, while o3 demonstrated a systematically stable strategy, remaining undefeated throughout the tournament. (Source: WeChat)

Discussion: AI Enters the “Trough of Disillusionment” : Extensive discussions have emerged on social media, suggesting that AI has officially entered the “Trough of Disillusionment,” especially after the GPT-5 release. Users point out that AI’s limitations have not been effectively overcome, and the benefits from increased model scale and computing power are diminishing. This view suggests that AI’s progress has become “less obvious,” primarily manifesting in expert domains rather than at a level perceptible to ordinary users, indicating that AI development might be entering a plateau phase, requiring entirely new architectural breakthroughs. (Source: Reddit r/ArtificialInteligence)

💡 Other

Docker Warns of MCP Toolchain Security Risks : Docker has warned that AI-driven development toolchains built on the Model Context Protocol (MCP) carry serious security vulnerabilities, including credential leakage, unauthorized file access, and remote code execution, and that real-world cases have already occurred. These tools embed LLMs into development environments and grant them autonomous operational permissions without adequate isolation or supervision. Docker recommends against installing MCP servers from npm, advises using signed containers instead, and emphasizes the importance of container isolation and zero-trust networking. (Source: WeChat)

Huawei HarmonyOS Application Developer Incentive Program 2025 : Huawei announced that the number of HarmonyOS 5 devices has exceeded ten million and launched the “HarmonyOS Application Developer Incentive Program 2025,” investing over a hundred million yuan in subsidies, with individual developers potentially receiving up to 6 million yuan in rewards. This program aims to accelerate the development of the HarmonyOS ecosystem and attract developers to create applications for AI and multi-device deployment, achieving “develop once, deploy everywhere.” Huawei provides full-stack development support, including technical empowerment, rapid testing, efficient listing, and operation, striving to build a robust developer ecosystem. (Source: WeChat)

Domestic AI Supernode Server Yuan Nao SD200 Released : Inspur Information released its supernode AI server, “Yuan Nao SD200,” designed to address the computing power challenges of running trillion-parameter large models. The server uses an in-house multi-host, low-latency memory semantic communication architecture that can aggregate 64 local GPU chips, providing up to 4TB of unified video memory and 64GB of unified memory and supporting trillion-parameter models with ultra-long sequences. Tests show that the SD200 scales computing power efficiently on models like DeepSeek R1, supporting AI for Science and industrial applications. (Source: WeChat)
