[Paper Digest] 2025 Week 04 (Robotics/Embodied AI/LLM)


Contents

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    • Abstract
  • Evolving Deeper LLM Thinking
    • Abstract
  • Kimi k1.5: Scaling Reinforcement Learning with LLMs
    • Abstract
  • Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
    • Abstract
  • VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
    • Abstract
  • MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
    • Abstract
  • FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
    • Abstract
  • SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
    • Abstract
  • Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
    • Abstract
  • GameFactory: Creating New Games with Generative Interactive Videos
    • Abstract
  • Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
    • Abstract
  • UI-TARS: Pioneering Automated GUI Interaction with Native Agents
    • Abstract
  • Improving Video Generation with Human Feedback
    • Abstract
  • PaSa: An LLM Agent for Comprehensive Academic Paper Search
    • Abstract
  • Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
    • Abstract
  • TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
    • Abstract
  • InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
    • Abstract
  • Autonomy-of-Experts Models
    • Abstract
  • Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
    • Abstract
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
    • Abstract
  • Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
    • Abstract
  • Reasoning Language Models: A Blueprint
    • Abstract
  • Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
    • Abstract
  • VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
    • Abstract
  • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
    • Abstract

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • 作者: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12948

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.


Evolving Deeper LLM Thinking

  • 作者: Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen

  • Date: 2025-01-17

  • Paper link: https://arxiv.org/pdf/2501.09891

Abstract

We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
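
To make the generate-recombine-refine loop concrete, here is a minimal editor's sketch of an evolutionary search over candidate responses. The `llm()` and `evaluate()` callables are hypothetical placeholders, and the simple selection/crossover scheme is an assumption; the paper's actual prompts and island-based population model are more elaborate.

```python
import random

def mind_evolution_sketch(task, llm, evaluate, pop_size=8, generations=5):
    """Toy evolutionary search over candidate LLM responses.

    `llm(prompt) -> str` and `evaluate(task, response) -> float` are hypothetical
    stand-ins for the language model and the solution evaluator; higher scores
    are assumed to be better.
    """
    # Initial population: independent samples from the LLM.
    population = [llm(f"Propose a solution to: {task}") for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda r: evaluate(task, r), reverse=True)
        parents = scored[: pop_size // 2]            # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            # Recombination and refinement are themselves LLM calls,
            # not hand-written crossover operators.
            children.append(llm(
                f"Task: {task}\nCandidate A:\n{a}\n\nCandidate B:\n{b}\n"
                "Combine the strengths of both candidates and fix their flaws."
            ))
        population = parents + children
    return max(population, key=lambda r: evaluate(task, r))
```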


Kimi k1.5: Scaling Reinforcement Learning with LLMs

  • 作者: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12599

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI’s o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).


Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

  • 作者: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11425

Abstract

Large Language Models (LLMs) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from it, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, therefore yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model’s ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
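
The splicing step -- keep the prefix before the first self-identified error, then continue with a sibling correct branch from the same tree node -- can be pictured with a toy trajectory constructor. The function name, the (observation, action) layout, and the revision string below are illustrative assumptions, not Agent-R's actual data format.

```python
def build_revision_trajectory(failed_traj, first_error_idx, correct_branch,
                              revision_signal="My previous action was wrong; let me correct it."):
    """Toy version of Agent-R-style trajectory splicing.

    `failed_traj` and `correct_branch` are lists of (observation, action) steps.
    `first_error_idx` is the first step the actor model itself judges to be wrong,
    and `correct_branch` is assumed to share the same parent node in the search
    tree, i.e. it also continues from failed_traj[:first_error_idx].
    """
    good_prefix = failed_traj[:first_error_idx]        # steps before the mistake
    bad_step = [failed_traj[first_error_idx]]          # keep the erroneous action ...
    reflection = [("<reflection>", revision_signal)]   # ... followed by a textual revision
    return good_prefix + bad_step + reflection + correct_branch
```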


VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

  • 作者: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13106

  • Project link: https://github.com/DAMO-NLP-SG/VideoLLaMA3

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) vision-centric alignment stage, which warms up the vision encoder and projector; 2) vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; 4) video-centric fine-tuning, which further improves the model’s capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
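
The similarity-based reduction of video tokens can be illustrated with a small pruning routine: patches that barely change between consecutive frames are dropped. The threshold and the frame-to-frame cosine comparison below are assumptions for illustration, not the released model's exact mechanism.

```python
import torch

def prune_video_tokens(tokens, sim_threshold=0.9):
    """Drop temporally redundant vision tokens.

    tokens: (num_frames, num_patches, dim) tensor of per-frame patch tokens.
    Returns a flat (kept_tokens, dim) tensor containing the first frame plus
    only those later patches whose similarity to the previous frame is low.
    """
    kept = [tokens[0]]                               # always keep the first frame
    prev = tokens[0]
    for frame in tokens[1:]:
        sim = torch.nn.functional.cosine_similarity(frame, prev, dim=-1)
        changed = frame[sim < sim_threshold]         # keep only patches that changed
        if changed.numel() > 0:
            kept.append(changed)
        prev = frame
    return torch.cat(kept, dim=0)
```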


MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

  • 作者: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12380

  • Project link: https://mmvu-benchmark.github.io/

Abstract

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.


FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

  • 作者: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12909

Abstract

Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI’s text-to-video model Sora and our FilmAgent in filmmaking.


SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

  • 作者: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13200

Abstract

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents’ behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT), which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
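
The pooling-and-broadcast idea can be sketched as a small module in which each agent reads a shared pool of all agents' memories before updating its own recurrent state. The layer sizes, single-head attention, and GRU update below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SharedMemoryBroadcast(nn.Module):
    """Minimal sketch of SRMT-style shared memory: individual working memories
    are pooled, broadcast to every agent via attention, and then each agent
    updates its own memory recurrently."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, memories, observations):
        # memories, observations: (num_agents, dim)
        pool = memories.unsqueeze(0)                        # (1, num_agents, dim) shared pool
        queries = observations.unsqueeze(0)
        read, _ = self.attn(queries, pool, pool)            # every agent reads the global pool
        read = read.squeeze(0)
        return self.update(observations + read, memories)   # recurrent memory update

# usage: new_mem = SharedMemoryBroadcast()(torch.zeros(4, 64), torch.randn(4, 64))
```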


Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

  • 作者: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.11873

Abstract

This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $\mathrm{LBL} = N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences. So, the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
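
A minimal sketch of the micro-batch vs. global-batch computation is given below. The top-1 routing and the single `all_reduce` synchronization are simplifications of the communication step described in the abstract, and the function is an editor's illustration, not the authors' implementation.

```python
import torch
import torch.distributed as dist

def load_balancing_loss(gating_probs, expert_ids, num_experts, global_batch=True):
    """Sketch of LBL = N_E * sum_i f_i * p_i for a micro-batch or a global batch.

    gating_probs: (tokens, num_experts) softmax router outputs for this micro-batch.
    expert_ids:   (tokens,) index of the expert each token was routed to (top-1 only).
    The all_reduce call assumes torch.distributed has been initialized.
    """
    p = gating_probs.mean(dim=0)                                  # average gating score per expert
    counts = torch.bincount(expert_ids, minlength=num_experts).float()
    tokens = torch.tensor([expert_ids.numel()], dtype=torch.float, device=counts.device)
    if global_batch and dist.is_available() and dist.is_initialized():
        # Extra communication step: synchronize selection counts across micro-batches,
        # so f_i reflects the global batch instead of a nearly per-sequence one.
        dist.all_reduce(counts)
        dist.all_reduce(tokens)
    f = counts / tokens                                           # selection frequency per expert
    return num_experts * torch.sum(f * p)
```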


GameFactory: Creating New Games with Generative Interactive Videos

  • 作者: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

  • Date: 2025-01-14

  • Paper link: https://arxiv.org/pdf/2501.08325

  • Project link: https://yujiwen.github.io/gamefactory/

Abstract

Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.


Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

  • 作者: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12895

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
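
One TPO-style iteration can be sketched as: score sampled responses with a reward model, verbalize the best-vs-worst contrast into a textual critique, then resample conditioned on that critique. The `llm` and `reward_model` callables and the prompt templates below are generic placeholders, not the released code's API.

```python
def tpo_step(query, responses, reward_model, llm):
    """One simplified iteration of test-time preference optimization.

    The numeric reward is never used for a gradient update; the best/worst
    contrast is turned into a textual critique that conditions the next round.
    """
    scored = sorted(responses, key=lambda r: reward_model(query, r))
    worst, best = scored[0], scored[-1]
    critique = llm(
        f"Question: {query}\n\nChosen answer:\n{best}\n\nRejected answer:\n{worst}\n\n"
        "Explain concretely what makes the chosen answer better and how it could be improved further."
    )
    # Resample new candidates conditioned on the textual reward (the critique).
    return [llm(f"Question: {query}\n\nFeedback to follow:\n{critique}\n\nAnswer:")
            for _ in range(len(responses))]
```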


UI-TARS: Pioneering Automated GUI Interaction with Native Agents

  • 作者: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12326

Abstract

This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc.; (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.


Improving Video Generation with Human Feedback

  • 作者: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13918

Abstract

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.


PaSa: An LLM Agent for Comprehensive Academic Paper Search

  • 作者: Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

  • Date: 2025-01-17

  • Paper link: https://arxiv.org/pdf/2501.10120

Abstract

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, chatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.


Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

  • 作者: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13629

Abstract

We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model’s varying sensitivity to the compression of the K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model’s representation capacity with minimal impact on inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark, AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.
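
The practical payoff of compressing K more aggressively than V is easiest to see in KV-cache arithmetic; the snippet below is an editor's illustration with made-up head counts and dimensions, not Sigma's actual configuration.

```python
def kv_cache_bytes(layers, seq_len, n_heads_k, head_dim_k, n_heads_v, head_dim_v, dtype_bytes=2):
    """Back-of-the-envelope KV-cache size when K and V are compressed differently,
    which is the core idea behind DiffQKV attention."""
    k_bytes = layers * seq_len * n_heads_k * head_dim_k * dtype_bytes
    v_bytes = layers * seq_len * n_heads_v * head_dim_v * dtype_bytes
    return k_bytes + v_bytes

# GQA-style baseline: K and V share the same 4 heads of width 128.
baseline = kv_cache_bytes(32, 32_768, 4, 128, 4, 128)
# DiffQKV-style: K is compressed harder (fewer, narrower heads) than V,
# reflecting the reported lower sensitivity to K compression.
diffqkv = kv_cache_bytes(32, 32_768, 2, 64, 4, 128)
print(f"baseline {baseline / 2**20:.0f} MiB -> diffqkv {diffqkv / 2**20:.0f} MiB")
```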


TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

  • 作者: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12224

Abstract

We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/


InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

  • 作者: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12368

Abstract

Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer


Autonomy-of-Experts Models

  • 作者: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.13074

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router’s decision-making and the experts’ execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
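
A rough sketch of the router-free selection: every expert evaluates a cheap low-rank pre-activation, experts are ranked by that activation's norm, and only the top-k complete their forward pass. The dimensions, the factorization split, and the GELU expert below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AoELayer(nn.Module):
    """Sketch of an Autonomy-of-Experts layer: no router; experts rank themselves
    by the norm of a low-rank pre-activation and the losers abort."""
    def __init__(self, dim=256, rank=32, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The first expert weight is factorized (dim -> rank -> hidden); only the
        # cheap (dim -> rank) factor is evaluated for every expert.
        self.down = nn.ModuleList([nn.Linear(dim, rank, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([
            nn.Sequential(nn.Linear(rank, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                       # x: (tokens, dim)
        pre = torch.stack([d(x) for d in self.down], dim=1)     # (tokens, experts, rank)
        norms = pre.norm(dim=-1)                                 # self-evaluation signal
        top = norms.topk(self.top_k, dim=-1).indices             # (tokens, top_k)
        out = torch.zeros_like(x)
        for e, up in enumerate(self.up):
            mask = (top == e).any(dim=-1)                         # tokens whose top-k includes expert e
            if mask.any():
                out[mask] += up(pre[mask][:, e])                  # non-selected experts abort here
        return out
```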


Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

  • 作者: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo

  • Date: 2025-01-21

  • Paper link: https://arxiv.org/pdf/2501.12202

Abstract

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including open-source and closed-source models, in geometry details, condition alignment, texture quality, etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2


Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

  • 作者: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

  • Date: 2025-01-23

  • Paper link: https://arxiv.org/pdf/2501.13926

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT


Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

  • 作者: Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego

  • Date: 2025-01-16

  • Paper link: https://arxiv.org/pdf/2501.09775

Abstract

One of the most widely used methods to evaluate LLMs is the Multiple Choice Question (MCQ) test. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples called few shots can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
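
The confidence measure discussed here is simply the probability mass the model assigns to the chosen option letter. A generic way to read it off a causal LM is sketched below; the prompt template and whichever model you plug in are the editor's assumptions, not the paper's exact evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_confidence(model, tokenizer, prompt, options=("A", "B", "C", "D")):
    """Probability the model assigns to each option letter at the next-token position."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]            # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    option_ids = [tokenizer(f" {o}", add_special_tokens=False).input_ids[-1] for o in options]
    return {o: probs[i].item() for o, i in zip(options, option_ids)}

# Example usage (any causal LM works as a stand-in for the seven models studied):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# direct = option_confidence(lm, tok, question + "\nAnswer:")
# with_cot = option_confidence(lm, tok, question + "\nReasoning: " + reasoning + "\nAnswer:")
```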


Reasoning Language Models: A Blueprint

  • 作者: Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11223

Abstract

Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), and supervision schemes (Output-Based and Process-Based Supervision). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we outline how RLMs can integrate with a broader LLM ecosystem, including tools and databases. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between “rich AI” and “poor AI” by lowering barriers to RLM development and experimentation.


Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

  • 作者: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji

  • Date: 2025-01-20

  • Paper link: https://arxiv.org/pdf/2501.11733

Abstract

Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents -- Perceptor, Operator, Action Reflector, and Notetaker -- which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.


VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

  • 作者: Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

  • Date: 2025-01-16

  • Paper link: https://arxiv.org/pdf/2501.09781

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.


O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

  • 作者: Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao

  • Date: 2025-01-22

  • Paper link: https://arxiv.org/pdf/2501.12570

Abstract

Recently, long-thought reasoning LLMs, such as OpenAI’s O1, have adopted extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model’s problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM’s baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner
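
The abstract suggests a reward that trades the length saved relative to the pre-sampled baseline against any accuracy drop. The toy reward below captures that intuition only; its functional form and the weight `lam` are the editor's guesses, not the paper's actual objective.

```python
def length_harmonizing_reward(pred_len, ref_len, pred_correct, ref_accuracy, lam=2.0):
    """Sketch of an O1-Pruner-style RL reward.

    pred_len / ref_len:    token counts of the new response and the pre-sampled baseline.
    pred_correct:          whether the new response is correct (bool).
    ref_accuracy:          baseline accuracy on this problem estimated by pre-sampling.
    """
    length_gain = ref_len / max(pred_len, 1) - 1.0             # > 0 when shorter than the baseline
    accuracy_term = (1.0 if pred_correct else 0.0) - ref_accuracy
    return length_gain + lam * accuracy_term                    # penalize accuracy loss under the constraint
```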

