Platforms, Frameworks, AI Compilers, and Chip Architectures in the Wave of Large Models

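On June 9, the 2023 BAAI Conference will bring together explorers and practitioners in this field, along with everyone who cares about the science of intelligence, to raise the curtain on the stage of the future. Distinguished guests include Turing Award laureates Yann LeCun, Geoffrey Hinton, and Joseph Sifakis; Nobel laureate Arieh Warshel; Max Tegmark, founder of the Future of Life Institute; David Baker, 2021 Breakthrough Prize laureate; Academician Nanning Zheng (郑南宁), winner of the 2022 Wu Wenjun Highest Achievement Award; and Academician Bo Zhang (张钹) of the Chinese Academy of Sciences. Online registration is now officially open, and the conference will also be streamed live online worldwide.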

Countdown to the BAAI Conference: 9 days

AI Systems Forum | Morning of June 10

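Rapid progress in AI system frameworks and in the underlying chip hardware and software platforms has given large AI models strong support. Innovative architectures, optimized performance, and robust platform support have brought marked improvements in the computing speed, efficiency, scalability, and usability of large models. Looking ahead, the further development of large models will place higher demands on computing power, which will push researchers and developers in AI systems to re-examine and respond to the pressure coming from platforms and infrastructure.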

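This forum is a technical exchange on AI systems. We are honored to have leaders and experts from world-renowned teams, including Meta, Google, NVIDIA, and Anyscale, as well as BAAI and the Institute of Computing Technology of the Chinese Academy of Sciences in China, gather in person to share the latest results in the field. Topics will cover important framework technologies such as PyTorch 2.0 and Ray, along with next-generation AI compilers and other topics that will shape the future of large models and chips. As the era of large models arrives, we will discuss together the current state and challenges of AI systems and chips. This will be an AI systems technology event well worth looking forward to!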

Forum Agenda

Forum Chair | Host

Yonghua Lin (林咏华), Vice President and Chief Engineer, BAAI

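Yonghua Lin is Vice President and Chief Engineer of the Beijing Academy of Artificial Intelligence (BAAI), where she oversees the large model research center, AI systems and foundational software research, industry ecosystem partnerships, and other key areas. She is a member of the IEEE Women in Engineering Asia-Pacific leadership group and the founder of the IEEE Women in Engineering Beijing chapter. She previously served as Director of IBM Research - China and was an IBM Distinguished Engineer, leading AI systems innovation globally within IBM. She has nearly 20 years of research experience in system architecture, cloud computing, AI systems, and computer vision. She holds more than 50 patents worldwide and has received ACM/IEEE best paper awards multiple times. In 2019 she was named to the Forbes China list of 50 leading women in technology.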

Talks and Speakers (in order of presentation)

1. Making the Most of ML Hardware

Abstract: Not all numerical computations are born equal. The gap between peak performance numbers and the reality experienced by ML researchers is widening. Model diversity and complexity are increasing, from an idealized batched matrix multiplication to the practice of serving transformer-based generative models. Despite decades of progress, peak performance is only achieved at the expense of a dramatic loss of programmability. Compilers are obviously part of the answer. But what compilers? Today's ML compilers offer “performance improvements”, but this is not enough: we want the best possible performance for a given hardware and model architecture. How do we achieve this without sacrificing the portability and reuse benefits of compiler infrastructure? We will discuss some of these issues and sketch a path to accelerating ML research, including model architecture search, with more “controllable compilers”.

Albert Cohen, Research Scientist, Google

Albert Cohen is a research scientist at Google. An alumnus of École Normale Supérieure de Lyon and the University of Versailles, he was a research scientist at Inria from 2000 to 2018, a visiting scholar at the University of Illinois, an invited professor at Philips Research, and a visiting scientist at Facebook Artificial Intelligence Research. Albert works on parallelizing, optimizing, and machine learning compilers, and on dataflow and synchronous programming languages, with applications to high-performance computing, artificial intelligence, and reactive control.

2. PyTorch 2.0

Abstract: PyTorch 2.0 uses compilers to deliver faster training and inference without sacrificing the flexibility and ease of use of PyTorch. This talk will provide an overview of the technology stack behind the new torch.compile() API, discussing the key features of PyTorch 2.0, including its full backward compatibility and 43% faster model training. We will introduce various stack components, such as TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor, and how they work together to streamline the model development process. Attendees will gain a deeper understanding of the PyTorch 2.0 architecture and the benefits of incorporating compiler technologies into deep learning frameworks.

Peng Wu (吴芃), Engineering Manager, PyTorch, Meta

Dr. Peng Wu is the engineering manager of the PyTorch Compiler team at Meta. She considers herself to have explored “widely” in the compilers and programming systems domain. Dr. Wu spent over a decade at IBM on a broad research portfolio ranging from accelerating dynamic languages to unleashing the power of novel hardware. She then founded the Programming Technologies Lab at Huawei and led its growth for five years. At Meta, she has supported the team’s search for proper compiler solutions for PyTorch over the last three years. This effort resulted in a step-function change in PyTorch’s approach to execution efficiency, which led to the release of PyTorch 2.0 in March 2023.

3. PyTorch 2.0 - custom-built for the era of Generative AI

Abstract: PyTorch is the open-source AI framework of choice for leading-edge AI research across the world. The recent release of PyTorch 2.0 represents a jump in capabilities that are critical to enabling the era of Generative AI. The long-established, developer-friendly user experience of PyTorch is complemented by a combination of three new features that make PyTorch 2.0 uniquely optimized for training and inference of Transformer-based large language models and diffusion models: Accelerated Transformers delivers the best training and inference performance with optimized fused-attention kernels; torch.compile brings transparent just-in-time compilation to AI frameworks; and fully-sharded data-parallel (FSDP) training enables scalable training for large models. Together with expanded and strengthened open-source governance, these new features make PyTorch 2.0 an ideal foundation for research and collaborative innovation fueling the era of Generative AI.

Michael Gschwind, Director, Meta Platforms

Michael Gschwind leads AI Acceleration at Meta Platforms. Prior to joining Meta, he was VP and Chief Architect of Accelerators and AI at Huawei and Chief Engineer for AI at IBM, where he created and led IBM’s first AI hardware and software products. Michael Gschwind is a recognized leader in high-performance computing, inventor of general-purpose programmable accelerators first released with Cell, and chief architect of three supercomputers that ranked #1 on the TOP500 list of the world’s fastest supercomputers. Michael has been recognized as an IEEE Fellow and is one of the most prolific innovators in history as co-inventor of over 800 patents.

4. Ray: Distributed Compute Framework For Scalable AI

Abstract: Ray is a powerful open-source framework for developing and deploying scalable, distributed ML workloads. This talk discusses the history and fundamentals of Ray and why it is able to remove many of the challenges users face when building distributed ML systems. We will also discuss a couple of use cases from our lighthouse industry partners and where we see the field going.

Jun Gong, ML Tech Lead, Anyscale

Jun is the tech lead of the ML team at Anyscale Inc., the company behind the popular open-source framework Ray. He leads Anyscale's research and development work in deep learning, reinforcement learning, and generative AI technologies. Before Anyscale, Jun was a Silicon Valley veteran with a distinguished career at companies such as Google and Meta. He is also a co-author of a Nature publication on controlling high-altitude balloons using reinforcement learning.

5. A First Look at AI Compiler Technology for Diverse Heterogeneous Compute

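Abstract: Heterogeneous chips are at the core of AI compute. However, because adapting software to hardware takes an enormous amount of work, framework support often lags behind the release of new hardware, which limits how developers and end users can actually use the added compute. AI compilers are a key technology for automating software-hardware adaptation, but although the technology is meant to be general, today's mainstream implementations still target specific software and hardware platforms and do little to solve the adaptation problem faced by new hardware vendors. To address this pain point, this talk explores backend-neutral AI compiler technology for future compute adaptation needs and presents BAAI's initial progress in AI compilers over the past six months.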

Tongxin Bai (白童心), Head of AI Compilers, BAAI

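Current research interests: deep learning compilers and deep learning frameworks. Previously worked at Baidu PaddlePaddle, the Siemens China Research Institute, the Shenzhen Institutes of Advanced Technology of the Chinese Academy of Sciences, H2O.ai, and Teradata. Holds a PhD in computer science from the University of Rochester.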

6. Building Large Language Model Training Systems More Efficiently with NVIDIA Megatron

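Abstract: Large language models are among the most important advanced technologies today. Megatron and NeMo Framework, the PyTorch-based training frameworks developed by NVIDIA, introduce pioneering techniques that speed up training and scale well, making large models easier to train and deploy, with lower time cost and faster results. This talk takes an in-depth look at how NVIDIA's combined GPU hardware and software innovations drive the evolution of large-model technology, at the optimization methods of the NVIDIA Megatron framework, including its efficient parallel strategies and memory optimization techniques, and at how to use the framework to improve model development efficiency and model quality in the era of large models.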

Junjie Lai (赖俊杰), Senior Director, NVIDIA

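Dr. Junjie Lai is Senior Director of Engineering and Solutions at NVIDIA, leading the team that addresses the technical needs of key customers in China and develops the corresponding industry solutions and custom software products. Dr. Lai received bachelor's and master's degrees from the Department of Electronic Engineering at Tsinghua University and a PhD from Inria, the French National Institute for Research in Computer Science and Automation. Dr. Lai's doctoral research focused on GPU architecture and GPU performance analysis models. Dr. Lai specializes in high-performance computing, parallel computing, artificial intelligence, and internet industry applications, and has extensive experience in performance analysis and optimization of parallel programs.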
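
As a rough illustration of one kind of parallel strategy relevant to large-model training, the sketch below simulates column-wise tensor parallelism for a single linear layer in plain Python. It is an assumption that the talk covers this particular strategy, since the abstract only mentions efficient parallel strategies in general. The function names (`matmul`, `split_columns`, `column_parallel_linear`) are hypothetical and are not part of the Megatron or NeMo APIs, and the "devices" here are simply list entries in one process.

```python
def matmul(x, w):
    """Plain matrix product of x (m x k) and w (k x n), both lists of lists."""
    return [
        [sum(x_row[t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]


def split_columns(w, parts):
    """Split the columns of w into `parts` contiguous shards (one per simulated device)."""
    n = len(w[0])
    assert n % parts == 0, "this sketch assumes the column count divides evenly"
    width = n // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def column_parallel_linear(x, w, parts):
    """Compute x @ w by giving each simulated device one column shard of w."""
    shards = split_columns(w, parts)
    # Each partial product would run on a separate device in a real system.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Concatenating the partial outputs column-wise plays the role of an all-gather.
    return [sum((out[i] for out in partial_outputs), []) for i in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
full = matmul(x, w)
sharded = column_parallel_linear(x, w, parts=2)
print(full == sharded)
```

Because each shard holds only a slice of the weight matrix, the per-device memory for the layer's weights shrinks roughly in proportion to the number of shards, at the cost of a communication step to reassemble the output.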

7. Automatic Operator Tuning for Large Models

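Abstract: Automatic operator tuning is an active research topic in high-performance computing, compiler optimization, and deep learning. Tuning deep learning operators is an effective way to make deep learning algorithms run more efficiently. In the era of large models, the wide adoption of specialized accelerator hardware and the rapid growth in algorithm types and network scale pose further challenges for operator tuning. This talk introduces automatic operator tuning for specialized hardware and techniques for accelerating the tuning process, as a reference for improving both the execution efficiency and the compilation efficiency of deep learning operators.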

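Qi Guo (郭崎), Research Professor and Doctoral Supervisor, Institute of Computing Technology, Chinese Academy of Sciences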

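Has long been engaged in research on computer architecture and artificial intelligence. Has published more than 70 papers in top conferences and journals in these fields and has applied for (or been granted) more than 100 patents. Selected for the national-level Young Top-Notch Talent program, recipient of the Chinese Academy of Sciences (CAS) Young Scientist Award, and named an Outstanding Member of the CAS Youth Innovation Promotion Association. Awards include the Second Prize of the National Natural Science Award and the CAS Outstanding Science and Technology Achievement Prize.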
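
To make the idea of operator auto-tuning in this talk concrete, here is a minimal sketch, assuming the simplest possible setup: a tiled matrix multiplication in plain Python whose tile size is chosen by timing every candidate and keeping the fastest. This is an exhaustive search over one parameter, not the method presented in the talk; real tuners target accelerator kernels and search far larger spaces. The names `matmul_tiled` and `autotune` are hypothetical.

```python
import time


def matmul_tiled(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `tile`."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(kernel, args, candidates, repeats=3):
    """Time `kernel(*args, candidate)` for each candidate; return (best, timings)."""
    timings = {}
    for cand in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args, cand)
            # Keep the minimum over repeats to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[cand] = best
    return min(timings, key=timings.get), timings


n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
candidates = [4, 8, 16, 32]
best_tile, timings = autotune(matmul_tiled, (a, b, n), candidates)
print("fastest tile size on this machine:", best_tile)
```

The winning tile size depends on the machine and the run, which is the point of auto-tuning: the choice is measured rather than fixed in advance. Every candidate computes the same result, so the search changes only performance, never correctness.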

8. Jiuding: Supporting Large-Model Training and Chip Architecture Diversity

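Abstract: As the parameter counts of foundation models keep growing, AI systems and platforms face more and more challenges. Behind this “brute-force aesthetic” lies a need for substantial systems and software engineering capability. At the same time, the new R&D paradigm of large models requires matching platform capabilities. In this talk, we introduce the work that BAAI's Jiuding platform has done to support foundation model training, along with our exploration of running large-model workloads on a variety of domestic heterogeneous chips. We also share some thoughts on the future of AI systems and platforms.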

Yan Tian (田燕), Head of AI Systems Platform R&D, BAAI

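Responsible for R&D of AI systems and platform software. Has many years of R&D and team management experience in infrastructure areas including the Linux operating system, distributed computing and storage, and AI platforms. Previously a technical manager at Baidu, and head of the machine learning platform and of the platform architecture team at Kuaishou. Led the build-out of Kuaishou's machine learning platform from scratch to unified support for the company's multiple business lines, and was responsible for several technical teams covering distributed storage, network product R&D, and operating systems.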

9. Building a Domestic Large-Model Training System on High-Performance General-Purpose GPUs

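Abstract: The success of ChatGPT is a joint innovation of engineering and algorithms, and also a system-level product innovation. For GPT, the foundation of ChatGPT, large-scale distributed training poses enormous challenges in parameter scale, compute scale, and training performance. This talk covers distributed parallel training strategies for GPT large models, how to build a large-model training system on domestic high-compute general-purpose GPUs, and how to build an industry ecosystem for large models.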

Yunfan Ding (丁云帆), Vice President of System Architecture, Biren Technology

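Responsible for AI software architecture. Previously Principal System Architect at Baidu; recipient of Baidu's highest technology award and the China Patent Excellence Award. Took part in leading OAI & OAM, the international standards for AI acceleration infrastructure. Pioneered the use of GPU architectures to address the challenge of training 10 TB-scale sparse models for advertising and recommendation. The results were published at MLSys, a top conference in machine learning and systems, and the work has since shaped the direction of training frameworks for internet advertising and recommendation.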

10. Compute Opportunities and Practice in the Era of Large Models

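Abstract: Viewed across the development cycles of the computing industry, AI chips are an inevitable trend. In the AI era, demand for computation keeps growing and AI chips have emerged in response; in the era of large models in particular, they provide the core underlying compute. Kunlunxin has built a multi-tier product portfolio around large models that supports models with billions, tens of billions, and more than a hundred billion parameters, with performance ahead of mainstream GPUs in the industry. For large-model applications, Kunlunxin provides an integrated hardware and software stack, including the Kunlunxin XFT large-model inference acceleration library and framework-layer components.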

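Kunlunxin chips are already deployed in Baidu Search and other business scenarios, with tens of thousands of chips in production, providing the underlying compute for both online and offline Baidu Search workloads. Beyond Baidu, Kunlunxin offers customers customized large-model services, using its compute to support the full process of bringing large models into production.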

Hang Luo (罗航), R&D Director, Kunlunxin

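Holds a master's degree from the University of Science and Technology of China. Formerly a senior architect at Baidu, with more than ten years of experience in R&D and management of internet products and technology. Currently R&D Director at Kunlunxin, responsible for AI solution development and commercialization, and committed to contributing to the digital and intelligent transformation of all industries through secure, reliable, high-performance AI chips and solutions. Has extensive experience in AI chip solutions, has taken part in drafting national technical standards for AI chips on multiple occasions, and has spoken at industry technology summits.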

11. Deep Optimization of Highly Concurrent File Access for Large-Model Training

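Abstract: Training large AI models challenges storage systems on many fronts: high bandwidth, ultra-low latency, high reliability, and tenant management. First, training data is usually very large, potentially tens or even hundreds of billions of samples, so the storage system needs enough capacity to store the data and access it efficiently. Second, training involves heavy computation that produces large volumes of intermediate results and model parameters, which must be stored and read back promptly. In addition, training is usually distributed, so the storage system must provide high availability and fault tolerance and must support concurrent reads and writes from a large number of GPU/CPU clients. To meet these challenges in large-model training, China Telecom Tianyi Cloud's parallel storage team has explored a range of deep optimizations.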