Research Directions to Watch in the Field of Big Model Infrastructure (Training, Inference, Hardware) in 2024

Published
Apr 28, 2024

In the field of artificial intelligence, the development and application of large-scale models are rapidly changing the way we understand and apply technology. As models grow larger, the demands on the underlying infrastructure grow with them. The following sections survey noteworthy research directions in large-model infrastructure (training, inference, and hardware) for 2024.

 

Training

Efficiency: With the prevalence of Transformer-based large models, big breakthroughs have become hard to come by: operator (kernel) optimisation plus parallelism optimisation have basically maxed out the MFU (model FLOPs utilisation) of SOTA GPUs given the available cross-card interconnect bandwidth. The rollout of new hardware may open up some breakthrough points.
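For readers unfamiliar with the metric, MFU is usually estimated as achieved FLOPs per second divided by the hardware's peak FLOPs. The sketch below is a minimal illustration of that bookkeeping; the "6 FLOPs per parameter per token" approximation for dense Transformer training and every concrete number in it are illustrative assumptions, not measurements.

```python
# Minimal sketch of MFU (model FLOPs utilisation) bookkeeping.
# The "6 FLOPs per parameter per token" approximation for dense Transformer
# training and every concrete number below are illustrative assumptions.

def training_mfu(num_params, tokens_per_step, step_time_s,
                 num_devices, peak_flops_per_device):
    """Approximate MFU for one training step of a dense Transformer."""
    achieved_flops = 6.0 * num_params * tokens_per_step   # forward + backward estimate
    achieved_flops_per_s = achieved_flops / step_time_s
    cluster_peak = num_devices * peak_flops_per_device
    return achieved_flops_per_s / cluster_peak

if __name__ == "__main__":
    # Hypothetical run: 7B parameters, 4M tokens per global step, 20 s per step,
    # on 64 accelerators with a nominal 312 TFLOP/s peak each.
    print(f"Estimated MFU: {training_mfu(7e9, 4e6, 20.0, 64, 312e12):.1%}")
```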

Ease of use: automatic parallelism caught fire in 2022, with several representative efforts including the Ray+JAX-based star project Alpa [OSDI'22], Unity [OSDI'22], the upgrade of the automatic-parallelism veteran FlexFlow, and Galvatron [VLDB'23], which mainly targets PyTorch and Transformer models. The fire carried into 2023, a year in which quite a lot of follow-up work appeared on arXiv, such as various co-optimisations layered on top of the original auto-parallelism schemes, but none of it seems to have gained much traction. Some related papers will probably still appear at top conferences in 2024, but the attention they draw may be limited.

Stability/Cost: Cluster fault tolerance was a hot topic in 2023 and is a real need for large enterprises building big models. A closely related direction, elastic computing, is also getting a lot of attention. Early work includes Varuna [EuroSys'22] (Best Paper) and Bamboo [NSDI'23], and 2023 saw the emergence of competitors such as Gemini [SOSP'23], Parcae [NSDI'24], and Oobleck [SOSP'23]. These directions may get more attention in 2024 (a minimal checkpointing sketch follows this list).
Few-shot learning: large models usually require large amounts of data to train. For many specific domains, however, obtaining large amounts of labelled data can be very difficult, so investigating how to train large models with small amounts of data is an important research direction.
Interpretability and transparency: as models become more complex, it becomes increasingly difficult to understand their decision-making processes. It is therefore crucial to investigate how to improve the interpretability and transparency of models so that users can understand and trust their decisions.
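To ground the fault-tolerance discussion above: the sketch below shows the basic periodic checkpoint-and-resume loop that systems like those cited extend with redundancy, in-memory checkpoints, and fast recovery. It assumes an ordinary PyTorch model and optimiser; the checkpoint path, save interval, and loss computation are hypothetical placeholders.

```python
# Minimal sketch of periodic checkpointing and resume: the basic building
# block that cluster fault-tolerance systems extend with redundancy,
# in-memory checkpoints, fast recovery, etc.
# CKPT_PATH, SAVE_EVERY and the loss computation are illustrative assumptions.
import os
import torch

CKPT_PATH = "checkpoint.pt"   # hypothetical checkpoint location
SAVE_EVERY = 100              # steps between checkpoints

def save_checkpoint(step, model, optimizer):
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Restore the latest checkpoint if one exists; return the step to resume from."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, batches, total_steps):
    step = load_checkpoint(model, optimizer)   # resume after a failure
    data = iter(batches)
    while step < total_steps:
        batch = next(data)                     # sketch: assumes enough batches remain
        loss = model(batch).mean()             # stand-in for the real loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % SAVE_EVERY == 0:
            save_checkpoint(step, model, optimizer)
        step += 1
```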

 

 

Inference

Efficiency: Incremental decoding has changed some of the algorithmic implementation mechanisms in LLMs, but after a full year of kernel optimisation in 2023, hardware performance has also been squeezed close to its limit (mainly in terms of memory bandwidth). Looking ahead to 2024, it is questionable how much better we can do than the current SOTA schemes in terms of system efficiency, especially on end-to-end inference latency as a single metric. One promising direction is speculative decoding; another is to optimise around other performance metrics such as throughput and TTFT/TPOT, where there should still be some room; and if we are willing to sacrifice some model quality and do not guarantee strict alignment of the outputs, there is even more that can be done. For more details, please refer to the review paper:

 https://doi.org/10.48550/arXiv.2312.15234
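As a rough illustration of the speculative-decoding idea mentioned above, the sketch below drafts k tokens with a cheap model and verifies them with a single pass of the expensive model, keeping the longest agreed prefix. It is a simplified greedy variant, not the full rejection-sampling algorithm from the literature, and the draft_model/target_model interfaces are assumptions.

```python
# Simplified greedy speculative decoding sketch.
# draft_model(tokens) and target_model(tokens) are assumed to return, for each
# position, the greedily chosen next token (a list of ints). This interface is
# hypothetical and stands in for real model forward passes.

def speculative_step(target_model, draft_model, tokens, k=4):
    """Propose k tokens with the cheap draft model, then verify them with a
    single call to the expensive target model (assumes a non-empty prompt)."""
    n = len(tokens)

    # 1. Draft: autoregressively propose k tokens with the small model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft)[-1])

    # 2. Verify: one target-model pass over the drafted sequence.
    #    target_next[j] is the target's greedy next token after draft[:j+1].
    target_next = target_model(draft)

    # 3. Keep the longest agreed prefix; on the first disagreement take the
    #    target's token instead, so every step emits at least one new token.
    accepted = list(tokens)
    for i in range(k):
        if draft[n + i] == target_next[n + i - 1]:
            accepted.append(draft[n + i])
        else:
            accepted.append(target_next[n + i - 1])
            break
    else:
        # All k proposals accepted: also take the target's bonus token.
        accepted.append(target_next[-1])
    return accepted
```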


Ease of use: there doesn't seem to be much interesting work on the inference side of automatic parallelism, and mainstream frameworks already support the various parallel strategies (e.g., TP/PP) very well. Personally, I feel that as large-model usage scenarios grow more complex, what may be more worth watching is how to better integrate the inference engine with the whole AI service pipeline to improve overall deployment and serving efficiency (a minimal sketch follows the next item).
Stability/System Cost: SpotServe [ASPLOS'24], which appeared in late 2023, is the first LLM inference system for preemptible clusters; it dramatically lowers inference cost by running on cheap spot instances and pioneers the combination of elastic computing / cluster fault tolerance with LLM inference. Looking ahead to 2024, more related work and products are likely to emerge, both in academia (e.g., SkyLab) and in industry, where Anyscale, for example, already offers spot instance support.
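As a minimal sketch of the "inference engine inside a service pipeline" point above, the example below wraps a placeholder engine in a FastAPI endpoint. EchoEngine and its generate method are hypothetical stand-ins for a real inference engine; batching, routing, rate limiting, caching, and monitoring would hook in around this.

```python
# Minimal sketch of wrapping an inference engine in a service endpoint.
# EchoEngine is a hypothetical placeholder for a real inference engine.
from fastapi import FastAPI
from pydantic import BaseModel

class EchoEngine:
    """Stand-in for a real engine; a real one would return generated text."""
    def generate(self, prompt: str, max_new_tokens: int) -> str:
        return prompt + " ..."

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

app = FastAPI()
engine = EchoEngine()   # in practice: load or connect to the deployed engine here

@app.post("/generate")
def generate(req: GenerateRequest):
    text = engine.generate(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": text}

# Run with, e.g.:  uvicorn app_module:app --host 0.0.0.0 --port 8000
```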

On the other hand, there is still a lot that can be done around LLM inference, but because the barrier to entry is low, people have to move fast. One interesting example: at the end of 2023, several teams working on multi-LoRA serving released their papers and systems almost simultaneously, which to some extent shows how competitive the space has become.
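For context on what multi-LoRA serving refers to, the sketch below shows the basic LoRA computation: a shared, frozen base weight plus a small per-request low-rank adapter. The shapes, scaling, and the adapter registry are illustrative assumptions; real multi-LoRA serving systems add batched custom kernels and adapter paging on top of this idea.

```python
# Basic idea behind multi-LoRA serving: one shared, frozen base weight plus
# many small per-request low-rank adapters. Shapes, scaling and the adapter
# registry below are illustrative assumptions.
import torch

d_in, d_out, rank = 1024, 1024, 16
W_base = torch.randn(d_out, d_in) / d_in ** 0.5       # shared base weight

# Hypothetical per-tenant adapters: A is (rank, d_in), B is (d_out, rank).
# B starts at zero, as in standard LoRA initialisation.
adapters = {
    "tenant_a": (torch.randn(rank, d_in) * 0.01, torch.zeros(d_out, rank)),
    "tenant_b": (torch.randn(rank, d_in) * 0.01, torch.zeros(d_out, rank)),
}

def lora_forward(x, adapter_id, scaling=1.0):
    """y = x W^T + scaling * (x A^T) B^T: base path plus low-rank update."""
    A, B = adapters[adapter_id]
    base = x @ W_base.t()
    delta = (x @ A.t()) @ B.t()   # two skinny matmuls, cheap per adapter
    return base + scaling * delta

# One batch can mix requests for different adapters; serving systems share the
# base matmul across the batch and apply each request's adapter separately.
x = torch.randn(4, d_in)
y_a = lora_forward(x, "tenant_a")
y_b = lora_forward(x, "tenant_b")
```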

 

Hardware

Customised AI chips: as the scale of models increases, traditional general-purpose computing chips may not be able to meet the demand. Therefore, research on customised AI chips, specifically designed for training and inference of large models, has become an important research direction.
Heterogeneous computing: using different types of hardware (e.g., CPU, GPU, TPU) to jointly handle model training and inference can improve efficiency. Therefore, research on how to effectively utilise heterogeneous computing resources is a direction of interest (a toy sketch follows this list).
New storage technology: large models require a large amount of storage space for model parameters and training data. Therefore, new storage technologies, such as non-volatile memory, are being studied as a way to improve the efficiency of model training and inference.
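As a toy illustration of the heterogeneous-computing point above, the sketch below keeps weights in CPU memory and streams one layer at a time to the accelerator for compute, which is the basic move that offloading systems make at much larger scale (with transfers overlapped with compute). The sizes and the simple synchronous loop are illustrative assumptions.

```python
# Toy illustration of CPU/GPU heterogeneous execution: parameters live in
# larger, slower CPU memory and are streamed to the accelerator only when
# needed. Sizes are illustrative; real offloading systems overlap the
# transfers with compute instead of running this simple synchronous loop.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# "Large" model resident in CPU memory, split into layers.
layers_cpu = [torch.randn(4096, 4096) for _ in range(8)]

def forward_with_offload(x):
    x = x.to(device)
    for w_cpu in layers_cpu:
        w = w_cpu.to(device)          # stream one layer's weights in
        x = torch.relu(x @ w.t())
        del w                         # release accelerator memory before the next layer
    return x

out = forward_with_offload(torch.randn(2, 4096))
```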

 

 

 

 

 
