AIGC Cutting Edge Research Report (2024/May)

Authors
Published
May 7, 2024

This post is for anyone on RH interested in AI-related research. Below are the 10 most recently published AIGC projects I collected; feel free to discuss any that interest you! I may have missed some, so please comment below and I will add them in the future!

1. Predict multiple tokens simultaneously: better and faster large language models

Currently, large language models (LLMs) such as GPT and Llama are trained with a next-token prediction loss.

In this work, a research team from Meta FAIR argues that training a language model to predict multiple tokens at once can improve sampling efficiency. More specifically, at each position in the training corpus, they use n independent output heads to predict the following n tokens on top of a shared model backbone. Treating multi-token prediction as an auxiliary training task, they measured improved downstream capabilities for both code and natural language models, with no overhead in training time.
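
The head arrangement can be sketched in a few lines; everything here (the sizes, the random "backbone" state, the absence of a real transformer trunk) is a toy stand-in rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 16, 50, 4  # hypothetical toy sizes

# Shared backbone output for one position (in a real LLM this would be
# the transformer trunk's hidden state).
h = rng.normal(size=d_model)

# n independent output heads; head i predicts token t+1+i.
heads = [rng.normal(size=(vocab, d_model)) for _ in range(n_heads)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Each head turns the same hidden state into its own next-token distribution,
# so one forward pass through the trunk yields n token predictions.
predictions = [softmax(W @ h) for W in heads]
```

At inference time, a standard model simply drops the extra heads and keeps the next-token head, or uses the extra heads for speculative decoding.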

This approach is increasingly useful at larger model sizes and retains its advantage when training for multiple epochs. Its gains are particularly pronounced on generative benchmarks such as coding, where it consistently outperforms strong baselines by several percentage points. Compared to an otherwise identical next-token model, their 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP.

Experiments on small algorithmic tasks show that multi-token prediction favors the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3x faster at inference, even with large batch sizes.

Reference: https://arxiv.org/abs/2404.19737


2. InstantFamily: zero-shot multi-identity image generation

In the field of personalized image generation, the ability to create images that preserve a given concept has greatly improved, but it remains challenging to create a visually appealing image that naturally blends multiple concepts together.

SK Telecom proposes InstantFamily, which employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Their method utilizes global and local features from a pre-trained face recognition model, combined with text conditions, to effectively preserve identities.

Furthermore, their masked cross-attention mechanism enables precise control over multiple IDs and their composition in generated images. Experiments demonstrate its advantages in generating multi-ID images and in solving the multi-ID generation problem. The model achieves state-of-the-art performance in both single-ID and multi-ID preservation, and also shows notable scalability, preserving a larger number of IDs than it was originally trained on.
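
A minimal sketch of what a masked cross-attention step might look like, assuming each pixel is assigned to at least one ID region; the names, shapes, and mask convention are illustrative assumptions, not InstantFamily's actual API:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(queries, id_embeds, mask):
    """queries: (num_pixels, d) image features; id_embeds: (num_ids, d)
    face-ID embeddings; mask: (num_pixels, num_ids), 1 where an ID is
    allowed to influence a pixel (e.g., that person's face region)."""
    d = queries.shape[1]
    scores = queries @ id_embeds.T / np.sqrt(d)
    scores = np.where(mask.astype(bool), scores, -1e9)  # block other IDs
    return softmax(scores, axis=-1) @ id_embeds
```

With a one-hot mask, each pixel copies only its assigned identity's embedding, which is the intuition behind the precise per-ID control described above.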

Reference: https://arxiv.org/abs/2404.19427


3. Meta proposes “iterative reasoning preference optimization”

Recent studies have shown that iterative preference optimization methods perform well on general instruction fine-tuning tasks, but often yield little improvement on reasoning tasks.

A team of researchers at Meta and NYU developed an iterative approach that optimizes the preference between competing generated chain-of-thought (CoT) candidates, preferring winning reasoning steps that lead to the correct answer over losing ones. They train with a modified DPO loss that includes an additional negative log-likelihood term.
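
A sketch of what such a per-pair loss could look like; the function and argument names are hypothetical, and the `alpha` weighting and length normalization follow the paper only in spirit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_plus_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, alpha=1.0, len_w=1):
    """logp_w / logp_l: sequence log-probs of the winning / losing CoT
    under the current policy; ref_logp_*: same under the frozen
    reference model. Returns the DPO term plus a weighted NLL term
    that keeps probability mass on the winning chain of thought."""
    dpo = -np.log(sigmoid(beta * ((logp_w - ref_logp_w)
                                  - (logp_l - ref_logp_l))))
    nll = -logp_w / len_w  # length-normalized NLL of the winner
    return dpo + alpha * nll
```

The added NLL term is the key difference from plain DPO: it pushes the winner's likelihood up directly, rather than only widening the winner-loser margin.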

The results show that reasoning ability improves over repeated iterations of the scheme. Relying only on examples in the training set, this method improves the accuracy of Llama-2-70B-Chat on GSM8K from 55.6% to 81.6% (88.7% with majority voting over 32 samples), on MATH from 12.5% to 20.8%, and on ARC-Challenge from 77.8% to 86.7%, surpassing other Llama-2-based models that do not rely on additional datasets.

Reference: https://arxiv.org/abs/2404.19733


4. SPPO: a large-model alignment method based on self-play

Traditional reinforcement learning from human feedback (RLHF) methods rely on parametric models such as the Bradley-Terry model, which cannot fully capture the intransitivity and irrationality of human preferences. Recent advances suggest that working directly with preference probabilities more accurately reflects human preferences, allowing more flexible and accurate language model alignment.
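
A quick way to see the limitation: Bradley-Terry derives every pairwise probability from scalar scores, which forces preferences to be transitive, whereas a general preference-probability matrix can encode cycles. The numbers below are illustrative toys:

```python
import numpy as np

def bt_prob(r_i, r_j):
    """Bradley-Terry: P(i beats j) from scalar 'reward' scores."""
    return 1.0 / (1.0 + np.exp(-(r_i - r_j)))

# Under Bradley-Terry, preferences are always transitive:
# if P(A>B) > 0.5 and P(B>C) > 0.5, then P(A>C) > 0.5 is forced.
rA, rB, rC = 2.0, 1.0, 0.0
assert bt_prob(rA, rB) > 0.5 and bt_prob(rB, rC) > 0.5
assert bt_prob(rA, rC) > 0.5  # no score assignment can break this

# A general preference matrix can encode a cycle (A>B, B>C, C>A),
# which no choice of scalar scores can reproduce.
P = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
assert P[0, 1] > 0.5 and P[1, 2] > 0.5 and P[2, 0] > 0.5
```

Methods that operate on the preference matrix directly, like SPPO below, are therefore strictly more expressive than reward-model-based pipelines.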

A research team from UCLA and Carnegie Mellon University proposed SPPO, a language model alignment method based on self-play that treats the problem as a constant-sum two-player game and aims to find the Nash equilibrium policy. It approximates the Nash equilibrium through iterative policy updates and comes with theoretical convergence guarantees. The method can effectively increase the log-likelihood of chosen responses and decrease that of rejected responses, which symmetric pairwise losses such as direct preference optimization (DPO) and identity preference optimization (IPO) cannot achieve.

Experiments show that SPPO uses only 60k prompts (without responses) from the UltraFeedback dataset and performs no prompt augmentation. Using PairRM, a pre-trained preference model with only 0.4B parameters, a model fine-tuned from Mistral-7B-Instruct-v0.2 achieved a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, SPPO's strong performance is achieved without additional external supervision (e.g., preferences) from GPT-4 or other stronger language models.


Reference: https://arxiv.org/abs/2405.00675

5. ByteDance and Nankai team's StoryDiffusion: improving the consistency of image and video generation

Maintaining content consistency across a series of generated images, especially those containing subjects and complex details, is a huge challenge for state-of-the-art diffusion-based generative models.

A research team from Nankai University and ByteDance proposed a new way of computing self-attention, Consistent Self-Attention, which significantly improves consistency between generated images and enhances pre-trained diffusion-based text-to-image models in a zero-shot manner.

To extend this method to long video generation, they further proposed a novel semantic spatial-temporal motion prediction module named “Semantic Motion Predictor”. The module is trained to estimate motion between two provided images in semantic space. This module converts generated image sequences into videos with smooth transitions and consistent subjects, and is significantly more stable than modules based solely on latent space, especially in the case of generating long videos.
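
The core trick of Consistent Self-Attention, as described, is letting each image in a story batch also attend to tokens sampled from the other images. A toy, single-head sketch of the idea (no learned projections; the real method operates on features inside a diffusion U-Net):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistent_self_attention(batch_tokens, sample_ratio=0.5):
    """batch_tokens: (B, T, d) tokens for B story images, B >= 2.
    Each image attends to its own tokens plus tokens sampled from the
    other images, which ties the subject's appearance across the batch."""
    B, T, d = batch_tokens.shape
    out = np.empty_like(batch_tokens)
    k = int(T * sample_ratio)
    for i in range(B):
        others = np.concatenate(
            [batch_tokens[j] for j in range(B) if j != i])
        idx = rng.choice(len(others), size=k, replace=False)
        kv = np.concatenate([batch_tokens[i], others[idx]])  # shared K/V
        attn = softmax(batch_tokens[i] @ kv.T / np.sqrt(d), axis=-1)
        out[i] = attn @ kv
    return out
```

Because the keys and values are shared across images, features describing the subject leak between frames, which is what keeps faces and clothing consistent without any fine-tuning.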

Furthermore, by merging these two novel components, their framework StoryDiffusion can tell a text-based story through consistent images or videos containing rich content.

Reference: https://arxiv.org/abs/2405.01434

GitHub: https://github.com/HVision-NKU/StoryDiffusion


6. Customizing a text-to-image model with a “single image pair”

Artistic reinterpretation refers to the creation of variations on reference works so that the paired artworks exhibit a unique artistic style. However, can such image pairings be used to tailor generative models to capture the stylistic differences exhibited?

A research team from Carnegie Mellon University and Northeastern University proposed a new customization method, Pair Customization, which learns the style difference from a single image pair and then applies the learned style to the generation process. Unlike existing methods that learn to imitate a single concept from a collection of images, this method captures the stylistic difference between paired images. This allows it to apply stylistic changes without overfitting to the specific image content in the example.

To accomplish this new task, they employ a joint optimization approach that explicitly separates style and content into different LoRA weight spaces. They optimize these style and content weights to reproduce style and content images.
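
Separating style and content into different LoRA weight spaces can be sketched as two disjoint low-rank updates on a frozen base weight; the shapes and the `style_scale` knob below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 8, 2  # toy dimensions

W0 = rng.normal(size=(d, d))  # frozen pretrained weight
# Two disjoint low-rank adapters: one optimized to reproduce the
# content image, one optimized to reproduce the style difference.
B_content, A_content = rng.normal(size=(d, rank)), rng.normal(size=(rank, d))
B_style, A_style = rng.normal(size=(d, rank)), rng.normal(size=(rank, d))

def forward(x, use_content=True, use_style=True, style_scale=1.0):
    """Apply the base weight plus whichever LoRA branches are enabled.
    style_scale lets inference dial the learned style up or down."""
    W = W0.copy()
    if use_content:
        W = W + B_content @ A_content
    if use_style:
        W = W + style_scale * (B_style @ A_style)
    return W @ x
```

Because the two adapters live in separate weight spaces, the style branch can be applied to entirely new content at inference time without dragging along the example image's content.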

During inference, they modify the diffusion process through new style guidance based on the learned weights. Both qualitative and quantitative experiments demonstrate that their approach can effectively learn style while avoiding overfitting to image content, highlighting the potential to model such style differences from a single image pair.

Reference: https://arxiv.org/abs/2405.01536

Project page: https://paircustomization.github.io/


7. New research from Meta: efficient training of language models

Currently, training language models (LMs) relies on computationally expensive runs over massive datasets, which makes the training process extremely laborious. A research team from Meta FAIR proposes a new method to numerically assess text quality in large unlabeled NLP datasets in a model-agnostic way, assigning a “quality score” to each text instance.

Building on this text quality metric, they established a framework to identify and remove low-quality text instances, thereby improving LM training efficiency. Experimental results across multiple models and datasets show the effectiveness of this approach, demonstrate substantial improvements in training performance, and highlight the potential for resource-efficient LM training.
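
Operationally, such a framework reduces to "score, rank, keep the top fraction." The toy `score_fn` below is a placeholder to make the sketch runnable, not the paper's actual metric:

```python
def filter_by_quality(texts, score_fn, keep_fraction=0.6):
    """Rank texts by a model-agnostic quality score and keep the top
    fraction for training. score_fn and keep_fraction are stand-ins for
    the paper's metric and threshold."""
    scored = sorted(texts, key=score_fn, reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]

# Toy score: reward lexical diversity and length (illustrative only).
toy_score = lambda t: len(set(t.split())) / max(len(t.split()), 1) * len(t)

corpus = ["the the the the", "a short clean sentence",
          "another reasonably informative sentence here", "x"]
kept = filter_by_quality(corpus, toy_score, keep_fraction=0.5)
```

The training set shrinks (40% less data on OpenWebText in the results below) while the surviving examples carry more signal per token, which is where the speedups come from.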

For example, when training on the OpenWebText dataset, they observed a 0.9% increase in average absolute accuracy for multiple LM models across 14 downstream evaluation tasks, while using 40% less data and training 42% faster; when training on the Wikipedia dataset, average absolute accuracy improved by 0.8% while using 20% less data and training 21% faster.

Reference: https://arxiv.org/abs/2405.01582


8. Beyond GPT-4V, the Tsinghua team launches an open platform for embodied intelligence

Despite advances in large language models (LLMs) and large multimodal models (LMMs), integrating them into language-grounded, human-like embodied agents remains unfinished work, hampering their ability to execute complex real-world tasks in physical environments. Existing integrations often feature limited open-source code, posing a challenge to overall progress in the field.

Research teams from Tsinghua University and Central South University proposed LEGENT, an open and extensible platform for developing embodied agents with LLMs and LMMs. LEGENT offers a dual approach: a rich interactive 3D environment with communicative and operable agents, paired with a user-friendly interface, and a sophisticated data-generation pipeline that uses advanced algorithms to produce supervision from simulated worlds at scale.

Experimental results show that a prototype vision-language-action model trained on data generated by LEGENT surpasses GPT-4V in embodied tasks and demonstrates good generalization capabilities.

Reference: https://arxiv.org/abs/2404.18243

GitHub: https://github.com/thunlp/LEGENT

 


9. Cohere proposes a new evaluation method: replacing one large model with multiple small models

As large language models (LLMs) become more and more powerful, the industry struggles to evaluate them accurately. Not only is it difficult to find data that adequately tests specific model properties, it is also challenging to evaluate the correctness of a model's free-form generations.

To address this, many existing evaluations rely on using LLMs as “judges” to score the output quality of other LLMs, most commonly a single large model such as GPT-4. While this approach is growing in popularity, it is costly and has been shown to introduce intra-model bias.

In this work, the Cohere team found that a large judge model is often unnecessary. They recommend evaluating models with a Panel of LLM evaluators (PoLL). Across three different evaluation settings and six different datasets, they found that a PoLL composed of a larger number of smaller models, drawn from disjoint model families, outperforms a single large judge, exhibits less intra-model bias, and costs more than 7 times less.
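
The panel idea is easy to sketch: query several small judges and pool their scores. The `judges` below are rule-based stand-ins for real judge models from different families:

```python
from statistics import mean

def poll_score(answer, judges):
    """Panel of LLM evaluators: each judge scores the answer and the
    panel pools the scores (mean here; max voting is another option).
    'judges' is a list of callables standing in for judge models."""
    return mean(judge(answer) for judge in judges)

# Hypothetical judges with different scoring heuristics, mimicking
# judges drawn from disjoint model families.
judges = [lambda a: 1.0 if "paris" in a.lower() else 0.0,
          lambda a: 1.0 if a.strip().lower().startswith("paris") else 0.0,
          lambda a: 1.0 if len(a) < 40 else 0.5]

score = poll_score("Paris", judges)
```

Pooling across disjoint families is what dilutes any single judge's idiosyncratic bias, which is the mechanism behind the reduced intra-model bias reported above.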

Reference: https://arxiv.org/abs/2404.18796


10. Meta launches AdvPrompter, which generates human-readable adversarial prompts 800 times faster

Large language models (LLMs) have achieved impressive results recently, but they are vulnerable to certain jailbreaking attacks that lead to the generation of inappropriate or harmful content.

To conduct red teaming by hand, one must find adversarial prompts that trigger such jailbreaking behavior, e.g., by appending suffixes to a given instruction, which is inefficient and time-consuming. Automatically generating adversarial prompts, on the other hand, often yields semantically meaningless attacks that are easily detected by perplexity-based filters, may require gradient information from the TargetLLM, or scale poorly due to time-consuming discrete optimization over the token space.

In this study, the Meta team proposes a new method that uses an LLM called AdvPrompter to generate human-readable adversarial prompts in seconds, 800 times faster than existing optimization-based methods.

They train AdvPrompter with a new algorithm that does not require access to the TargetLLM's gradients. The process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter's predictions; and (2) low-rank fine-tuning of the AdvPrompter on the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, luring the TargetLLM into giving harmful responses. Experiments on open-source TargetLLMs show state-of-the-art results on the AdvBench dataset, which also transfer to closed-source black-box LLM APIs.
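
A heavily simplified toy of this alternating scheme, with a rule-based stand-in for the TargetLLM and a token-weight table standing in for the AdvPrompter's parameters (everything here is illustrative; no real attack or model is involved):

```python
import random

random.seed(0)
TOKENS = ["please", "story", "ignore", "zzz"]  # toy suffix vocabulary

def target_score(suffix):
    # Stand-in for the TargetLLM's propensity to comply (toy rule).
    return 1.0 if "ignore" in suffix else 0.0

weights = {t: 1.0 for t in TOKENS}  # toy "AdvPrompter" parameters

for step in range(20):
    # Step 1: sample candidate suffixes from the current prompter and
    # keep the one the target scores highest (suffix optimization).
    candidates = [random.choices(TOKENS,
                                 weights=[weights[t] for t in TOKENS],
                                 k=2) for _ in range(8)]
    best = max(candidates, key=lambda s: target_score(" ".join(s)))
    # Step 2: "fine-tune" the prompter toward the successful suffix.
    if target_score(" ".join(best)) > 0:
        for t in best:
            weights[t] += 0.5
```

After training, sampling from the prompter is a single cheap forward pass per suffix, which is the source of the speedup over per-prompt discrete optimization.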

Furthermore, they demonstrate that by fine-tuning on synthetic datasets generated by AdvPrompter, LLM can be made more resilient to jailbreaking attacks while maintaining performance (i.e., high MMLU scores).

Reference: https://arxiv.org/abs/2404.16873

