An Open Source Unified Language Learner – Google AI Blog


Building models that understand and generate natural language well is one of the grand goals of machine learning (ML) research and has a direct impact on building practical systems for everyday applications. Improving the quality of language models is a key objective for researchers to make progress toward such a goal.

The most common paradigms to build and train language models use either autoregressive decoder-only architectures (e.g., PaLM or GPT-3), where the model is trained to predict the next word for a given prefix, or span corruption-based encoder-decoder architectures (e.g., T5, ST-MoE), where the training objective is to recover the subset of words masked out of the input. On the one hand, T5-like models perform well on supervised fine-tuning tasks, but struggle with few-shot in-context learning. On the other hand, autoregressive language models are great for open-ended generation (e.g., dialog generation with LaMDA) and prompt-based learning (e.g., in-context learning with PaLM), but may perform suboptimally on fine-tuning tasks. Thus, there remains an opportunity to create an effective unified framework for pre-training models.

In “Unifying Language Learning Paradigms”, we present a novel language pre-training paradigm called Unified Language Learner (UL2) that improves the performance of language models universally across datasets and setups. UL2 frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. We demonstrate that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and models fine-tuned for down-stream tasks. Additionally, we show that UL2 excels in generation, language understanding, retrieval, long-text understanding and question answering tasks. Finally, we are excited to publicly release the checkpoints for our best performing UL2 20 billion parameter model.

Background: Language Modeling Objectives and Architectures

Common objective functions for training language models can mostly be framed as learning data transformations that map inputs to targets. The model is conditioned on different forms of input to predict target tokens. To this end, different objectives exploit different properties of the inputs.

The standard causal language modeling objective (CausalLM) is trained to predict full sequence lengths and so only recognizes tokens in the target output. The prefix language modeling objective (PrefixLM) modifies this process by randomly sampling a contiguous span of k tokens from the given tokenized text to form the input of the model, referred to as the “prefix”. The span corruption objective masks contiguous spans from the inputs and trains the model to predict these masked spans.

In the table below, we list the common objectives on which state-of-the-art language models are trained together with different characteristics of the input, i.e., how it is presented to the model. Moreover, we characterize the example efficiency of each objective in terms of the model's ability to exploit supervision signals from a single input, e.g., how much of the input tokens contribute to the calculation of the loss.

Objective Function | Inputs (Bi-directional) | Targets (Causal) | Input Properties | Example Efficiency
CausalLM | none | text | N/A | full seq_len
PrefixLM | text (up to position k) | text (after position k) | contiguous | seq_len - k
Span corruption | masked text | masked_tokens | non-contiguous, may be bi-directional | typically lower than others

Common objectives used in today’s language models. Throughout, “text” denotes tokenized text.

UL2 leverages the strengths of each of these objective functions through a framework that generalizes over each of them, which makes it possible to reason about and unify common pre-training objectives. Based on this framework, the main task for training a language model is to learn the transformation of a sequence of input tokens into a sequence of target tokens. Then all the objective functions introduced above can simply be reduced to different ways of generating input and target tokens. For instance, the PrefixLM objective can be viewed as a transformation that moves a segment of k contiguous tokens from the inputs to the targets. Meanwhile, the span corruption objective is a data transformation that corrupts spans (a subsequence of tokens in the input), replacing them with mask tokens that are shifted to the targets.
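To make this transformation view concrete, the following is a minimal Python sketch, not the actual UL2 or T5 preprocessing code, that expresses each objective as a function from a token list to an (inputs, targets) pair. The sentinel-mask format, the helper names, and the default span positions are illustrative assumptions.

```python
MASK = "<mask_{}>"  # sentinel-style mask token, assumed for illustration


def causal_lm(tokens):
    # CausalLM: no conditioning input; the whole sequence is the target.
    return [], list(tokens)


def prefix_lm(tokens, k):
    # PrefixLM: the first k contiguous tokens become the (bi-directional) input,
    # the remaining tokens become the (causal) target.
    return tokens[:k], tokens[k:]


def span_corruption(tokens, starts=(2, 8), span=2):
    # Span corruption: replace the (non-overlapping) spans beginning at `starts`,
    # each of length `span`, with sentinel masks in the input, and move the
    # original spans, prefixed by the same sentinels, to the target.
    inputs, targets, last = [], [], 0
    for i, s in enumerate(starts):
        inputs += tokens[last:s] + [MASK.format(i)]
        targets += [MASK.format(i)] + tokens[s:s + span]
        last = s + span
    inputs += tokens[last:]
    return inputs, targets


toks = "the quick brown fox jumps over the lazy dog today".split()
print(prefix_lm(toks, k=4))    # (['the', 'quick', 'brown', 'fox'], ['jumps', ...])
print(span_corruption(toks))   # masked input, sentinel-prefixed target
```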

It is worth noting that one can decouple the model architecture from the objective function with which it is trained. Thus, it is possible to train different architectures, such as the common single-stack decoder-only and two-stack encoder-decoder models, with any of these objectives.

Mixture of Denoisers

The UL2 framework can be used to train a model on a mixture of pre-training objectives and supply it with capabilities and inductive bias benefits from different pre-training tasks. Training on the mixture helps the model leverage the strengths of different tasks and mitigates the weaknesses of others. For instance, the mixture-of-denoisers objective can strongly improve the prompt-based learning capability of the model compared to a span corruption-only T5 model.

UL2 is trained using a mixture of three denoising tasks: (1) R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective; (2) X-denoising (or extreme span corruption); and (3) S-denoising (or sequential PrefixLM). During pre-training, we sample from the available denoising tasks based on user-specified ratios (i.e., different combinations of the R, X, and S-denoisers) and prepare the input and target appropriately. Then, a paradigm token is appended to the input (one of [R], [X], or [S]) indicating the denoising task at hand.

An overview of the denoising objectives used in UL2’s mixture-of-denoisers.
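As a loose illustration of the sampling step described above, the snippet below reuses the prefix_lm and span_corruption helpers from the earlier sketch. The mixture ratios and the per-denoiser settings are made up for illustration; they are not the configurations used in the paper.

```python
import random

# Hypothetical mixture ratios, chosen for illustration only.
DENOISER_RATIOS = {"[R]": 0.5, "[X]": 0.25, "[S]": 0.25}


def ul2_example(tokens):
    # Sample a denoiser according to the user-specified ratios.
    tag = random.choices(list(DENOISER_RATIOS),
                         weights=list(DENOISER_RATIOS.values()))[0]
    if tag == "[R]":    # regular span corruption: short spans, low corruption rate
        inputs, targets = span_corruption(tokens, starts=(2,), span=2)
    elif tag == "[X]":  # extreme span corruption: longer / more aggressive spans
        inputs, targets = span_corruption(tokens, starts=(1, 5), span=3)
    else:               # [S]: sequential PrefixLM-style denoising
        inputs, targets = prefix_lm(tokens, k=len(tokens) // 2)
    # The paradigm token tells the model which denoising task it is solving.
    return [tag] + inputs, targets
```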

Improving Trade-Offs Across Learning Paradigms

Many existing, commonly used language learning paradigms typically excel at one type of task or application, such as fine-tuning performance or prompt-based in-context learning. In the plot below, we show baseline objective functions on different tasks compared to UL2: CausalLM (referred to as GPT-like), PrefixLM, Span Corrupt (also referred to as T5 in the plot), and a baseline objective function proposed by UniLM. We use these objectives for training decoder-only architectures (green) and encoder-decoder architectures (blue) and evaluate different combinations of objective functions and architectures on two main sets of tasks:

  1. Fine-tuning, by measuring performance on SuperGLUE (y-axis of the plot below)
  2. In-context learning, by measuring performance of the model on a suite of 1-shot GEM tasks (e.g., XSUM, SGD or Schema Guided Dialog, and ToTTo) (x-axis of the plot below).

For most of the existing language learning paradigms, there is a trade-off between the quality of the model on these two sets of tasks. We show that UL2 bridges this trade-off across in-context learning and fine-tuning.

In both decoder-only and encoder-decoder setups, UL2 strikes a significantly improved balance in performance between fine-tuned discriminative tasks and prompt-based 1-shot open-ended text generation compared to previous methods. All models are comparable in terms of computational costs, i.e., FLOPs (EncDec models are 300M and Dec models are 150M parameters).

UL2 for Few-Shot Prompting and Chain-of-Thought Reasoning

We scale up UL2 and train a 20 billion parameter encoder-decoder model on the public C4 corpus and demonstrate some impressive capabilities of the UL2 20B model.

UL2 is a powerful in-context learner that excels at both few-shot and chain-of-thought (CoT) prompting. In the table below, we compare UL2 with other state-of-the-art models (e.g., T5 XXL and PaLM) for few-shot prompting on the XSUM summarization dataset. Our results show that UL2 20B outperforms PaLM and T5, both of which are in the same ballpark of compute cost.

Model | ROUGE-1 | ROUGE-2 | ROUGE-L
LaMDA 137B | - | 5.4 | -
PaLM 62B | - | 11.2 | -
PaLM 540B | - | 12.2 | -
PaLM 8B | - | 4.5 | -
T5 XXL 11B | 0.6 | 0.1 | 0.6
T5 XXL 11B + LM | 13.3 | 2.3 | 10.7
UL2 20B | 25.5 | 8.6 | 19.8

Comparison of UL2 with T5 XXL, PaLM and LaMDA 137B on 1-shot summarization (XSUM) in terms of ROUGE-1/2/L (higher is better), which captures quality by comparing the generated summaries with the gold summaries as reference.
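The exact prompt templates used for these evaluations are not reproduced here; the snippet below is only a generic illustration of how a 1-shot summarization prompt might be assembled. The field names and formatting are assumptions, not the evaluation setup.

```python
def one_shot_xsum_prompt(example_doc, example_summary, test_doc):
    # Build a 1-shot prompt: one solved example followed by the test document;
    # the model is expected to continue with the summary.
    return (
        f"Document: {example_doc}\n"
        f"Summary: {example_summary}\n\n"
        f"Document: {test_doc}\n"
        f"Summary:"
    )
```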

Most CoT prompting results have been obtained using much larger language models, such as GPT-3 175B, PaLM 540B, or LaMDA 137B. We show that reasoning via CoT prompting can be achieved with UL2 20B, which is both publicly available and several times smaller than prior models that leverage chain-of-thought prompting. This opens an avenue for researchers to conduct research on CoT prompting and reasoning at an accessible scale. In the table below, we show that for UL2, CoT prompting outperforms standard prompting on math word problems with a range of difficulty (GSM8K, SVAMP, ASDiv, AQuA, and MAWPS). We also show that self-consistency further improves performance.

Chain-of-thought (CoT) prompting and self-consistency (SC) results on five arithmetic reasoning benchmarks.
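Self-consistency, as applied here, amounts to sampling several CoT completions and taking a majority vote over their final answers. The snippet below is a minimal sketch of that voting step; generate_cot is a hypothetical stand-in for sampling one completion from the model and extracting its final answer.

```python
from collections import Counter


def self_consistency(prompt, generate_cot, num_samples=16):
    # generate_cot(prompt) is assumed to return (reasoning_text, final_answer)
    # from one sampled chain-of-thought completion of the model.
    answers = [generate_cot(prompt)[1] for _ in range(num_samples)]
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0]
```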

Conclusion and Future Directions

UL2 demonstrates superior performance on a plethora of fine-tuning and few-shot tasks. We publicly release checkpoints of our best performing UL2 model with 20 billion parameters, which we hope will inspire faster progress in developing better language models in the machine learning community as a whole.
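One possible way to try the released model is sketched below. It assumes a converted checkpoint is available on the Hugging Face hub under the identifier google/ul2 (the official release is in T5X format), and the mode token and generation settings shown are assumptions rather than documented usage.

```python
# Minimal sketch: load a converted UL2 20B checkpoint and generate from a prompt.
# Assumes a checkpoint mirrored on the Hugging Face hub as "google/ul2"; the 20B
# model needs substantial accelerator memory.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = T5ForConditionalGeneration.from_pretrained("google/ul2", device_map="auto")

# "[S2S]" is used here as a sequence-to-sequence mode token; treat it as an
# assumption and check the released checkpoint's documentation.
prompt = "[S2S] Summarize: The quick brown fox jumped over the lazy dog."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```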

Acknowledgements

It was an honor and privilege to work on this with Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby and Donald Metzler. We further acknowledge Alexey Gritsenko, Andrew M. Dai, Jacob Devlin, Jai Gupta, William Fedus, Orhan Firat, Sebastian Gehrmann, Nan Du, Dave Uthus, Siamak Shakeri, Slav Petrov and Quoc Le for support and discussions. We thank the Jax and T5X team for building such wonderful infrastructure that made this research possible.
