[๋…ผ๋ฌธ๋ฆฌ๋ทฐ๐Ÿ“„] Language Models are Few-Shot Learners (GPT-3)

 

๋“ค์–ด๊ฐ€๊ธฐ์— ์•ž์„œ

ํ•ด๋‹น ํฌ์ŠคํŠธ๋Š” ๋…ผ๋ฌธ๊ณผ ๋‹ค์Œ์˜ 2๊ฐœ์˜ ์œ ํŠœ๋ธŒ ์˜์ƒ์„ ๋ณด๋ฉฐ ์ •๋ฆฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์˜ ๋‚ด์šฉ์„ ๊ธฐ๋ณธ์œผ๋กœ ํ–ˆ์ง€๋งŒ, ํ•„์š”์— ๋”ฐ๋ผ ๊ณต๋ถ€ํ•œ ๊ฒƒ๋“ค์„ ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๊ทธ ์ค‘์—์„œ๋„ Yannic Kilcher์˜ ์ฑ„๋„์€ ์ตœ๊ทผ ์—„์ฒญ๋‚œ ์†๋„๋กœ ์ตœ์‹  ๋…ผ๋ฌธ๋“ค์„ ๋ฆฌ๋ทฐํ•ด์ฃผ๋Š” ์œ ํŠœ๋ธŒ ์ฑ„๋„์ด๋‹ค.(์ •๋ง ๋ฉ‹์žˆ๋Š” ์—ฐ๊ตฌ์ž ๋ฐ ์œ ํŠœ๋ฒ„) ๋‹ค๋ฅธ ๋ฌด์—‡๋ณด๋‹ค ์ตœ๋Œ€ ๊ฐ•์ ์€ ๋…ผ๋ฌธ์ด ๋‚˜์˜จ์ง€ 1~2์ผ๋งŒ์— ๋ฆฌ๋ทฐ๋ฅผ ํ•ด์ฃผ๋ฉฐ, ์‹ค์ œ๋กœ ๋…ผ๋ฌธ์„ high level์—์„œ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋งŒ ์งš์–ด์ค€๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ NLP ์ด์™ธ์—๋„ ์ •๋ง ํฅ๋ฏธ๋กœ์šด ๋”ฅ๋Ÿฌ๋‹ ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋“ค์ด ๋งŽ์œผ๋‹ˆ ๊ผญ ํ•œ๋ฒˆ ๋ฐฉ๋ฌธํ•ด์„œ ์‚ดํŽด๋ณด์„ธ์š”! ๐Ÿ˜

๋…ผ๋ฌธ ํ•œ ์ค„ ์š”์•ฝ

[Figure: total compute used during training]

์–ด๋งˆ๋ฌด์‹œํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ(1,750์–ต๊ฐœ)๋ฅผ ๊ฐ€์ง€๊ณ  fine-tuning ์—†์ด few-shot learning์„ ํ†ตํ•ด ๋ช‡๋ช‡ NLP task์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค.

0. Abstract

๋งŽ์€ ์–ธ์–ด ๋ชจ๋ธ๋“ค์€ ํ˜„์žฌ pre-training โ†’ fine-tuning ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต๋˜๊ณ  ์žˆ๋Š”๋ฐ, ์ด๋Š” ์ถ”๊ฐ€์ ์ธ ๋งŽ์€ ๋ ˆ์ด๋ธ”๋˜์–ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ์…‹์ด ํ•„์š”ํ•˜๋‹ค. ๋งŽ์€ ๋ฆฌ์†Œ์Šค๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค๋Š” ๋ฌธ์ œ์ ๊ณผ ๋”๋ถˆ์–ด ์• ์ดˆ์— ์‚ฌ๋žŒ์€ ๋ช‡ ์•ˆ๋˜๋Š” ์˜ˆ์ œ๋ฅผ ํ†ตํ•ด์„œ๋„ ์ƒˆ๋กœ์šด NLP task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ์ด๋Ÿฌํ•œ ๊ด€์ ์—์„œ GPT-3๋Š” ํŒŒ๋ฆฌ๋ฏธํ„ฐ์˜ ์ˆ˜๋ฅผ 1,750์–ต๊ฐœ๊นŒ์ง€ ๋Š˜๋ฆฌ๊ณ , fine-tuning๊ณผ gradient updates ์—†์ด few-shot demonstrations๋งŒ์„ ํ†ตํ•ด NLP task์—์„œ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ๋‹ค. (์ „๋ถ€ ๋‹ค SOTA๋ฅผ ๋‹ฌ์„ฑํ•œ ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.)
    • translation, question-answering, cloze(missing ๋‹จ์–ด๋ฅผ ์ฑ„์›Œ ๋„ฃ๋Š” ํ…Œ์Šคํฌ) task, unscrambling words, 3-digit arithmetic(์„ธ์ž๋ฆฌ ์‚ฐ์ˆ˜) ๋“ฑ์—์„œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค.
  • ๋˜ํ•œ GPT-3๋Š” ์‚ฌ๋žŒ์ด ๊ตฌ๋ณ„ ๋ถˆ๊ฐ€๋Šฅํ•œ ์ •๋„์˜ ์‹ ๋ฌธ ๊ธฐ์‚ฌ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ๋„ ํ•˜์˜€๋‹ค.
  • ๊ทธ ์™ธ ํ•ด๋‹น ๋ชจ๋ธ๊ณผ ๊ด€๋ จํ•œ ์‚ฌํšŒ์  ์ด์Šˆ์— ๋Œ€ํ•ด์„œ๋„ ๋‹ค๋ฃฌ๋‹ค.

1. Introduction

[Figure: BERT and GPT]

GPT (Generative Pretrained Transformer) uses the Decoder half of the Transformer, while BERT (Bidirectional Encoder Representations from Transformers) uses the Encoder, so the two are like twin siblings. Looking at the actual history of development — ELMo (2018.02) → GPT-1 (2018.05) → BERT (2018.10) → GPT-2 (2019.02) → XLNet (2019.07) → RoBERTa (2019.07) → ALBERT (2019.09) → T5 (2019.10) → ... — they all started from the same Transformer and advanced by repeatedly leapfrogging each other's performance. After this long friendly rivalry, OpenAI, which has consistently pursued the GPT line, released the GPT-3 model in May 2020.

GPT โ†’ GPT-2

  • Layer normalization์ด ๊ฐ๊ฐ์˜ input์˜ sub-block์œผ๋กœ ์˜ฎ๊ฒจ์กŒ์œผ๋ฉฐ, ๋งˆ์ง€๋ง‰ self-attention block์—๋„ ๋”ํ•ด์ง.
  • A modified initialization which accounts for the accumulation on the residual path with model depth is used.
  • residual layer์—์„œ initialization ๊ฐ€์ค‘์น˜๋ฅผ scaling.
  • vocabulary 50,257๊ฐœ๊นŒ์ง€ ์ฆ๊ฐ€.
  • context size๋ฅผ 512์—์„œ 1024๋กœ ์ฆ๊ฐ€. ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ๋ฅผ 512๊ฐœ๋กœ ์ฆ๊ฐ€.

GPT-2 โ†’ GPT-3

  • In the Transformer layers, alternating dense and locally banded sparse attention patterns are used, similar to the Sparse Transformer (see the mask sketch below).
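A rough sketch of the difference between a dense causal attention mask and a locally banded (windowed) one; alternating the two across layers is the idea, but the band width and layer assignment here are illustrative assumptions, not the exact Sparse Transformer configuration.

```python
import torch

def dense_causal_mask(seq_len: int) -> torch.Tensor:
    # Every position attends to all earlier positions (standard causal attention).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def banded_causal_mask(seq_len: int, band: int) -> torch.Tensor:
    # Each position attends only to the previous `band` positions (local window);
    # an illustrative stand-in for the "locally banded" sparse pattern.
    idx = torch.arange(seq_len)
    diff = idx.unsqueeze(1) - idx.unsqueeze(0)  # diff[i, j] = i - j
    return (diff >= 0) & (diff < band)

# Alternate dense and sparse patterns across layers (even layers dense, odd layers
# banded here, purely as an example of "alternating").
masks = [dense_causal_mask(8) if layer % 2 == 0 else banded_causal_mask(8, band=3)
         for layer in range(4)]
```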

์•ž์„  ์—ฌ๋Ÿฌ ์–ธ์–ด ๋ชจ๋ธ๋“ค์€ ์ผ๋ฐ˜์ ์ธ ์–ธ์–ด์— pre-trained ํ•˜๊ณ , ํŠน์ • task์— ๋งž๊ฒŒ fine-tuningํ•˜๊ฒŒ ๋˜๋Š”๋ฐ ์• ์ดˆ์— ์ด fine-tuningํ•˜๋Š” ๊ณผ์ •์ด ๋งŒ๋งŒ์น˜ ์•Š๋‹ค.

  1. fine-tuning์„ ์œ„ํ•ด์„  pre-trained์„ ์ž˜ ํ–ˆ๋‹ค๊ณ ํ•ด๋„, ํ•ด๋‹น task๋ฅผ ์œ„ํ•œ ๋ ˆ์ด๋ธ”๋˜์–ด ์žˆ๋Š” ์–‘์งˆ์˜ ๋ฐ์ดํ„ฐ์…‹์ด ๋˜ ๋‹ค์‹œ ํ•„์š”.
  2. fine-tuning์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐ ๋˜ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ ๊ณผ์ •์ด ํ•„์—ฐ์ .
  3. ํŠน์ • task์— fine-tuning ํ•˜๋Š” ๊ฒƒ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ๋žŒ์ด ์–ธ์–ด๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์•„๋‹˜.
  4. ๊ฑฐ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ์…‹์— pre-trained ํ•œ ํ›„, fine-tuning ํ•˜๋Š” ๊ฒƒ์€ narrow task(ํŠน์ • task์— ๋Œ€ํ•ด์„œ๋งŒ ํ•™์Šต)์—๋งŒ ํ•™์Šตํ•˜๋Š” ๊ฒƒ.

์ฆ‰, pre-trained + fine-tuning ํŒจ๋Ÿฌ๋‹ค์ž„์€ ์—ฌ๋Ÿฌ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค.

[Figure: in-context learning]

  • ์–ธ์–ด ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๋Š” ๋™์•ˆ in-context learning์œผ๋กœ ๋‹ค์–‘ํ•œ ํŒจํ„ด ์ธ์ง€ ๋Šฅ๋ ฅ์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉ

  • ์‚ฌ์น™์—ฐ์‚ฐ, ์˜คํƒ€ ๊ฒ€์ƒ‰, ๋ฒˆ์—ญ ๋“ฑ์˜ ํŒจํ„ด์„ few-shot learning์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Œ

[Figure: efficient use of in-context information]

  • ์œ„์˜ ๊ทธ๋ž˜ํ”„์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋“ฏ์ด, in-context learning์˜ ๊ฒฝ์šฐ ๊ฑฐ๋Œ€ํ•œ ๋ชจ๋ธ(๋งŽ์€ ํŒŒ๋ผ๋ฏธํ„ฐ)์„ ์‚ฌ์šฉํ•  ์ˆ˜ ๋ก ๊ทธ ์„ฑ๋Šฅ์ด ์••๋„์ ์œผ๋กœ ํ–ฅ์ƒ๋จ.

๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ ์†Œ๊ฐœํ•˜๋Š” GPT-3๋Š” ํ•œ๋งˆ๋””๋กœ Transformer์˜ decoder ๋ถ€๋ถ„์„ Autoregressive ๋ฐฉ๋ฒ•์œผ๋กœ few-shot ํ•™์Šตํ•œ fine-tuning ๊ณผ์ •์—†๋Š” ๋‹ค์šฉ๋„ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ ์ด๋‹ค.

์–ธ์–ด ๋ชจ๋ธ์€ fluidity๊ณผ generality๋ฅผ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค.

๋‹ค์‹œ ์ •๋ฆฌํ•˜๋ฉด ํ•ด๋‹น ๋…ผ๋ฌธ์ด ๋‹ค๋ฃจ๋Š” ํ•ต์‹ฌ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. ํŒŒ๋ผ๋ฏธํ„ฐ 1,750์–ต๊ฐœ์˜ Autoregressive(์ „์— ๋‚˜์˜จ ํ† ํฐ๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ค์Œ ํ† ํฐ์„ ์˜ˆ์ธกํ•˜๋Š” ์–ธ์–ด ๋ชจ๋ธ ์ข…๋ฅ˜) ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ in-context learning ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•  ๋•Œ, (24๊ฐœ์˜ NLP dataset๋“ค์„ ์ด์šฉํ•˜๋Š” task์™€ ๊ธฐ์กด์˜ dataset์— ๋“ค์–ด์žˆ์ง€ ์•Š์€ ์ •๋ณด์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด task์— ๋Œ€ํ•ด ํ‰๊ฐ€) 3๊ฐ€์ง€์˜ ํ•™์Šต ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉ
    • "few-shot learning": ๋ช‡ ๊ฐœ์˜ (10๊ฐœ์—์„œ 100๊ฐœ) demonstrations(์„ค๋ช… ์˜ˆ์ œ)๋งŒ ๋ณด์—ฌ์คŒ
    • "one-shot learning": ํ•œ ๊ฐœ์˜ demonstration๋งŒ ๋ณด์—ฌ์คŒ
    • "zero-shot learning": ์–ด๋Š demonstration๋„ ๋ณด์—ฌ์ฃผ์ง€ ์•Š๊ณ , ์ž์—ฐ์–ด์— ๋Œ€ํ•œ instruction๋งŒ ๋ณด์—ฌ์คŒ (์•„์ง ์ •ํ™•ํžˆ ์ดํ•ดํ•˜์ง€ ๋ชปํ•จ.)
  • GPT-3๋ฅผ fine-tuning ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค๋ฃจ์ง€ ์•Š๊ณ  future work์œผ๋กœ ๋‚จ๊ฒจ๋‘”๋‹ค.
  1. zero-shot๊ณผ one-shot์—์„œ promisingํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ๊ณ , few-shot๊ฐ€ ๋ช‡๋ช‡์˜ task์—์„œ๋Š” SOTA๋ฅผ ๋‹ฌ์„ฑ.
    • ex) CoQA: 85.0 F1, TriviaQA: 71.2% ์ •ํ™•๋„ ๋‹ฌ์„ฑ
  2. on-the-fly reasoning (๋ฐ”๋กœ๋ฐ”๋กœ(์—ฌ๋Ÿฌ ๋ฌธ๋งฅ์„ ๋ณด์ง€ ์•Š๊ณ ? ์†๋„?) ์ถ”๋ก ํ•ด์•ผํ•˜๋Š”) task์—์„œ๋„ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ž„.
    • ex) unscrambling words / performing arithmetic / using novel words in a sentence after seeing them defined only once.
  3. ์‚ฌ๋žŒ์ด ๋ถ„๋ณ„ํ•  ์ˆ˜ ์—†๋Š” ๋‰ด์Šค ๊ธฐ์‚ฌ ์ž‘์„ฑ.

  4. ์ด๋Ÿฌํ•œ GPT-3์˜ ๊ฑฐ๋Œ€ํ•œ ํฌ๊ธฐ์—๋„ few-shot learning์ด ์ œ๋Œ€๋กœ๋œ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜์ง€ ๋ชปํ•œ NLP task๊ฐ€ ์žˆ์Œ.
    • ์ถ”๋ก  ๋ฌธ์ œ ANLI ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋…ํ•ด ๋ฌธ์ œ RACE, QuAC ๋ฐ์ดํ„ฐ์…‹
  5. GPT-3์˜ ํ•œ๊ณ„์ .

  7. A method is proposed for measuring the degree of data contamination, i.e. benchmark test data leaking into the Common Crawl training data.

  7. ๋น„๊ต์  ์ ์€ ๋ชจ๋ธ(1.25์–ต์—์„œ 130์–ต๊ฐœ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ)๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜.

  8. ์ด๋Ÿฌํ•œ ๋„“์€ ํ™œ์šฉ์„ฑ์œผ๋กœ ์ธํ•œ bias, fairness, broader societal impact๋ฅผ ๋‹ค๋ฃธ.

2. Approach

๋ณธ ๋…ผ๋ฌธ์˜ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ชจ๋ธ, ๋ฐ์ดํ„ฐ, ํ•™์Šต ๋“ฑ ์—ฌ๋Ÿฌ ๋ฉด์—์„œ GPT-2์™€ ๋น„์Šทํ•˜์ง€๋งŒ, ๋ชจ๋ธ์˜ ํฌ๊ธฐ๋ฅผ ํ™•์žฅํ•˜๊ณ  ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ์™€ ๋‹ค์–‘์„ฑ, ํ•™์Šต ๊ณผ์ •์„ ์ฆ๊ฐ€์‹œ์ผฐ๋‹ค. ๋ณธ ๋‚ด์šฉ์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ํ•ด๋‹น ์šฉ์–ด๋“ค์— ๋Œ€ํ•œ ์ •์˜์™€ ์ฐจ์ด์ ๋“ค์„ ๋‹ค๋ค„๋ณธ๋‹ค.

[Figure: FT / FS / 1S / 0S]

  • Fine-Tuning (FT): typically uses a labeled dataset of thousands to hundreds of thousands of examples. Its biggest advantage is strong performance on many benchmarks. Its weaknesses are that every new task requires another large dataset, and that generalization beyond the fine-tuning distribution can be poor. The model may also end up exploiting spurious features of the training set. The paper therefore does not use FT, in order to demonstrate task-agnostic performance, but fine-tuning GPT-3 remains possible and is in practice a promising direction.

  • Few-Shot (FS): the model is shown only a few demonstrations at inference time as conditioning, and no weight updates are allowed. Whether this should be called "learning" at all when no weights are updated is debatable; the demonstrations only condition the model through its context. The original wording is:

the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed.

  • ์˜์–ด์—์„œ ๋ถˆ์–ด๋กœ ๋ฒˆ์—ญํ•˜๋Š” ์˜ˆ๋ฅผ ๋“ค๋ฉด, $K$๊ฐœ์˜ ์˜์–ด์™€ ๊ทธ์— ๋Œ€์‘ํ•˜๋Š” ๋ถˆ์–ด ์˜ˆ์‹œ๋ฅผ ๋ชจ๋ธ์—๊ฒŒ ๋ณด์—ฌ์ฃผ๊ณ  ๋งˆ์ง€๋ง‰ ํ•˜๋‚˜์˜ ์˜ˆ์‹œ์—์„œ ์˜์–ด๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋ถˆ์–ด๋ฅผ ์˜ˆ์ธกํ•ด๋ณด๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด ๋ถ€๋ถ„๋„ ์œ„์—์„œ์™€ ๊ฐ™์€ ๋งฅ๋ฝ์œผ๋กœ ์™„๋ฒฝํžˆ ์ดํ•ดํ•˜์ง€ ๋ชปํ–ˆ์œผ๋ฉฐ ์›๋ฌธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion.

  • $K$๋Š” ๋ณดํ†ต 10์—์„œ 100์ด๋ฉฐ, ๋ชจ๋ธ์˜ context window์— ๋”ฐ๋ผ ๋‹ค๋ฅด๋‹ค. FS์˜ ์žฅ์ ์€ ๋‹น์—ฐํžˆ ๋งŽ์€ ๋ฐ์ดํ„ฐ์–‘์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์œผ๋ฉฐ, ๋งŽ์ง€๋งŒ ํ˜‘์†Œํ•œ fine-tuning์šฉ ๋ฐ์ดํ„ฐ์…‹์„ ํ•™์Šตํ•˜๋Š” ๊ฐ€๋Šฅ์„ฑ์„ ์ค„์ธ๋‹ค.

  • One-Shot (1S): identical to FS except that only a single demonstration is shown.

  • Zero-Shot (0S): identical to 1S except that no demonstrations are given at all; the model receives only a natural-language instruction describing the task. It is the most convenient and robust setting and minimizes the chance of picking up spurious correlations, but it is also the hardest of the three (some tasks may be difficult even for humans without a single example). The sketch below compares the three prompt formats.
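To make the three settings concrete, here is a minimal sketch of how zero-, one-, and few-shot prompts could be assembled for the English-to-French example above, loosely following the paper's translation figure; the exact instruction text and "=>" formatting are illustrative assumptions.

```python
# Illustrative zero-/one-/few-shot prompt construction for English -> French translation.
instruction = "Translate English to French:"
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
query = "cheese"

def build_prompt(k: int) -> str:
    """k = 0 (zero-shot), 1 (one-shot), or more (few-shot). No weights are updated."""
    lines = [instruction]
    for en, fr in demonstrations[:k]:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")  # the model is expected to complete the French word
    return "\n".join(lines)

print(build_prompt(0))  # zero-shot: task instruction only
print(build_prompt(1))  # one-shot: a single demonstration
print(build_prompt(3))  # few-shot: several demonstrations
```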

2.1 Model and Architectures

[Figure: model sizes and architectures]

์‚ฌ์šฉ๋œ ๋ชจ๋ธ์€ GPT-2๋ž‘ ๋™์ผํ•˜์ง€๋งŒ (modified initialization, pre-normalization, reversible tokenization), Transformer ๋ถ€๋ถ„์—์„œ dense and locally banded sparse attention pattern๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค. (d์•„ํ‚คํ…์ณ)

2.2 Training Dataset

The web-crawled Common Crawl dataset was used, but using it as-is produced rather poor results, so the following three preprocessing steps were applied.

  1. Filter CommonCrawl for quality based on similarity to a range of high-quality reference corpora,
  2. perform fuzzy deduplication at the document level (see the sketch below), and
  3. add known high-quality reference corpora (such as WebText) to the training mix.
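The paper performs fuzzy (near-duplicate) deduplication at the document level; the sketch below illustrates the general idea with character-shingle Jaccard similarity. This is only a toy illustration of fuzzy matching, not the paper's actual pipeline, which operates at Common Crawl scale.

```python
def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles of a whitespace-normalized, lowercased document."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def fuzzy_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not a near-duplicate of an already-kept one.
    O(n^2) pairwise comparison; large-scale pipelines use MinHash/LSH instead."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```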

[Figure: training dataset composition]

2.3 Training Process

2.4 Evaluation

3. Results

GPT์˜ ๊ณ ์งˆ์ ์ธ ๋ฌธ์ œ๋Š” ์–‘๋ฐฉํ–ฅ ์ •๋ณด๋ฅผ ํš๋“ํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์ธ๋ฐ, ์ด ์ ์ด ์„ฑ๋Šฅ ํ‰๊ฐ€์—์„œ ๊ณ ์Šค๋ž€ํžˆ ๋‚˜ํƒ€๋‚ฌ๋‹ค.

4. Measuring and Preventing Memorization Of Benchmarks

5. Limitations

6. Broader Impacts