
GPT-3: A Disappointing Paper

source link: https://www.greaterwrong.com/posts/ZHrpjDc3CepSeeBuE/gpt-3-a-disappointing-paper

This post is a compilation of two posts I recently made on tumblr.

For context: I have been an enthusiastic user of GPT-2, and have written a lot about it and transformer models more generally. My other writing on this topic includes "human psycholinguists: a critical appraisal" and "the transformer … 'explained?'" See also my tumblr bot, which uses GPT-2 as a core component.

Part 1

argumate said:

@nostalgebraist, give us the goss on how GPT-3 compares with GPT-2!

I haven't read the paper super carefully yet, but I am pretty sure of the following:

1.1: On GPT-3's mundanity

"GPT-3" is just a bigger GPT-2. In other words, it's a straightforward generalization of the "just make the transformers bigger" approach that has been popular across multiple research groups since GPT-2.

This excerpt captures this pretty clearly:

Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. […] One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters [Tur20].

The first two papers mentioned here are the original transformer for machine translation (VSP+17) and BERT (DCLT18). The parameter count doesn't actually increase that much between those two.

The third one (RWC+19) is GPT-2. The parameter count jumps up 5x there. Arguably the point of the GPT-2 paper was "it sounds dumb and too easy, but amazing things happen if you just make a transformer bigger" – and this "GPT-3" paper is making the same point with bigger numbers.

"GPT-3" is a transformer with 175 billion parameters. It's another big jump in the number, but the underlying architecture hasn't changed much.

In one way this is a fair thing to call "GPT-3": it's another step in the new biggening tradition which GPT-2 initiated.

But in another way it's pretty annoying and misleading to call it "GPT-3." GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn't know about that power. Now everyone knows, so it's the furthest thing from a fundamental advance. (As an illustration, consider that their new big model deserves the title "GPT-3" just as much, and just as little, as any of the last 3 big models they mention in that paragraph.)

1.2: On "few-shot learning"

The paper seems very targeted at the NLP community, which I mean in almost a wholly negative way. (Despite being part of the NLP community, I guess.)

The GPT-2 paper argued that language models (text predictors) could do well, or in some cases "at least not terribly," at the specialized tasks used as NLP benchmarks – even without being told anything about those tasks. This was sort of neat, but mostly as a demonstration of the language model's power.

The "zero-shot" learning they demonstrated in the paper – stuff like "adding tl;dr after a text and treating GPT-2's continuation thereafter as a 'summary'" – was weird and goofy and not the way anyone would want to do these things in practice. It was more cool as a demonstration that sufficiently good language models could "do it all," even things they weren't intended for; the point wasn't that they were world-class great at these tasks, the point was the gap between their performance and their low level of preparation. Kinda like a child prodigy.

In the GPT-3 paper, they've introduced a new (…ish? maybe?) way for language models to be good at the standard benchmarks. Now it's about how they can "figure out" what they're supposed to be doing across the course of a text, i.e. instead of prompting the model with one thing like

Q: What is the capital of France?

they instead prompt it with several, like

Q: What is the capital of France?

Q: What is the capital of Spain?

Q: What is the capital of Lithuania?

Q: What is the capital of Brazil?

The NLP-community-relevant point of "GPT-3" is that language models can do much better on the standard benchmarks than we thought, via this kind of multi-prompting and also via even more biggening. Putting those two changes together, you can even beat the state of the art on a few tasks (of many).
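(To make the mechanics concrete for readers who haven't seen this setup: below is a minimal sketch of the difference between single-prompting and multi-prompting, in Python. The `build_prompt` helper and the `complete` function are my own hypothetical stand-ins, not anything from the paper or a real API, and the Q/A formatting is just an illustration rather than the paper's exact template.)

```python
def build_prompt(examples, query):
    """Concatenate K solved examples, then the unsolved query."""
    lines = []
    for question, answer in examples:   # the K task examples in the prompt
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")         # the question we actually care about
    lines.append("A:")                  # the model's continuation is the "answer"
    return "\n".join(lines)

# Single prompt, GPT-2 style (no solved examples):
single = build_prompt([], "What is the capital of Brazil?")

# Multi-prompt, GPT-3 "few-shot" style (K = 3 solved examples before the query):
few_shot = build_prompt(
    [("What is the capital of France?", "Paris"),
     ("What is the capital of Spain?", "Madrid"),
     ("What is the capital of Lithuania?", "Vilnius")],
    "What is the capital of Brazil?",
)

# answer = complete(few_shot)   # hypothetical call to the language model
```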

I can imagine someone viewing this as very important, if they thought it showed an ability in transformer LMs to "pick things up on the fly" in an extremely data-efficient, human-like way. That would be relevant to some of Gary Marcus' concerns.

But the paper seems totally, weirdly uninterested in the "learning on the fly" angle. Their paper has many, many figures graphing performance against parameter count – bigger is better yet again – but I can only find one figure graphing performance against their parameter K, the number of distinct task examples in the prompt (K is 1 and 4 in the two capitals examples).

[It turns out there's another one I missed on my first read – Fig. 1.2 on page 4. I discuss this in Part 2 below.]

And that figure is, uh, not encouraging:

[Figure omitted: dev-set performance plotted against K, the number of task examples in the prompt, with test-set results shown as horizontal lines.]

They do better with one task example than zero (the GPT-2 paper used zero), but otherwise it's a pretty flat line; evidently there is not too much progressive "learning as you go" here.

(Oddly, the caption for this figure explains these are dev set results so not directly comparable to the test set results given as horizontal lines – which doesn't stop them from plotting them! Elsewhere, they do report test set results for SuperGLUE, but only for K=32. Also, I'm not a fan of this plot's lack of error bars.)

1.3: On benchmarks

Instead, their interest is almost completely in how good they can get on the benchmarks in absolute terms.

This is why I say it's aimed at the NLP community: these are the metrics that whole community measures itself against, so in a trivial sense the community "has to" find these results interesting. But by now, this starts to feel like Goodhart's Law.

The reason GPT-2 was so cool wasn't that it did so well on these tasks. It was that it was a really good language model that demonstrated a new overall understanding of language. Coercing it to do well on standard benchmarks was valuable (to me) only as a flamboyant, semi-comedic way of pointing this out, kind of like showing off one's artistic talent by painting (but not painting especially well) with just one's non-dominant hand.

GPT-2 isn't cool because it's good at "question answering," it's cool because it's so good at everything that it makes caring about "question answering" per se feel tiny, irrelevant.

The transformer was such an advance that it made the community create a new benchmark, "SuperGLUE," because the previous gold standard benchmark (GLUE) was now too easy.

GPT-3 is so little of an advance, it doesn't even do that well at SuperGLUE. It just does okay with its dominant hand tied behind its back.

"No, my 10-year-old math prodigy hasn't proven any new theorems, but she can get a perfect score on the math SAT in under 10 minutes. Isn't that groundbreaking?"

Sort of? Not especially?

1.4: On annoyance

The more I think about this paper, the more annoying it is. Transformers are extremely interesting. And this is about the least interesting transformer paper one can imagine in 2020.

Part 2

2.1: On "few-shot learning," again

On my first read, I thought there was only one plot showing how performance varies with K (the number of few-shot samples), but I missed the one very early in the paper, Fig 1.2 on p. 4.

That plot is more impressive than the other one, but doesn't change my impression that the authors are not very interested in showing off "progressive learning" over the course of a text.

The argument they're trying to make with Fig 1.2 is that more progressive learning happens with bigger models, and hence that their overall strategy – "use big models + few-shot learning to get good scores on benchmarks" – benefits from an interaction effect above and beyond the independent effects of its two parts (big models, few-shot learning).

Again, this is interesting if you care about scores on NLP benchmarks, but I have trouble seeing much qualitative significance for overall language understanding.

2.2: On novel words

One of their experiments, "Learning and Using Novel Words," strikes me as more remarkable than most of the others, and the paper's lack of focus on it confuses me. (This is section 3.9.5 and Table 3.16.) The task is closely related to the Wug test – it's the kind of thing Gary Marcus focused on in his critique of GPT-2 – and looks like this:

[Human prompt] To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:

[GPT-3 continuation] One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.

This is the sort of task that developmental linguists study in human children, and which past NLP models have had trouble with. You'd think a success on it would deserve top billing. The authors apparently report a success here, but treat it as an unimportant sideshow: they say they tried it 6 times and got 6 successes (100% accuracy?!), but they apparently didn't consider this important enough to try the same thing on a larger sample, compute a real metric, show variance w/r/t parameters, etc. Meanwhile, they did those things on something like 40 other tasks, mostly far less interesting (to me). Confusing!
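(As an aside: the larger-sample version of this experiment wouldn't be hard to specify. Here's a minimal sketch of what I mean, assuming a hypothetical `complete(prompt)` function that samples from the model and a hypothetical `judge_usage(word, text)` check for whether the continuation uses the nonce word correctly; the nonce items other than "farduddle" are my own inventions, not the paper's.)

```python
import random

# Nonce-word items in the style of the paper's "farduddle" example; only the
# first pair is from the paper, the rest are made up here for illustration.
NONCE_ITEMS = [
    ("farduddle", "to jump up and down really fast"),
    ("screeg", "to fold a map the wrong way on purpose"),
    ("plonfer", "to hum loudly while chewing"),
]

def make_prompt(word, definition):
    return (f'To do a "{word}" means {definition}. '
            f"An example of a sentence that uses the word {word} is:")

def novel_word_accuracy(complete, judge_usage, n_trials=200, seed=0):
    """complete(prompt) -> str and judge_usage(word, text) -> bool are supplied
    by the caller (model sampling and human/scripted judgment, respectively)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        word, definition = rng.choice(NONCE_ITEMS)
        continuation = complete(make_prompt(word, definition))
        successes += bool(judge_usage(word, continuation))
    return successes / n_trials   # a real accuracy estimate, not 6-for-6 anecdata
```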

2.3: On abstract reasoning

In addition to the usual NLP benchmarks, they tried some "synthetic or qualitative" tasks (section 3.9). Their stated goal with these is to clarify the role of the actual learning in "few-shot learning," separating it from mere familiarity with similar-looking text:

One way to probe GPT-3's range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task.

The "synthetic or qualitative" tasks are:

  • various forms of simple arithmetic (like "add two 2-digit numbers")

  • various anagram/reversal/etc. tasks operating on the individual letters of words

  • SAT analogies

This line of work feels insufficiently theorized, and thus hard to interpret.

Consider the arithmetic tasks. Let's grant the authors' premise that the model has not just memorized some lookup table for arithmetic problems – it's really "doing the problems" on the fly. Then, there are 2 things the model could be doing here (probably some of each simultaneously):

  1. It might have developed a real internal model of arithmetic from seeing many related numbers in training texts, and is applying this model to do the problems like you or I would

  2. It might have developed some generic reasoning capability for arbitrary abstract tasks, which can handle arithmetic as a particular case of a much more generic class of problems (e.g. it could also pick up various "fake arithmetics" where +, -, etc. have non-standard meanings, if appropriately prompted)

Insofar as #1 is happening, the multiple prompts of few-shot learning shouldn't matter: if the model knows how real (not fake) arithmetic works because it's seen it in text, then additional examples don't help "locate the task." That is, if it has only learned to do real arithmetic, it shouldn't need to be told "in this task the + symbol has the standard meaning," because its ability depends on that assumption anyway.

So, if we're mostly seeing #1 here, this is not a good demo of few-shot learning the way the authors think it is.

Insofar as #2 is happening, the few-shot prompts do matter: they "locate the meanings" of the symbols in the large space of possible formal systems. But #2 is wild: it would represent a kind of non-linguistic general intelligence ability which would be remarkable to find in a language model.
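(To make the "fake arithmetic" idea concrete, here's a sketch of the kind of prompt I have in mind. It's entirely my own construction, not anything from the paper: in this prompt the "+" symbol secretly means "take the larger of the two numbers," and only a model doing generic on-the-fly pattern induction over the prompt, rather than recalling real arithmetic from training text, could pick that rule up from the examples alone.)

```python
# A "fake arithmetic" few-shot prompt (my construction, not the paper's):
# here "a + b" secretly means max(a, b). A model relying only on real
# arithmetic learned from text (case #1) should answer 91; a model doing
# generic pattern induction over the prompt (case #2) could infer the rule
# and answer 68.

def fake_plus(a, b):
    return max(a, b)                     # the non-standard meaning of "+"

def build_fake_arithmetic_prompt(solved_pairs, query):
    lines = [f"{a} + {b} = {fake_plus(a, b)}" for a, b in solved_pairs]
    lines.append(f"{query[0]} + {query[1]} =")
    return "\n".join(lines)

print(build_fake_arithmetic_prompt(
    solved_pairs=[(3, 7), (12, 5), (40, 41), (9, 9)],
    query=(23, 68),
))
# 3 + 7 = 7
# 12 + 5 = 12
# 40 + 41 = 41
# 9 + 9 = 9
# 23 + 68 =
```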

I really doubt this is what the authors are thinking. If they think language models are fully general reasoners, why not highlight that? The abstract reasoning capacity of transformers has already been more clearly probed without the confounding aspects of natural language, and a priori there are few reasons to think a very large language-specific model should develop strong abilities here (while there are a priori reasons to think the abilities are subtle forms of text recognition/memorization the authors' methodology was not able to detect).

My best guess is that the authors imagine a factorization of the task into "knowing how to do it" and "knowing we are doing it right now." Training on text teaches you how to do (real) arithmetic, and the few-shot prompts tell you "right now we are doing (real) arithmetic, not some other thing you know how to do."

But arithmetic is a really bad choice if you want to probe this! The authors use K=50 here, meaning they give the model 50 correct examples of simple math problems to let it "locate the task." But no one who can do this task should need 50 examples of it.

What information is conveyed by example #50 that wasn't already known by example #49? What are we ruling out here? Trollish formal systems that look like addition 98% of the time? "Addition, except '52' actually means '37' but everything else is the same?" Do we have to rule this out when you should have (and the model must have) a strong prior towards real addition?

I don't know what the authors are trying to do here, and I think they may not know, either.

