HyperWrite
Engineering Blog
So It Turns Out that Davinci Has Always Been Able to Do Word in Context
February 28, 2023
Author: Matt Brockman (mattbrockman@othersideai.com)
Since OpenAI's initial paper on GPT-3, it's been thought that Davinci could not do the NLP task Word in Context, which measures a model's ability to differentiate the sense of a word across multiple contexts. This is an important ability for many editing and writing tools, so it's worth knowing how limited current models are on this task. We ran the benchmark against all of the available Davinci models and found that the original model, as well as each subsequent iteration, is capable of getting at least 60% on the dev set, improving with each iteration. The code and results are available at https://github.com/OthersideAI/WordInContext
Introduction
The Word in Context (WIC) task is a benchmark for measuring the ability of a model to differentiate the sense of a word in multiple contexts. The problem sets look like this:
WIC Example Problem
{
  context-1: The French doors admit onto the yard .
  context-2: He admitted his errors .
  position: 3-1
  pos: V
  target: admit
  label: F
}
The challenge is for the model to evaluate the contexts and determine the final label for the target word. In this example, the model needs to evaluate whether the sense of the word "admit" is similar between the sentences "The French doors admit onto the yard" and "He admitted his errors". The label "F" indicates that the sense is different, while a "T" would indicate that the sense is the same. Although it's been thought that the GPT-3 series models are essentially blind to this task, one of our engineers found during the GPT-3 beta that they could perform it with a bit of a think-through step, so we figured we'd see whether those results still hold up.
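To make the scoring concrete, here's a minimal sketch in Python of how a model's same-sense judgement gets turned into a dev-set accuracy number. The score helper and dictionary fields below just mirror the example above; they're our own illustration, not part of the official WIC tooling.
Dev Set Scoring Sketch (Python)
def score(predictions, gold_labels):
    """Fraction of examples where the model's same-sense judgement (True/False)
    matches the gold label ("T" = same sense, "F" = different sense)."""
    correct = sum(pred == (gold == "T") for pred, gold in zip(predictions, gold_labels))
    return correct / len(gold_labels)

example = {
    "context-1": "The French doors admit onto the yard .",
    "context-2": "He admitted his errors .",
    "target": "admit",
    "label": "F",  # the two senses differ
}

# A model that answers "different sense" (False) for this example is correct.
print(score([False], [example["label"]]))  # 1.0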
A note of warning
One thing we are somewhat concerned about is whether the dev set could have been included in the training data of the updated GPT models. We're mostly interested in finding the best way to prime the models so that we activate the right neurons to perform arbitrary tasks for our users, so we care more about figuring out which prompting method works best with each model. Still, it's worth noting that some of the increase in performance between models might be a bit of inadvertent cheating.
Results
Since there have been four iterations of Davinci so far (the original Davinci, text-davinci-001, text-davinci-002, and text-davinci-003), we ran the WIC benchmark on each of them to see how the models have improved over time and to get a sense of which prompts to use with which model. Each prompt includes 10 examples in the context and has the model produce an intermediary step before completing the task. The three methods were:
  • Textual: We listed the examples and context as lists
  • JSON: We listed the examples and context inside of JSON objects
  • Code: We declared an imaginary function that would take in the contexts and output the intermediary and result as JSON
Running text-davinci-003 with the JSON prompt did best at 69% on the dev set, and we can see fairly consistent improvement over time as OpenAI has improved the model.
Model                 Textual   JSON   Code
Davinci (Original)    0.60      0.57   0.55
text-davinci-001      0.61      0.58   0.52
text-davinci-002      0.39      0.63   0.65
text-davinci-003      0.68      0.69   0.67
We also tested whether code-based zero-shot prompts could do the task (we've been experimenting with pseudo-code prompts), and it turns out text-davinci-003 can reach 61% accuracy with zero examples. The three methods we tested were:
  • Reasoning in Output: The model outputs a JSON object with both the intermediary reasoning and the result
  • Reasoning in Algorithm: The model outputs a JSON object with just the result, but we declare in the function that it uses the intermediary
  • No Reasoning: The model outputs a JSON object with just the result and does not attempt to calculate the intermediary at all
Model                 Reasoning in Output   Reasoning in Algorithm   No Reasoning
Davinci (Original)    0.47                  -                        -
text-davinci-001      0.52                  -                        -
text-davinci-002      0.55                  0.50                     0.50
text-davinci-003      0.61                  0.60                     0.59
We can see that zero-shot code prompts didn't work for the original Davinci or text-davinci-001, but there were gains on text-davinci-002 and text-davinci-003 (which were likely trained on top of Codex). Additionally, there are hints for text-davinci-003 that declaring the reasoning inside the algorithm helps over leaving it out entirely, though actually outputting the reasoning worked best; for text-davinci-002, performance was at chance unless the middle steps were explicitly printed.
The Few Shot Prompts
We approach the Word in Context problem by generating intermediary results to help the model better distinguish the sense of the word in each sentence. While earlier models did better with pure text prompts, as more code went into the underlying models' training, the more structured formats (JSON, actual code) appear to have become the stronger performers. Additionally, we use a separate set of examples for nouns versus verbs. The original Davinci model was capable of producing above-chance outputs using this method, and performance has improved steadily over time.
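As a rough sketch of how these few-shot prompts get assembled: we keep a pool of pre-written noun demonstrations and a pool of verb demonstrations, pick the pool that matches the target word's part of speech, and append the unfinished query block for the model to complete. The helper below is illustrative only; the actual example pools and formatting live in the repo notebooks.
Few Shot Prompt Assembly Sketch (Python)
# Pools of 10 pre-written demonstrations each (abbreviated here; see the notebooks).
NOUN_EXAMPLES = ["Contexts: ...\nTerm: ...\nMeaning: ... They are similar"]
VERB_EXAMPLES = ["Contexts: ...\nTerm: ...\nMeaning: ... They are dissimilar"]

def build_few_shot_prompt(context1, context2, term, pos):
    # Pick the demonstration set that matches the target word's part of speech.
    examples = NOUN_EXAMPLES if pos == "N" else VERB_EXAMPLES
    # Leave the final "Meaning:" open for the model to complete.
    query = f"Contexts: '{context1}'; '{context2}'\nTerm: '{term}'\nMeaning:"
    return "\n\n".join(examples + [query])

print(build_few_shot_prompt(
    "The French doors admit onto the yard .",
    "He admitted his errors .",
    "admit",
    pos="V",
))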
Textual
You can run the notebooks and view our outputs at https://github.com/OthersideAI/WordInContext/tree/main/ReadablePrompt. Our textual prompts are simple prompts that just list the attributes of each example. It's pretty straightforward, and it looks like this:
Textual Prompt Example
Contexts: 'The French doors admit onto the yard .'; 'He admitted his errors .'
Term: 'admit'
Meaning: In the first sentence, 'admit' means to provide passage. In the second, 'admit' means to take responsibility. They are dissimilar

Contexts: 'The company agrees to meet the cost of any repairs .'; 'Does this paper meet the requirements for the degree ?'
Term: 'meet'
Meaning: In the first sentence, 'meet' means to fulfill. In the second, 'meet' means to fulfill. They are similar

...

Contexts: <CONTEXT-1; CONTEXT-2>
Term: <TERM>
Meaning:
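Because the demonstrations always finish with "They are similar" or "They are dissimilar", turning the completion back into a T/F label just means looking at how the model's sentence ends. Here's a minimal parsing sketch; the helper is our own illustration and not necessarily the exact code in the notebooks.
Textual Output Parsing Sketch (Python)
def parse_textual_completion(completion):
    """Map the model's free-text answer to the same-sense judgement."""
    last_word = completion.strip().rstrip(".").split()[-1].lower()
    if last_word == "dissimilar":
        return False  # different senses -> gold label "F"
    if last_word == "similar":
        return True   # same sense -> gold label "T"
    return None       # malformed completion; retry or count as wrong

completion = (
    "In the first sentence, 'admit' means to provide passage. "
    "In the second, 'admit' means to take responsibility. They are dissimilar"
)
print(parse_textual_completion(completion))  # False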
JSON
You can run the notebooks and view our outputs at https://github.com/OthersideAI/WordInContext/tree/main/JSONPrompt. The JSON prompts just take the original textual prompts and stick them inside of a JSON object. The advantage here is that it's easy to check whether the output is correctly formatted, because you can just call json.loads on it.
JSON Prompt Example
{ "Sense_1": "The French doors admit onto the yard .", "Sense_2":"He admitted his errors .", "Term": "admit", "Meaning_1": "In the first sentence, 'admit' means to provide passage.", "Meaning_2": "In the second, 'admit' means to take responsibility.", "Similar": true } { "Sense_1": "The company agrees to meet the cost of any repairs .", "Sense_2": "Does this paper meet the requirements for the degree ?", "Term": "meet", "Meaning_1": "In the first sentence, 'meet' means to fulfill.", "Meaning_2": "In the second, 'meet' means to fulfill.", "Similar": true } ... { "Sense_1": "<CONTEXT-1>", "Sense_2": "<CONTEXT-2>", "Term": "<TERM>", "Meaning_1":
Code
You can run the notebooks and view our outputs at https://github.com/OthersideAI/WordInContext/tree/main/CodePrompt. The code prompts are a variant of the JSON prompts. We declare what the output should look like and then pretend that we're running the prompt in a console.
Code Prompt Example
interface comparison{ "Sense1": str, // the sense of the word in the first context "Sense2": str, // the sense of the word in the second context "areSimilar": bool, // whether the two senses are similar } determineWordSimilarSense(word, context1, context2) : comparison =>{ return ai.compare(word, context1, context2) // return the comparison object } determineWordSimilarSense("The French doors admit onto the yard .", "He admitted his errors .", "admit") >>> { "Sense1": "In the first sentence, 'admit' means to provide passage.", "Sense2": "In the second, 'admit' means to take responsibility.", "Similar": true } determineWordSimilarSense("The company agrees to meet the cost of any repairs .", "Does this paper meet the requirements for the degree ?", "meet") >>> { "Sense1": "In the first sentence, 'meet' means to fulfill.", "Sense2": "In the second, 'meet' means to fulfill.", "Similar": true } ... determineWordSimilarSense(<CONTEXT-1>, <CONTEXT-2>, <TERM>) >>>
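With the code prompts the completion starts right after the final ">>>", so it should begin with the JSON object itself; the model sometimes keeps "running the console" afterwards, so we grab just the first {...} block. Here's a hedged sketch of that extraction (the helper is our own illustration, not necessarily what the notebooks do).
Code Output Parsing Sketch (Python)
import json
import re

def extract_first_json_object(completion):
    """Pull the first {...} block out of the completion and parse it.
    Assumes the object has no nested braces, which holds for this schema."""
    match = re.search(r"\{.*?\}", completion, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

completion = """ { "Sense1": "In the first sentence, 'admit' means to provide passage.",
  "Sense2": "In the second, 'admit' means to take responsibility.",
  "Similar": false }
determineWordSimilarSense("...", "...", "...")"""

print(extract_first_json_object(completion)["Similar"])  # False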
The Zero Shot Prompts
We found that the Codex-based models (text-davinci-002 and text-davinci-003) were both capable of zero-shot Word in Context, which is pretty cool. While few-shot examples appear to be needed for the original Davinci, this seems to be a nifty new development. Here's how it works.
Reasoning in Output
We started by taking the code few-shot prompt and just dropping the examples. This leaves us with an interface and a function that we then call. The interface tells the model that there's going to be a think-through step in the output, and it goes ahead and produces one. It doesn't work for the non-Codex models though, which is unfortunate. You can run the notebooks and view our outputs at https://github.com/OthersideAI/WordInContext/tree/main/CodePrompt0Shot
Zero Shot Reasoning in Output Prompt
interface comparison{ "Sense1": str, // write out the dictionary meaning of the word in the first context "Sense2": str, // write out the dictionary meaning of the word in the second context "Similar": bool, // whether the two meanings of the word are similar or used to mean different things, should be true | false } determineWordSimilarSense(word, context1, context2) : comparison =>{ return ai.compare(word, context1, context2) // return the comparison object } //This returns the keys inside of double quotes ("KEY") so we can parse with JSON determineWordSimilarSense(<CONTEXT-1>, <CONTEXT-2>, <TERM>) >>>
Reasoning in Algorithm
We decided to check what would happen if, instead of actually printing the reasoning step, we just had the model simulate doing it. It's pretty trippy if you think about it, but basically we're trying to activate the parts of the neural network responsible for the reasoning without actually seeing it. This drops text-davinci-002's performance to chance, but text-davinci-003 does only slightly worse than when it actually printed the reasoning. You can run the notebooks and view our outputs at https://github.com/OthersideAI/WordInContext/tree/main/CodePromptConditional
Zero Shot Reasoning in Algorithm
interface comparison{ "Similar": bool, // whether the two meaings of the word are similar or used to mean different things, should be true | false } determineWordSimilarSense(word, context1, context2) : comparison =>{ senseOfFirstWord = ai.computeSense(word, context1) // return the sense of the word in the first context senseOfSecondWord = ai.computeSense(word, context2) // return the sense of the word in the second context return ai.compare(word, senseOfFirstWord, senseOfSecondWord) // return the comparison object } //This returns the keys inside of double quotes ("KEY") so we can parse with JSON determineWordSimilarSense(<CONTEXT-1>, <CONTEXT-2>, <TERM>) >>>
No Reasoning
Finally, we figured we ought to check whether the reasoning step even helped anymore, so we removed it from the code entirely. Text-davinci-002 was still at chance, and text-davinci-003 lost another percentage point. You can run the notebooks and view our outputs at https://github.com/OthersideAI/WordInContext/tree/main/CodePromptYolo
Zero Shot Code Without Reasoning
interface comparison{
  "Similar": bool, // whether the two meanings of the word are similar or used to mean different things, should be true | false
}

determineWordSimilarSense(context1, context2, word) : comparison =>{
  return ai.compare(word, context1, context2) // return the comparison object for whether the sense of the word in both contexts is similar
}

//This returns the keys inside of double quotes ("KEY") so we can parse with JSON
determineWordSimilarSense(<CONTEXT-1>, <CONTEXT-2>, <TERM>)
>>>
What's Next
We're pretty interested in seeing where this goes. As we continue to add capabilities to HyperWrite, we want to improve our ability to perform tasks that involve comparing and contrasting results. We're cautiously optimistic that having the model calculate an intermediary result without actually printing it can help, although the gain was small and there's a lot more to explore there. That said, many of our prompts already make use of this sort of conditional outputting, so it's not unexpected.
It's pretty cool that text-davinci-002 and text-davinci-003 are able to perform better than chance with zero-shot prompts. It's also promising that performance has been increasing with each new model (except for that one dip with text-davinci-002 on the textual prompt). There are likely more creative ways to get the right answer out of these models that can be applied to other problems as well.
We probably need to see how this applies to the other tasks that OpenAI initially thought GPT-3 wouldn't be able to do, such as ANLI, which tests the ability to detect whether statements are contradictory.
Anyway, it's been a fun project. Still a lot out there to find out about how these things work!