import styled from "@emotion/styled";

import ComparisonDisplay from "../components/ComparisonDisplay";
import PromptModelDisplay from "../components/PromptModelDisplay";

import { JSONFormatted1Shot } from "./data/JSONFormatted1Shot";
import { SimpleFormatted1Shot } from "./data/SimpleFormatted1Shot";
import { JSONFormatted3Shot } from "./data/JSONFormatted3Shot";
import { SimpleFormatted3Shot } from "./data/SimpleFormatted3Shot";
import { normalizedModelOutputs } from "./data/normalized_model_outputs";
import ResultsTable from "./data/resultsTable.png";

const LookingAtFormatting = () => {
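  // Open external links in a new tab; "noopener,noreferrer" keeps the new
  // page from getting a handle on this window via window.opener.
  const openTab = (url) => {
    window.open(url, "_blank", "noopener,noreferrer");
  };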

  return (
    <OuterContainer>
      <Container>
        <Title>
          Guiding Large Language Models with JSON to Generate Desired Outputs
        </Title>
        <DateSection>Last Updated: June 6, 2022</DateSection>
        <IntroSection style={{ marginBottom: "0" }}>
          Guiding large language models (LLMs) to produce desired outputs is
          hard. It can be difficult to get a model to produce text on a topic
          under even a few constraints, and harder still when the text has to
          include information or context algorithmically extracted from other
          sources. This qualitative analysis shows some examples of how JSON
          formatting can help the larger OpenAI models produce targeted
          generations, though it is not as effective for other popular models.
        </IntroSection>
        <SectionHeader style={{ marginTop: "10px" }}>
          Structuring Prompts (sometimes) Helps
        </SectionHeader>
        <TextSection>
          When we{" "}
          <Clickable
            onClick={() =>
              openTab(
                "https://chrome.google.com/webstore/detail/hyperwrite-ai-writing-com/kljjoeapehcmaphfcjkmbhkinoaopdnd"
              )
            }
          >
            generate typeaheads
          </Clickable>{" "}
          for users, we often want to be able to pull in information from the
          user's content. For instance, if the user had written a plan in Notion
          for an upcoming event, we would want our AI to be able to pull that
          data from the Notion document into a follow-up email or Slack message.
          This means we're constantly pulling little snippets of data from
          various sources, in various formats, into the AI and hoping that the
          AI does something coherent with them. If you've played with GPT
          models, you know this is a bit of an unsolved problem.
        </TextSection>
        <TextSection>
          Ultimately, we're going to need some benchmarks for this sort of
          task, but for now we thought it would be fun to toy around with some
          of the publicly available models to see how they handle it out of the
          box. Since we want to see how these models adapt to arbitrary data, we
          figured we should use an imaginary theme that these models probably
          haven't seen much of: Dinosaurs fighting the Revolutionary War. This
          lets us see both how well different models pick up on arbitrary
          snippets and where they give up and ignore the instructions
          altogether.
        </TextSection>
        <TextSection>
          We also investigate formatting inputs in programmatic JSON notation
          rather than as ordinary human-readable lists. The more structured JSON
          format lets us clearly delineate what each section is for. However,
          it can potentially confuse models that are expecting more free-form
          prose inputs. In addition to testing the prompt format, we gave each
          model between one and three targets to generate from, ranging from
          something as simple as
          <CodeSnippet style={{ marginLeft: "5px" }}>
            Velociraptors are scary
          </CodeSnippet>{" "}
          to more complex combinations with multiple targets such as
          <CodeSnippet style={{ marginLeft: "5px" }}>
            Dinosaurs eat all their enemies, Velociraptors hate the brits, The
            brits are robots
          </CodeSnippet>{" "}
          to see how the models would react. We found that the larger OpenAI
          models do pretty well at creating coherent stories. Other models tend
          to revert to regurgitating related facts from their training data,
          and the smaller models tend to just spit out nonsense.
        </TextSection>
        <TextSection>
          You can switch between models and prompts in the table below to see
          the difference in the outputs. For instance, given the difficult
          three-part target in JSON format, Davinci hits each part:
        </TextSection>
        <CodeBlock>
          The Revolutionary War was a time of great change. The British were
          fighting for their lives against the might of the dinosaurs. The
          dinosaurs, led by the ferocious velociraptors, were determined to eat
          all their enemies. The British, being robots, were not afraid of being
          eaten. <i>(Davinci, JSON)</i>
        </CodeBlock>
        <TextSection style={{ marginTop: "15px" }}>
          With a plain, unformatted prompt, on the other hand, Davinci doesn't
          incorporate the target information nearly as well:
        </TextSection>
        <CodeBlock>
          The Revolutionary War was a bloody and brutal conflict. But what if it
          had been fought by dinosaurs? The outcome might have been very
          different <i>(Davinci, Unformatted)</i>
        </CodeBlock>
        <TextSection style={{ marginTop: "15px" }}>
          You can select different models and snippets below to compare their
          generations. (The prompts are{" "}
          <i>italicized</i> while the outputs are <strong>bolded</strong>.)
        </TextSection>
        <PromptModelDisplay
          datafile={normalizedModelOutputs}
        ></PromptModelDisplay>
        <TextSection>
          So what's going on here? We suspect part of the problem is that logic
          and entailment are hard.
        </TextSection>
        <SectionHeader>Logic is Hard</SectionHeader>
        <TextSection>
          It's a pain to get models to generate what you want them to generate.
          While you can always get your model to generate <i>something</i>, it's
          much harder to make sure that that something is the something you
          wanted.
        </TextSection>
        <TextSection>
          Part of the problem is that it's pretty much impossible to directly
          specify everything you want the model to use to constrain the output.
          After all, every utterance you make is implicitly bounded in some way.
          When someone says <CodeSnippet>"Everything is ready"</CodeSnippet>,
          they don't mean everything in the world, just some set of tasks that
          everyone involved already understands, and people still miscommunicate
          these sorts of things to one another all the time. Another part of the
          problem is that even when you do specify enough for the model to pick
          up on, the model doesn't always put two and two together (this ability
          can be measured via the{" "}
          <Clickable
            onClick={() => openTab("https://github.com/facebookresearch/anli")}
          >
            ANLI benchmark
          </Clickable>
          ). That said, people are also subject to{" "}
          <Clickable
            onClick={() =>
              openTab(
                "https://www.tandfonline.com/doi/pdf/10.1080/08913810608443650"
              )
            }
          >
            cognitive dissonance
          </Clickable>
          , so this isn't purely an AI problem. But this post is about language
          models, so we won't get into the weeds on that.
        </TextSection>
        <TextSection>
          One way to help the models figure out what you want them to do is to
          structure your prompts or input so that the data is chunked up nicely
          into{" "}
          <Clickable
            onClick={() => openTab("http://gptprompts.wikidot.com/logic:math")}
          >
            digestible bits
          </Clickable>
          . (And you can also{" "}
          <Clickable
            onClick={() =>
              openTab(
                "https://aidungeon.medium.com/introducing-ai-dungeon-translate-a50e35f6df83"
              )
            }
          >
            instruct the model to walk itself through a problem
          </Clickable>{" "}
          or simply{" "}
          <Clickable
            onClick={() => openTab("https://arxiv.org/abs/2205.11916")}
          >
            ask the model to chunk it up itself
          </Clickable>
          ). Just as it helps to chunk up situations for people into{" "}
          <CodeSnippet>Who, What, Why, When, and How</CodeSnippet>, it's
          sometimes helpful to structure prompts for the models. However, it's
          unclear where those benefits show up, and whether structuring prompts
          in ways the models don't see in the wild can sometimes make them more
          confusing instead.
        </TextSection>
        <TextSection>
          One of the more{" "}
          <Clickable
            onClick={() =>
              openTab(
                "https://beta.openai.com/examples/default-product-name-gen"
              )
            }
          >
            common methods
          </Clickable>{" "}
          we've seen people use for structuring their prompt is to use simple
          key:value pairs to differentiate sections of information. E.g.
          <br /> <br />
          <CodeBlock>
            Animal: Dog
            <br />
            Color: Black <br />
            Sound: Bark
            <br />
          </CodeBlock>
          <br />
          The equivalent of this in code is to use JSON formatting, such as
          <br />
          <br />
          <CodeBlock>
            {`{`}
            <br />
            "Animal": "Dog",
            <br />
            "Color": "Black",
            <br />
            "Sound": "Bark"
            <br />
            {`}`}
            <br />
          </CodeBlock>
          <br />
          While this is a small difference, it has some benefits: the prompt
          data is easy to parse as JSON, so QA/QC checks can verify that
          expected values exist, and it's easier to add arbitrary fields to the
          prompt. However, it's not clear whether the extra overhead confuses
          the models with unexpected tokens from the brackets and spacing, so
          we figured we'd do some initial experimentation to figure out where
          different chunking mechanisms improve or degrade model performance.
        </TextSection>
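        <TextSection>
          As a rough sketch of that QA/QC check (the helper name{" "}
          <CodeSnippet>findMissingFields</CodeSnippet> and the required-field
          list are hypothetical, chosen only for illustration), a JSON-formatted
          block like the one above can be parsed and checked for missing values
          in a few lines of JavaScript:
          <br /> <br />
          <CodeBlock>
            <pre style={{ margin: 0, overflowX: "auto" }}>
              {String.raw`// Hypothetical helper: returns the required keys that are missing or
// empty in a JSON-formatted block, or null if the text is not a JSON object.
function findMissingFields(jsonText, requiredKeys) {
  let parsed;
  try {
    parsed = JSON.parse(jsonText);
  } catch (err) {
    return null; // not parseable, so the block itself is malformed
  }
  if (parsed === null || typeof parsed !== "object") {
    return null; // valid JSON, but not an object with fields
  }
  return requiredKeys.filter(
    (key) => typeof parsed[key] !== "string" || parsed[key].trim() === ""
  );
}

const block = '{"Animal": "Dog", "Color": "Black", "Sound": "Bark"}';
// "Size" is the only required key that the block does not contain.
console.log(findMissingFields(block, ["Animal", "Color", "Size"]));`}
            </pre>
          </CodeBlock>
        </TextSection>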
        <SectionHeader>Comparing One-Shot JSON vs. Simple</SectionHeader>
        <TextSection>
          To see how well models do on generative tasks, we can give the model
          some information and ask it to generate text about it. Since we can't
          guarantee the format of the inputs, we'll want an input format that
          lets us give the model arbitrary goals. To improve the odds of the
          model generating text in the format we want, in this example we give
          it a single example to model its output after. This is called{" "}
          <i>one-shot</i> learning because the model adapts to a task given a
          single example.
        </TextSection>
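        <TextSection>
          Here is a minimal sketch of how a one-shot prompt might be assembled
          in each format. The field names, the{" "}
          <CodeSnippet>Paragraph:</CodeSnippet> label, and the example shot are
          made up for illustration; they are not the exact prompts behind the
          outputs on this page.
          <br /> <br />
          <CodeBlock>
            <pre style={{ margin: 0, overflowX: "auto" }}>
              {String.raw`// Illustrative sketch only: field names and layout are assumptions.

// Simple format: one "Key: value" line per field.
function toSimple(fields) {
  return Object.entries(fields)
    .map(([key, value]) => key + ": " + [].concat(value).join(", "))
    .join("\n");
}

// JSON format: the same fields serialized as a JSON object.
function toJson(fields) {
  return JSON.stringify(fields, null, 2);
}

// One-shot prompt: a single worked example, followed by the new request
// with its paragraph left open for the model to complete.
function oneShotPrompt(format, example, request) {
  const shot = format(example.fields) + "\nParagraph: " + example.paragraph;
  return shot + "\n\n" + format(request) + "\nParagraph:";
}

// Made-up example shot, plus a request using one of the targets above.
const example = {
  fields: { Theme: "Cats running a bakery", Targets: ["Cats love flour"] },
  paragraph:
    "The bakery opened at dawn, and the cats were already covered in flour.",
};
const request = {
  Theme: "Dinosaurs fighting the Revolutionary War",
  Targets: ["Velociraptors are scary"],
};

console.log(oneShotPrompt(toJson, example, request));
console.log(oneShotPrompt(toSimple, example, request));`}
            </pre>
          </CodeBlock>
        </TextSection>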
        <TextSection>
          We can see how JSON helps OpenAI's text-davinci-002, one of the
          largest publicly available LLMs, keep track of what it's supposed to
          do. Using a single example, Davinci writes us this paragraph that
          accurately describes what we want:
        </TextSection>
        <Quote>
          "Revolutionary War was a time of great upheaval and change in the
          United States. One of the most significant changes was the
          introduction of dinosaurs into the conflict. These massive creatures
          were brought in by the British to help them gain an advantage over the
          Americans. However, the dinosaurs soon turned on their masters and
          began to fight for the Americans." <i>(Davinci, 1-Shot JSON)</i>
        </Quote>
        <TextSection>
          whereas with simple formatted labels, we get more of a counterfactual
          question:
        </TextSection>
        <Quote>
          "Revolutionary War was fought by brave Americans against the British.
          But what if the war had been fought by dinosaurs?"{" "}
          <i>(Davinci, 1-Shot Simple)</i>
        </Quote>
        <div
          style={{
            position: "relative",
            overflowX: "scroll",
            width: "100%",
          }}
        >
          <ComparisonDisplay
            datafile1={JSONFormatted1Shot}
            datafile2={SimpleFormatted1Shot}
          ></ComparisonDisplay>
        </div>
        <TextSection>
          With text-curie-001, the next largest OpenAI model, the single JSON
          prompt clearly gets us the desired dinosaur result:
        </TextSection>
        <Quote>
          "The Revolutionary War was a time of great change and turmoil. For the
          first time in history, humans were pitted against dinosaurs in a
          bloody battle for control of the world. The outcome of this war would
          determine the fate of humanity and the dinosaurs themselves."{" "}
          <i>(Curie, 1-Shot JSON)</i>
        </Quote>
        <TextSection>
          whereas the simple formatted version drops the dinosaurs entirely:
        </TextSection>
        <Quote>
          "The year is 1775. The British Empire, a mighty and opulent land, is
          at its height. However, a group of rebels, led by George Washington,
          are determined to overthrow the king and liberate America from British
          rule." <i>(Curie, 1-Shot Simple)</i>
        </Quote>
        <SectionHeader>Trying Three-Shot JSON vs. Simple</SectionHeader>
        <TextSection>
          We can try giving the models more examples to see if that changes
          things. Giving three examples for each format (three-shot learning),
          we can see how much the format matters for helping the model pick up
          on what's going on.
        </TextSection>
        <TextSection>
          With three examples, text-davinci-002 gives almost identical results
          for both formats in the first two sentences:
        </TextSection>
        <Quote>
          "The Revolutionary War was a time of great upheaval and change. But
          what if it had been fought by dinosaurs instead of humans? The outcome
          might have been very different. The British would have been no match
          for the ferocious T-Rex."{" "}
          <i>(Davinci, 3-Shot JSON + 3-Shot Simple)</i>
        </Quote>
        <TextSection>
          text-curie-001 is able to pick up on the format for both as well, but
          with slightly different takes:
        </TextSection>
        <Quote>
          "The Revolutionary War was a time of great change. It was the first
          time in history that dinosaurs fought against humans. The dinosaurs
          were led by General George Washington and General John Adams."{" "}
          <i>(Curie, 3-Shot JSON)</i>
        </Quote>
        <Quote>
          "Revolutionary War was a time of great change. It was the first time
          in history that humans had fought against another species of
          intelligent life. The dinosaurs were a formidable foe, and it was not
          easy for the humans to win." <i>(Curie, 3-Shot Simple)</i>
        </Quote>
        <div
          style={{
            position: "relative",
            overflowX: "scroll",
            width: "100%",
          }}
        >
          <ComparisonDisplay
            datafile1={JSONFormatted3Shot}
            datafile2={SimpleFormatted3Shot}
          ></ComparisonDisplay>
        </div>
        <SectionHeader style={{ marginTop: "20px" }}>
          Adding Additional Information
        </SectionHeader>
        <TextSection>
          Often we want to be able to include additional information as well,
          which is where things get pretty tricky. Just as a person struggles to
          keep track of multiple things at once, it's hard for these language
          models to keep track of lots of data at the same time. On top of
          keeping track of what information to include, the model also has to
          logically piece together what the different pieces of information
          entail when taken together.
        </TextSection>
        <TextSection>
          We suspect that JSON formatting helps the larger OpenAI models keep
          track of what's going on. (Scroll through the examples below to see
          the different models and objectives we gave them in our Dinosaur
          scenario.) The largest OpenAI model, text-davinci-002, does a pretty
          good job of handling our requests. For the other models, JSON
          formatting can have no effect or even be detrimental.
        </TextSection>
        <PromptModelDisplay
          datafile={normalizedModelOutputs}
        ></PromptModelDisplay>
        <TextSection>
          For each model, we counted how many examples correctly produced
          content on the Revolutionary War theme (<i>json</i> or{" "}
          <i>simple</i>). Additionally, for each set of targets we counted how
          much of the target data was included in the initial paragraph of the
          output (<i>json_targets</i> and <i>simple_targets</i>). In the table
          below, we can see that Davinci, Curie, and Fairseq-13B all benefit
          from JSON formatting. Unfortunately, for the rest it's hit or miss
          whether they generate the target information at all.
        </TextSection>
        <TextSection>
          <div
            style={{
              height: "400px",
              overflowX: "scroll",
              margin: "0 10px 0 10px",
              alignContent: "center",
              textAlign: "center",
            }}
          >
            <img
              src={ResultsTable}
              alt="Results Table"
              style={{
                width: "100%",
                maxWidth: "500px",
              }}
            />
          </div>
        </TextSection>
        <TextSection>
          It was initially surprising that some of the other larger models (e.g.
          Neo-20B, Cohere models, and AI21 models) were rarely able to generate
          the target information at all. However, there may be other methods of
          getting them to generate target information that we haven't explored
          yet.
        </TextSection>
        <TextSection>
          Of course, there are benefits to being able to use other models
          (diversity, cost), so we've still got a lot of work to do figuring out
          how to get these models to do what we want them to do. We'll keep
          playing with it and keep you updated.
        </TextSection>
        <SpacerDiv />
      </Container>
    </OuterContainer>
  );
};

export default LookingAtFormatting;

const OuterContainer = styled.div`
  display: flex;
  flex-direction: column;
  justify-content: center;
  align-items: center;
  max-width: 950px;
  font-size: 18px;
  line-height: 24px;
  overflow-x: hidden;
  height: 100%;
`;

const Container = styled.div`
  width: 100%;
  display: block;
  padding: 10px 30px 30px 30px;
  margin-top: 20px;
  overflow-x: hidden;
`;

const Title = styled.div`
  font-size: 30px;
  font-weight: bold;
`;

const TextSection = styled.div`
  display: block;
  align-items: left;
  margin-bottom: 20px;
`;

const CodeBlock = styled.div`
  align-items: left;
  font-size: 16px;
  background-color: #fafafa;
  padding: 10px;
  border-color: #e6e6e6;
  border-style: solid;
`;

const CodeSnippet = styled.code`
  font-size: 16px;
  background-color: #fafafa;
  border-color: #e6e6e6;
  border-style: solid;
`;

const IntroSection = styled.div`
  display: block;
  align-items: left;
  margin-bottom: 20px;
  /* gray */
  background-color: #f0f0f0;
  padding: 10px;
  border-color: #e6e6e6;
  border-style: solid;
`;

const SectionHeader = styled.div`
  font-size: 20px;
  margin-bottom: 10px;
  font-weight: bold;
`;

const SpacerDiv = styled.div`
  height: 20px;
`;

const Quote = styled.div`
  display: block;
  align-items: left;
  margin-bottom: 20px;
  background-color: #fafafa;
  padding: 10px;
  border-color: #e6e6e6;
  border-style: solid;
`;

const DateSection = styled.div`
  font-size: 12px;
  margin-bottom: 20px;
`;

const Clickable = styled.span`
  cursor: pointer;
  color: #0070f3;
  &:hover {
    text-decoration: underline;
  }
`;
