Introduction

Before we begin, I want to point out that I am not an AI or LLM expert. My cursory knowledge of how these things work is from watching a few YouTube videos, browsing Wikipedia, and attending niche workshops.

The point of this short essay is to give you an ELI5 and ELI20 view of products like Copilot. It may contain a few technical inaccuracies (although I did run the text through Copilot itself to confirm I was generally correct).

Please feel free to leave comments if you have any clarifications or recommendations you’d like to add.

tl;dr

  1. Copilot is a product built by Microsoft that uses the GPT series of Large Language Models (LLMs) from OpenAI.
  2. LLMs are not “trained” on data we own, but rather on a vast amount of information, mostly from the public internet.
  3. Copilot then “grounds” itself on-demand against data we own, such as email, chats, and Office documents.
  4. The data that Copilot can ground against needs to be controlled through the use of “sensitivity labels”; otherwise it may reply with super-secret information.

Large Language Models (LLMs)

It’s important to understand that the Large Language Model (LLM) is the thing that is actually trained; products such as Microsoft Copilot and OpenAI ChatGPT are then built on top of LLMs. The LLM is what lets humans interact with a computer, whether through text or by uploading pictures, videos, or audio. Some common LLMs that you may have heard of are:

  • GPT-3 and GPT-4, used by Microsoft Copilot and OpenAI ChatGPT
  • PaLM and Gemini, used by Google’s Gemini product
  • Grok, used by xAI and Twitter
  • LLaMA, used by Meta/Facebook

Please be aware that any data you own does not train the LLM further. Additionally, each prompt a user makes against the LLM does not further train it – prompts actually execute in their own “context”, so very little stateful knowledge carries over between prompts.

The LLMs are somewhat static; they are retrained externally by Microsoft or OpenAI every few months against public information on the internet. As a result, the LLM is not live, meaning it may not be aware of current events… unless it grounds against them (more on that below).
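To make the statelessness concrete, here’s a toy sketch. This is not a real Copilot or OpenAI API – `fake_llm` is a made-up stand-in – but it shows why a second prompt knows nothing about the first unless the client resends the history itself:

```python
# Toy illustration of statelessness. "fake_llm" is a stand-in, not a real
# model or API: it can only "know" what appears in the current request.
def fake_llm(messages):
    """Pretend model: its entire world is the messages in this one call."""
    seen = " ".join(m["content"] for m in messages)
    if "Alice" in seen:
        return "Your name is Alice."
    return "I have no idea what your name is."

# Call 1: we tell the model a fact.
fake_llm([{"role": "user", "content": "My name is Alice."}])

# Call 2: a brand-new context -- the fact from call 1 is gone.
fake_llm([{"role": "user", "content": "What is my name?"}])
# -> "I have no idea what your name is."

# Chat apps fake continuity by resending the whole history each time:
fake_llm([
    {"role": "user", "content": "My name is Alice."},
    {"role": "user", "content": "What is my name?"},
])
# -> "Your name is Alice."
```

This is also why long conversations eventually “forget” their beginning: the resent history has to fit in the model’s context window.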

Grounding

In the context of an enterprise, any prompts a user performs against Copilot are “grounded” against the data the organization owns. Grounding is a term used to describe how Copilot uses the private data not included in the LLM. In simple terms, think of it as Copilot “wiring itself up” to data sources on-demand, like when Neo learns Kung-Fu. The difference is that in our example, when Neo leaves the Matrix, he would “forget” everything he learned in that session. Remember, Copilot does not retain state from previous prompts.

In an enterprise environment, you would typically be grounding against:

  • Outlook emails
  • Teams chats
  • Word and Excel documents in SharePoint or OneDrive
  • Source code in Git repositories
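Mechanically, grounding works like this sketch: fetch the documents most relevant to the prompt, then stuff them into the context sent to the LLM. The word-overlap ranking below is a toy; real systems use semantic search over an index (such as Microsoft Graph) and respect the user’s permissions:

```python
# Minimal sketch of "grounding" (retrieval-augmented generation).
# The ranking is a toy word-overlap score, purely for illustration.
def retrieve(query, documents, top_k=1):
    """Return the documents sharing the most words with the query."""
    q = set(query.lower().split())
    return sorted(documents, key=lambda d: -len(q & set(d.lower().split())))[:top_k]

def build_grounded_prompt(query, documents):
    """Place the retrieved documents into the prompt sent to the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Team lunch is scheduled for Tuesday in the cafeteria",
    "The Q3 budget review meeting moved to Friday at 2pm",
]
prompt = build_grounded_prompt("when is the q3 budget review meeting", docs)
# The prompt now contains the budget-meeting document, not the lunch one.
```

Note that the retrieved text lives only inside that one prompt’s context – once the response comes back, it’s gone, just like Neo leaving the Matrix.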

To reiterate, the LLM is not retrained by any data owned by the actual enterprise organization; it is queried on-demand. Additionally, any prompts you make against data that only you have access to won’t magically become available to other users.

Governance

That leads us into how grounding is permissioned against private data. By default, Copilot will ground a prompt against whatever that user has access to. That means if you have a file sitting in your OneDrive called SocialSecurityNumbers.xlsx and ask Copilot “what is John Doe’s social security number?”, it will respond with that info!!!

Administrators and users can label documents to exclude them from Copilot grounding. This is done through Sensitivity Labels in SharePoint, OneDrive, and Outlook.
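Conceptually, the labels act as a filter applied before documents ever reach the grounding step. The label names below are made up for illustration – real sensitivity labels are configured in Microsoft Purview, and Copilot’s actual enforcement is more involved than this sketch:

```python
# Hypothetical sketch of label-based exclusion before grounding.
# Label names are invented; real labels come from Microsoft Purview.
EXCLUDED_LABELS = {"Highly Confidential", "Secret"}

documents = [
    {"name": "TeamLunchMenu.docx",         "label": "General"},
    {"name": "SocialSecurityNumbers.xlsx", "label": "Highly Confidential"},
]

def groundable(docs, excluded=EXCLUDED_LABELS):
    """Keep only documents whose label permits grounding against them."""
    return [d["name"] for d in docs if d["label"] not in excluded]

safe = groundable(documents)
# Only TeamLunchMenu.docx survives; the labeled spreadsheet never
# reaches the prompt, so Copilot can't leak its contents.
```

The design point: exclusion happens at retrieval time, not inside the model, so a correctly labeled document is simply never seen by the LLM.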

Quick aside – did you know that Copilot stores a user’s prompts in their Exchange mailbox under a hidden folder? This is so that administrators can leverage Purview’s compliance tooling from a single pane of glass. If you need to place a Legal Hold on a user because you suspect they are prompting Copilot for naughty things, the tools in Purview search this hidden folder. If you work in an industry where you need to retain those prompts for a lengthy amount of time, I strongly recommend you speak to your backup vendor to confirm whether they are backing this hidden folder up.
