Implementing Learnings from the "Mastering LLMs" Course

How I improved an LLM in production using insights from the Mastering LLMs course

Energized by the course

This course is giving me so much energy to write better algorithms. After Bryan Bischof's talk specifically, all I wanted to do was look at my data, write basic evals, and improve an LLM in production the right way.

But writing evaluations is boring; we would rather swap in a new LLM, write a more verbose prompt, and “feel” the results by looking at them. Yet we can write a deterministic evaluation metric that captures the quality of a text output without doing too much.

By luck, it just so happened that I own a website that has been performing LLM predictions for the past year. More on that later. And it just so happened that I was logging those prediction events, just in case I needed them in the future.

Let's see what we can do with them and try to improve an LLM in production.

44,779 prompts in my production system. Quite a lot! Save your production events for future you.

Legacy System: My Old LLM in prod

I own a website that provides coding exercises to students, accompanied by instructional videos. (Don't visit my site if you don't speak French 🙂)

Last year, I decided to have all error messages generated by LLMs. It’s extremely challenging to programmatically generate an error message based on a user’s input, especially when the exercise requires specific syntax.

Generating a specific test for each exercise would require traversing the abstract syntax tree of the user-generated code, performing a tree comparison with the expected solution, handling all the edge cases, and so on.
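
To make that concrete, here is a minimal sketch of the naive version of such a check, using Python's ast module (and assuming Python exercises); it is exactly the kind of overly strict comparison that breaks down in practice:

import ast

def naive_code_check(user_code: str, solution_code: str) -> bool:
    # Compare two snippets by their ASTs: this ignores formatting and comments,
    # but any harmless difference (a renamed variable, an equivalent construct)
    # still makes the check fail
    try:
        user_tree = ast.parse(user_code)
        solution_tree = ast.parse(solution_code)
    except SyntaxError:
        return False
    return ast.dump(user_tree) == ast.dump(solution_tree)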

This problem is not easily solvable with traditional software engineering.

So last year, I decided to migrate everything and use GPT-4. Now, an LLM determines if a student has completed an exercise and if they should move on to the next one.

The system was quite simple:

I used response_format=json, returning only two keys: is_valid and message_to_user.

My prompt was basic. I generally instructed it to send False if something was missing with respect to the code solution and to describe what needed to be changed or added.

I tested this system myself on 5-10 exercises and was quite satisfied with it. I pushed it to production and called it a day. That was a year ago.

In short:

  • Context: expected code, submitted code, instructions for the exercise

  • System content: basic guidance on what to return

2023 LLM prompt system: context + system content + JSON output
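
Put together, the whole thing looked roughly like this (a simplified sketch: the prompt wording and model name here are illustrative, not the exact production code):

import json

from openai import OpenAI

client = OpenAI()

def grade_attempt(expected_code: str, submitted_code: str, instructions: str) -> dict:
    # One generic prompt for every exercise: context + basic guidance + JSON output
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # I used GPT-4; the exact model name here is illustrative
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You grade a student's code. Return a JSON object with two keys: "
                    "is_valid (boolean) and message_to_user (one short sentence in French). "
                    "Set is_valid to false if anything is missing compared to the solution, "
                    "and describe what needs to be changed or added."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Exercise instructions:\n{instructions}\n\n"
                    f"Expected code:\n{expected_code}\n\n"
                    f"Submitted code:\n{submitted_code}"
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)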

From a business perspective, everything has improved: the old correction system was worse and has been replaced by one that fails much less often, there is far less human intervention, and a single prompt is used for all of the exercises on the platform.

But are we still returning bad messages? Are we monitoring them? How do we improve from here? Those questions will eventually need to be asked if we want to improve the product, and the sooner the better. Let's try to apply what we have learned.

Building and Analyzing Evals

One of the main learnings from this talk was that we can build numerous binary evals for any problem or use case.

Building evaluations is specific to each use case. In my case, how can I check the quality of feedback on code?

First, we should make assumptions. What's the difference between is_valid=True feedback and is_valid=False feedback?

Huh... maybe the is_valid=False feedback asks you to change something, while the is_valid=True feedback just praises you? Sounds like a good hypothesis, right? Can we programmatically verify that?

Here enters a new technology.. keywords!

Just kidding... evals can be as simple as keyword matching, as we will see here.

I downloaded all of the LLM output messages, and by reading them, I noticed that certain words frequently appear in the error messages:

# French action words that ask the user to change something
keywords_in_error = [
    "Supprime", "Sépare", "Essayons", "mais", "Assure", "Essaye",
    "Pense", "améliore", "Ajoute", "Remplace", "Retire", "Utilise",
    "Corrige", "Déplace", "attention", "Change", "cependant",
    "Attention", "Enlève", "ajoute", "respecter", "Modifie",
    "Réduis", "Réarrange", "Adopte", "Renomme", "veille"
]

These keywords prompt the user to take an action: delete, split, try, add, respect, rename, reduce, rearrange... you get the idea.

The workflow looks like this:

Workflow of users failing and completing exercises

Code incorrect → change something → code correct → praise → next exercise.

While the success workflow, much simpler, should look like this:

Code correct → praise → next exercise.

All of these keywords, which ask the user to change something, should only be present when is_valid=False, right? Is it true in my case? Yes or no? Sounds like a simple binary classification problem! Let’s build my first eval:

Count the number of attempts having a bad keyword, while showing a success message

I just count the number of attempts that contain a keyword in a success message.
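
Here is roughly what that first eval looks like, reusing the keywords_in_error list from above (a simplified sketch; I assume the logged events are exported and loaded into a pandas DataFrame with is_valid and message_to_user columns):

import pandas as pd

# Hypothetical export of the logged prediction events (file name is illustrative)
attempts = pd.read_csv("llm_attempts.csv")

def has_error_keyword(message: str) -> bool:
    # True if the feedback contains a word that asks the user to change something
    return any(keyword in message for keyword in keywords_in_error)

# Eval #1: a success message should never ask the user to change something
success_messages = attempts.loc[attempts["is_valid"], "message_to_user"]
flagged = success_messages.apply(has_error_keyword)
print(f"{flagged.mean():.0%} of success messages contain an error keyword")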

And what did I learn? 26% of the success messages contained at least one keyword that should only appear in error messages. That was a big surprise... the LLM was not doing what I expected it to do, and I had no way to verify that. But now I have at least one way!

Other binary evals

My brain started seeing things differently. Everything is a binary classification! That was the secret of the Matrix I just discovered.

Then I decided to write more evaluations like this:

  • Did I have more than one sentence in the error (while asking not to)?

  • Did I have variables inside ticks if is_success=True?

  • Did I return an error message without any ticks?

  • If code_start=code_solution, is the code always failing?

  • Did I send a really big error message, more than 250 characters?

  • Is the code even compiling if is_valid=True?

More binary evals: only 10 lines of code
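
For illustration, here is a sketch of what a few of these checks could look like (simplified: each attempt is a dict of the logged fields, the field names are my own, and the compile check assumes Python exercises):

def eval_single_sentence(attempt: dict) -> bool:
    # Error feedback was asked to fit in a single sentence
    return attempt["is_valid"] or attempt["message_to_user"].count(".") <= 1

def eval_message_length(attempt: dict) -> bool:
    # Flag really big error messages (more than 250 characters)
    return len(attempt["message_to_user"]) <= 250

def eval_error_has_ticks(attempt: dict) -> bool:
    # Error feedback should point at code, so it should contain backticks
    return attempt["is_valid"] or "`" in attempt["message_to_user"]

def eval_code_compiles(attempt: dict) -> bool:
    # If the LLM says is_valid=True, the submitted code should at least parse
    if not attempt["is_valid"]:
        return True
    try:
        compile(attempt["submitted_code"], "<submission>", "exec")
        return True
    except SyntaxError:
        return False

Each check returns a boolean per attempt, so the failure rate of an eval is just the share of False values over the logged data.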

For each failing evaluation, I looked at the data again, following this process:

The magic cycle of Eval

I repeated this loop a few times, with the goal of driving one eval's failure rate down without negatively impacting the others. There is probably a better process, but this is good enough for me and the current state of my project.

To save money, I subsampled my data and used only about 1,000 attempts when re-running this eval pipeline. I plan to use more attempts later on, when I am fixing more specific bugs.
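
With pandas, that subsampling is a one-liner on the attempts DataFrame from the earlier sketch (the sample size and seed are arbitrary):

# Keep roughly 1,000 attempts per eval run; a fixed seed keeps runs comparable
eval_sample = attempts.sample(n=1000, random_state=42)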

Current State: Transition to Function Calling + Multiple Feedback Items per Attempt

Some of the evaluations were difficult to minimize. How can an LLM return feedback in one sentence when the student is making too many mistakes? I decided to return a list of feedback items, where each item is only one change to be made to the code.

Additionally, I decided to move to function calling and use the Instructor library for that. My code is now more readable, I can define my schema in code, and I can enforce a specific schema more strictly on my LLM.
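
Here is a rough sketch of what the new schema looks like with Instructor and Pydantic (field names, prompt wording, and model name are simplified assumptions, not the exact production code):

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class ExerciseReview(BaseModel):
    is_valid: bool
    feedbacks: list[str] = Field(
        description="One short sentence per change to make; empty if the code is valid"
    )

client = instructor.from_openai(OpenAI())

def grade_attempt(expected_code: str, submitted_code: str, instructions: str) -> ExerciseReview:
    # Instructor turns the Pydantic model into a function-calling schema
    # and validates the LLM output against it
    return client.chat.completions.create(
        model="gpt-4o",  # model name is illustrative
        response_model=ExerciseReview,
        messages=[
            {"role": "system", "content": "You grade a student's code exercise."},
            {
                "role": "user",
                "content": (
                    f"Exercise instructions:\n{instructions}\n\n"
                    f"Expected code:\n{expected_code}\n\n"
                    f"Submitted code:\n{submitted_code}"
                ),
            },
        ],
    )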

One reason I avoided using function calling before was to make things faster. Function calling uses more tokens than returning a simple JSON because of the syntax structure.

Latency was a big issue about a year ago, and my query time was around 10 seconds, making the UX quite bad. I was trying to save as many tokens as possible.

But now that generating tokens is fast, and negligible relative to an API round trip, I can afford to spend a few more syntax tokens on function calling and predict more feedback items without worrying much about latency. I still monitor it just in case, but it no longer drives my decisions.

New feedback: a list of things to change instead of only one. This is improving my evals!

My Conclusion

The Mastering LLMs course has been an eye-opener for me, stuffed with insights from multiple LLM professionals.

But since this technology is still new and best practices are not widely established, it's nearly impossible to get everything right on the first try. We need to learn from each other's mistakes.

Also, always save the predictions made in your production environment; you never know when you will need them. And make the export as easy as possible! I spent more than an entire day just extracting all of the columns needed to run my evaluations correctly.

And this course was the perfect place for that. Excited to see where it takes me next!