LLM's and the importance of using RAG in the chain

alwaysbusy · Oct 11, 2024

I have been working on a library to use LLMs (such as OpenAI's) in B4J for our company. Using OpenAI through their REST API is pretty simple, but it quickly becomes costly because you pay not only for the output but also for the input. In comes RAG (Retrieval Augmented Generation). It narrows your input data down to the parts relevant to the question you are going to ask, reducing the input tokens significantly. By introducing Query Routers in the RAG, the quality of the reduced input is also much better.
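
To make the retrieval step a bit more concrete, here is a minimal B4J sketch of how a text segment could be scored against a question. It assumes the question and the segments have already been turned into embedding vectors and that relevance is measured with cosine similarity. Both are assumptions for illustration, not a description of what the library does internally, and CosineSimilarity is a hypothetical helper.

```b4x
' Hypothetical helper, not part of ABLLM: cosine similarity of two
' embedding vectors of equal length (1 = same direction, 0 = unrelated).
Sub CosineSimilarity(a() As Double, b() As Double) As Double
    Dim dot As Double
    Dim normA As Double
    Dim normB As Double
    For i = 0 To a.Length - 1
        dot = dot + a(i) * b(i)
        normA = normA + a(i) * a(i)
        normB = normB + b(i) * b(i)
    Next
    ' A zero vector has no direction, so treat it as not similar
    If normA = 0 Or normB = 0 Then Return 0
    Return dot / (Sqrt(normA) * Sqrt(normB))
End Sub
```

Under that assumption, only the few best-scoring segments above a minimum score would be sent along with the question instead of the whole document, which is where the token savings come from.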

We can also add tools written in B4J to the chain. It allows the LLM to call, when necessary, one or more available tools, usually defined by the developer. A tool can be anything: a web search, a call to an external API, or the execution of a specific piece of code, etc. LLMs cannot actually call the tool themselves; instead, they express the intent to call a specific tool in their response (instead of responding in plain text). We, as developers, should then execute this tool with the provided arguments and report back the results of the tool execution.
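
The "execute the tool and report back" step can be sketched in B4J roughly as follows. This is only an illustration of the idea, assuming the LLM response has already been parsed into a tool name and a JSON argument string; RunTool and its parameters are hypothetical names, not the library's API.

```b4x
' Hypothetical dispatcher: toolName and jsonArgs are what the LLM asked for,
' toolOwner is the object (e.g. a class instance) that holds the tool subs.
Sub RunTool(toolOwner As Object, toolName As String, jsonArgs As String) As String
    ' Refuse tools we did not define ourselves
    If SubExists(toolOwner, toolName) = False Then
        Return "Unknown tool: " & toolName
    End If
    ' Call the sub by name and hand its return value back to the chain
    Dim toolResult As String = CallSub2(toolOwner, toolName, jsonArgs)
    Return toolResult
End Sub
```

The returned string is what would be reported back to the LLM so it can finish its answer.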

Tools make the possibilities of LLMs endless.

And it has to be pretty simple to use, my boss said...

~~In the example below we reduced the cost of 2 calls to OpenAI from $0.603065 to only $0.017345, a reduction of approximately 97.12%.~~
EDIT: My boss just pointed out that my calculation was wrong: for 2 separate questions we would actually send the input twice without RAG. So without RAG it would have cost $1.20613 and with RAG still only $0.017345, a cost reduction of approximately 98.56%.
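
For anyone who wants to check these percentages, a small B4J sketch of the arithmetic follows. ReductionPercent and ShowReductions are hypothetical helper names; the dollar amounts are the ones quoted above.

```b4x
' Percentage saved when the cost drops from costBefore to costAfter.
Sub ReductionPercent(costBefore As Double, costAfter As Double) As Double
    Return (1 - costAfter / costBefore) * 100
End Sub

Sub ShowReductions
    ' Input sent once without RAG (the struck-through calculation)
    Log(Round2(ReductionPercent(0.603065, 0.017345), 2)) ' about 97.12
    ' Input sent twice without RAG, once per question (the corrected calculation)
    Log(Round2(ReductionPercent(1.20613, 0.017345), 2))  ' about 98.56
End Sub
```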

An example of the use of our LLM library (I commented out the sources part here for readability of the B4J log, but they contain snippets of the input documents, relevant to the question):

B4X:

Dim question As String
   
Dim LLM As ABLLM
' We are going to use RAG (Retrieval Augmented Generation) in the chain.
' LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time
' that they were trained on. If you want to build AI applications that can reason about private data or data introduced after
' a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
LLM.Initialize(True)

' the OpenAI keys
LLM.OPENAI_APIKEY = "sk-proj-Qus6to4w01WofSs6z0_2g3tN4T7ekRkWYWuS3..."
LLM.OPENAI_ORGANIZATIONID = "org-OeNM4oDm..."

' path where to store the vectorized documents
LLM.STORE_PATH = "K:/ABLLMStore"

' some other parameters can be set (these are the defaults)
LLM.LOG_REQUESTS = False
LLM.LOG_RESPONSES = False

LLM.OPENAI_MODELNAME = "gpt-4o"
LLM.OPENAI_IMAGEMODELNAME = "dall-e-3"
LLM.OPENAI_TEMPERATURE = 0.7
LLM.OPENAI_MAXTOKENS = 512

LLM.RAG_MAX_RESULTS = 3
LLM.RAG_MIN_SCORE = 0.7

LLM.RESPONSE_MAX_MESSAGES = 3
   
LLM.SPLITTER_SEGMENT_SIZE = 300
LLM.SPLITTER_SEGMENT_OVERLAP = 0

Dim NoRAGTokens As Long
   
' adding five documents: a web page, a text file, a PDF, an Excel sheet and a Word document
' the RAG router will only use the segments that are relevant to the question, reducing the input tokens significantly.
' You pay OpenAI for the number of tokens you input and output, so reducing them is important
NoRAGTokens = NoRAGTokens + LLM.AddDocument("https://www.onetwowork.com/usecases", "description of use cases of One Two Work", "Work2020", False) ' the file is 108KB, or about 26.562 tokens.
NoRAGTokens = NoRAGTokens + LLM.AddDocument("K:\TheMatrix.txt", "description of the movie The Matrix", "Matrix", False) ' the file is 108KB, or about 26.562 tokens.
NoRAGTokens = NoRAGTokens + LLM.AddDocument("K:\BANanoEssentialsV1.09.pdf", "description of the programming language BANano for B4J", "BANano", False) ' the file is 390KB, or about 92.971 tokens.
NoRAGTokens = NoRAGTokens + LLM.AddDocument("K:\TeamCup-Ranking.xlsx", "scoring of the Team cup", "TeamCup", False)
NoRAGTokens = NoRAGTokens + LLM.AddDocument("K:\OpenAISources\B4xBasicLanguageV1_2.docx", "description of the B4J language", "B4J",False)

' load the store
NoRAGTokens = LLM.LoadStore

' you can add tools to the chain that do tasks or return extra information to the LLM in the chain.
Dim MyTools As AllMyTools
MyTools.Initialize

' the parameters we want the LLM to return to our function. They are always in JSON format.
Dim ParamDescriptions As Map
ParamDescriptions.Initialize
ParamDescriptions.Put("MovieName", "the name of the movie")
ParamDescriptions.Put("Year", "the year the movie was released if it is known")
ParamDescriptions.Put("BoxOffice", "the box office results if it is known")

' describe what the tool does. If it is relevant to the question, it will be used.
LLM.AddTool(MyTools, "MyMovieTool", ParamDescriptions, "saves some details about the movie in a database")

' start the LLM chain
LLM.Start
Log("Chain started...")

Dim AllTokens As List
AllTokens.Initialize

' describe an image
' question = "What do you see?"
question = "Read barcodes"
Log("QUESTION: " & question)
Log("------------------------------------------------------------------------")
Dim result As ABLLMResult = LLM.QueryImage(question, "K:\barcodes.png", False)
Dim VisionTokens As TokensUsed = LogResult(result, False)
If VisionTokens.Input > 0 Then
    AllTokens.Add(VisionTokens)
End If

' generate an image
question = "Donald Duck in New York, cartoon style"
Log("QUESTION: " & question)
Log("------------------------------------------------------------------------")
Dim result As ABLLMResult = LLM.GenerateImage(question, False)
Dim ImageTokens As TokensUsed = LogResult(result, False)
If ImageTokens.Input > 0 Then
    AllTokens.Add(ImageTokens)
End If
   
question = "Summarize the use cases of One Two Work"
Log("QUESTION: " & question)
Log("------------------------------------------------------------------------")
Dim result As ABLLMResult = LLM.Chat(question, True, False)
Dim Q1Tokens As TokensUsed = LogResult(result, False)
If Q1Tokens.Input > 0 Then
    AllTokens.Add(Q1Tokens)
End If
   
' this will only use an extract of the text segments (by using the RAG) of The Matrix file
' reducing Token cost significantly. Because we ask it to save some details, the MyMovieTool function will be called.
question = "What is the plot of the movie The Matrix and save the details."
'question = "Give Matrix (film) plot and save"
Log("QUESTION: " & question)
Log("------------------------------------------------------------------------")
Dim result As ABLLMResult = LLM.Chat(question, True, False)
Dim Q2Tokens As TokensUsed = LogResult(result, False)
If Q2Tokens.Input > 0 Then
    AllTokens.Add(Q2Tokens)
End If

' this will only use an extract of the text segments (by using the RAG) of the BANano file
' reducing Token cost significantly
question = "What is BANano?"
Log("QUESTION: " & question)
Log("------------------------------------------------------------------------")
Dim result As ABLLMResult = LLM.Chat(question, True, False)
Dim Q3Tokens As TokensUsed = LogResult(result, False)
If Q3Tokens.Input > 0 Then
    AllTokens.Add(Q3Tokens)
End If

Dim TotalINTokens As Long
Dim TotalOUTTokens As Long
Dim NumOfRAGQuestions As Long = AllTokens.Size
For i = 0 To NumOfRAGQuestions - 1
    Dim tmpTokens As TokensUsed = AllTokens.Get(i)
    TotalINTokens = TotalINTokens + tmpTokens.Input
    TotalOUTTokens = TotalOUTTokens + tmpTokens.Output
Next

Dim NoRAG As Long = NoRAGTokens * NumOfRAGQuestions + TotalOUTTokens
Dim WithRAG As Long = TotalINTokens + TotalOUTTokens ' TotalINTokens is already summed over all questions, so it is not multiplied again

Dim PriceNoRAG As Double = (NoRAGTokens * NumOfRAGQuestions * 5.0/1000000.0) + (TotalOUTTokens * 15.0/1000000.0)
Dim PriceWithRAG As Double = (TotalINTokens * 5.0/1000000.0) + (TotalOUTTokens * 15.0/1000000.0)

LogError("Without RAG, asking these " & NumOfRAGQuestions & " questions would've used " & NoRAG & " of our allowed tokens, cost: $" & PriceNoRAG)
LogError("With RAG, asking these " & NumOfRAGQuestions & " questions only used " & WithRAG & " of our allowed tokens, cost: $" & PriceWithRAG)

The code of the LogResult() method:

B4X:

public Sub LogResult(result As ABLLMResult, showSources As Boolean) As TokensUsed
    Dim tokens As TokensUsed
    tokens.Initialize
       
    If result.ResultType = result.RESULT_FAILED Then
        Log("Question (or shortened question) is empty?")
    Else
        If result.InputTokenUsage > 0 Then
            Log("INPUT: " & result.InputTokenUsage & " after RAG")
            tokens.Input = result.InputTokenUsage
            tokens.Output = result.OutputTokenUsage
            Log("OUTPUT: " & result.OutputTokenUsage)
            Log("TOTAL: " & result.TotalTokenUsage)
        End If
        If showSources Then
            'Shows the result of the RAG and which TextSegments were actually sent to OpenAI
            Dim sources As List = result.Sources
            If sources.Size > 0 Then
                Log("SOURCES USED:")
                For i = 0 To sources.Size - 1
                    Log(sources.Get(i))
                Next
            End If
        End If
        Log("FINISHED REASON: " & result.FinishedReason)
        Log("------------------------------------------------------------------------")
        Log("ANSWER:")
        Log(result.Text)
        Log("========================================================================")
    End If
   
    Return tokens
End Sub

The code of MyMovieTool in the AllMyTools class:

B4X:

public Sub MyMovieTool(jsonString As String) As String
    Log("In the MyMovieTool, parameter returned by OpenAI: " & jsonString)
    Log("------------------------------------------------------------------------")
    ' Here you can save it in a database for example
    ' ...
 
    ' Tell the chain we're done so it can continue answering the question
    Return "The details are now saved in the database"
End Sub
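
Inside the tool, the JSON string can be turned into a Map before saving it. The sketch below assumes the Json library is checked in the Libraries Manager and that the argument is a flat JSON object like the one shown in the log further down; ParseMovieDetails is a hypothetical helper name.

```b4x
' Hypothetical helper: converts the JSON argument of the tool into a Map.
' Requires the Json library (JSONParser).
Public Sub ParseMovieDetails(jsonString As String) As Map
    Dim parser As JSONParser
    parser.Initialize(jsonString)
    ' The top level is expected to be a JSON object, so read it as a Map
    Dim details As Map = parser.NextObject
    Return details
End Sub

' Usage sketch inside MyMovieTool:
' Dim details As Map = ParseMovieDetails(jsonString)
' Log(details.GetDefault("MovieName", "unknown"))
' Log(details.GetDefault("Year", "unknown"))
```

GetDefault is used because the parameter descriptions say the year and box office are only returned "if it is known", so a key may be missing.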

The result:

B4X:

https://www.onetwowork.com/usecases skipped. Already found in Work2020
K:\TheMatrix.txt skipped. Already found in Matrix
K:\BANanoEssentialsV1.09.pdf skipped. Already found in BANano
K:\TeamCup-Ranking.xlsx skipped. Already found in TeamCup
K:\OpenAISources\B4xBasicLanguageV1_2.docx skipped. Already found in B4J
Store Work2020 loaded...
Store Matrix loaded...
Store TeamCup loaded...
Store BANano loaded...
Store B4J loaded...
Chain started...
QUESTION: Read barcodes
------------------------------------------------------------------------
FINISHED REASON: N/A
------------------------------------------------------------------------
ANSWER:
The image contains various types of barcodes and corresponding types are labeled above each barcode. Below each barcode, the numbers or data encoded in the barcode are displayed as text. Here are the details:
1. **DataMatrix**: This barcode is a 2D matrix type, and the data it encodes is not visible in text form in this specific barcode.
2. **QR Code**: Another 2D barcode type, which also does not show the data it encodes visibly here.
3. **PDF417**: A type of stacked barcode; the data it encodes is not visible here.
4. **Codabar**: Encodes the numbers "0123456789".
5. **Code11**: Encodes the numbers "0123456789".
6. **Code25Standard**: Encodes the numbers "01234567".
7. **Code25Interleaved**: Encodes the numbers "0123456789".
8. **Code39**: Encodes the numbers "0123456".
9. **Code93**: Encodes the numbers "0123456789".
10. **Code128**: Encodes the numbers "0123456".
11. **Code39Extended**: Encodes the numbers "0123456".
12. **Code93Extended**: Encodes the numbers "0123456789".
Each barcode type is utilized for different applications based on the amount of data they can store and the environment in which they will be used.
========================================================================
QUESTION: Donald Duck in New York, cartoon style
------------------------------------------------------------------------
FINISHED REASON: N/A
------------------------------------------------------------------------
ANSWER:
https://oaidalleapiprodscus.blob.core.windows.net/private/org-OeNM4oDmnnq1QgOo9zdoNhKX/user-7b4mRza9oK9qZB13SyFe891w/img-ybwn8O48Wc3Jqpc0yYW6EF1W.png?st=2024-10-17T08%3A51%3A50Z&se=2024-10-17T10%3A51%3A50Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=d505667d-d6c1-4a0a-bac7-5c84a87759f8&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2024-10-16T23%3A17%3A48Z&ske=2024-10-17T23%3A17%3A48Z&sks=b&skv=2024-08-04&sig=rJ/4tbDHuDvna65VQgVf2CkUNwCkci1wk0BrZ%2BNll54%3D
========================================================================
QUESTION: Summarize the use cases of One Two Work
------------------------------------------------------------------------
INPUT: 791 after RAG
OUTPUT: 67
TOTAL: 858
FINISHED REASON: STOP
------------------------------------------------------------------------
ANSWER:
One Two Work is a time recording application that provides users with valuable insights into the hours spent on various tasks, enabling them to make informed decisions. The platform offers a series of real-life examples demonstrating its practical applications in different scenarios. By accurately tracking time, One Two Work helps users optimize their productivity and manage their time more effectively.
========================================================================
QUESTION: What is the plot of the movie The Matrix and save the details.
------------------------------------------------------------------------
In the MyMovieTool, parameter returned by OpenAI: {"MovieName":"The Matrix","Year":"1999","BoxOffice":"$467.6 million"}
------------------------------------------------------------------------
INPUT: 2418 after RAG
OUTPUT: 168
TOTAL: 2586
FINISHED REASON: STOP
------------------------------------------------------------------------
ANSWER:
**The Matrix** is a 1999 science fiction action film written and directed by the Wachowskis. It stars Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano. The film is set in a dystopian future where humanity is unknowingly trapped inside the Matrix, a simulated reality created by intelligent machines to distract humans while using their bodies as an energy source. The protagonist, Thomas Anderson, also known as "Neo," is a computer programmer who discovers the truth and joins a rebellion against the machines with other people who have been freed from the Matrix.
The details of the movie have been saved in the database.
========================================================================
QUESTION: What is BANano?
------------------------------------------------------------------------
INPUT: 970 after RAG
OUTPUT: 330
TOTAL: 1300
FINISHED REASON: STOP
------------------------------------------------------------------------
ANSWER:
BANano is a set of methods and tools designed to assist in writing web code using B4J. It includes a Transpiler, which is a key component that helps convert B4J code into web-compatible code. BANano provides various objects and features that mimic typical JavaScript functionalities, allowing developers to write web applications in B4J while still having access to JavaScript-like capabilities.
Some key features of BANano include:
1. **Transpiler**: This is used to convert B4J code into web-compatible code. The `AppStart` method is unique as it is not transpiled and runs in the B4J IDE like regular B4J code, allowing developers to set directions for building a web project.
2. **JavaScript and CSS Writing**: BANano allows developers to write raw JavaScript and CSS directly within their B4J code, providing flexibility for custom solutions when needed.
3. **BANano Objects**: These include a variety of JavaScript-like objects such as `BANanoConsole`, `BANanoWindow`, `BANanoHistory`, `BANanoLocation`, `BANanoNavigator`, `BANanoScreen`, and more. These objects offer methods and properties typical of JavaScript, enabling developers to perform actions akin to what they would in a JavaScript environment.
4. **Special Features**: BANano also includes special components like Background Workers, Router, and WebComponent, enhancing its capability to build complex web applications.
Overall, BANano serves as a bridge between B4J and web development, allowing developers to leverage B4J's strengths while still creating robust web applications.
========================================================================
Without RAG, asking these 3 questions would've used 529975 of our allowed tokens, cost: $2.655525
With RAG, asking these 3 questions only used 13102 our allowed tokens, cost: $0.02937
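
As a sanity check on the two summary lines, the totals can be recomputed from the INPUT and OUTPUT figures in the log above. This is a small sketch, assuming the $5 and $15 per million token prices used in the example code; the sub and array names are only for illustration. Note that the plain sum of tokens with RAG comes to 4744, while 13102 corresponds to the summed input counted once per question (4179 * 3 + 565). The reported price of $0.02937 matches the plain sum.

```b4x
Sub CheckTotals
    ' INPUT and OUTPUT values of the three chat questions, copied from the log
    Dim inputs() As Int = Array As Int(791, 2418, 970)
    Dim outputs() As Int = Array As Int(67, 168, 330)
    Dim totalIn As Long
    Dim totalOut As Long
    For i = 0 To inputs.Length - 1
        totalIn = totalIn + inputs(i)
        totalOut = totalOut + outputs(i)
    Next
    Log(totalIn)                ' 4179
    Log(totalOut)               ' 565
    Log(totalIn + totalOut)     ' 4744 tokens actually used with RAG
    Log(totalIn * 3 + totalOut) ' 13102, the token figure shown in the log
    ' Price with RAG: input at $5 and output at $15 per million tokens
    Dim price As Double = totalIn * 5.0 / 1000000.0 + totalOut * 15.0 / 1000000.0
    Log(price)                  ' about 0.02937
End Sub
```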

The input barcodes.png to read the barcodes from:

The output Donald Duck:

AI is a fascinating world with a lot more to explore, but we're getting the hang of it.

Alwaysbusy

hatzisn · Oct 11, 2024

You have the OpenAI API key and organization ID visible. I hope you changed them yourself before posting and did not forget to change or remove them.

alwaysbusy · Oct 11, 2024

OMG @hatzisn, thank you so much for telling me this! I really hope everyone here is honest and that whoever did see them will not use them. Thank you a million times!!!

I disabled the key just to be sure and made a new one.

Daestrum · Oct 11, 2024

I do think we generate a lot of the charges ourselves when we phrase a question as if we were talking to a person.
Your example "QUESTION: What is the plot of the movie The Matrix and save the details" would probably work equally well as "Matrix (film) plot and save".

alwaysbusy · Oct 11, 2024

@Daestrum you are right. Prompt optimization is our next step to look into.

alwaysbusy · Oct 11, 2024

@Daestrum Its answer to 'Matrix (film) plot and save' was:

B4X:

In the MyMovieTool, parameter returned by OpenAI: {"MovieName": "The Matrix", "Year": "1999", "BoxOffice": "$467.2 million"}
-------------------------------
The plot details and information about the movie "The Matrix" have been saved successfully. If you need more information or further assistance, feel free to ask!

Which is not exactly what we want.

But it did it with: 'Give Matrix (film) plot and save':

B4X:

In the MyMovieTool, parameter returned by OpenAI: {"MovieName": "The Matrix", "Year": "1999", "BoxOffice": "$467.2 million"}
-------------------------------
"The Matrix" is a science fiction film released in 1999. The plot revolves around a dystopian future where humanity is unknowingly trapped inside a simulated reality called the Matrix, created by intelligent machines to distract humans while using their bodies as an energy source. Neo, a computer hacker, discovers the truth about the Matrix and joins a group of rebels to fight against the machines and free humanity. The movie is known for its groundbreaking special effects, philosophical themes, and stylized action sequences. It was a box office success, grossing $467.2 million worldwide.

We think OpenNLP might help here to reduce prompts to their essence.

Daestrum · Oct 11, 2024

Fun subject - use an in-house LLM to reduce the questions to their essence for the paid one.

alwaysbusy · Oct 11, 2024

Daestrum said:
Fun subject - use an in-house LLM to reduce the questions to their essence for the paid one.

I think they kind of anticipated that when I did some tests

hatzisn · Oct 11, 2024

Hi, this is incredible work. Thank you very much for sharing this knowledge. I googled around RAG, found some useful resources, and understood most of what you have written.

I do have some questions for you, though:

1) Is the graph in the first message created with PlantUML? If yes, can you provide the code for it in a txt file attached to your response? I am learning it right now.
2) Obviously you feed the initial query to the query transformer (obviously an LLM - ???? - this is a part of the question) so that it understands the message and what the user has asked for. This obviously categorizes the things asked and feeds them to the query router, which feeds the corresponding queries to the corresponding fetchers/retrievers to get the information needed, obviously from an online or local resource like an API, a file or something like that. Is this correct so far? Please, would it be possible for you to answer all the bold parts of this question?
3) The content you retrieve is added to the content aggregator, whose output is possibly a bunch of JSONs (is this correct?), and the content injector probably creates a batch file for OpenAI containing the JSON of the initial question and the aggregator's output, posts this file to OpenAI, and polls every X seconds to check whether the response is ready. Is this correct?

Sorry, I am wrong, I have figured it out...

Daestrum · Oct 11, 2024

CoPilot will actually tell you what elements it looks at in a query (if you ask it).

alwaysbusy · Oct 17, 2024

Lots of new stuff added this weekend:

1. Loading documents from Office (doc, docx, xls, xlsx, ppt, pptx, pub, vis, ...), PDFs, websites
2. Added a store where the vectorized versions of the documents are saved, so the next time the engine is started it does not have to build up the RAG again.
3. Added the OpenAI Vision API for image-to-text (e.g. the Barcode test)
4. Added the OpenAI API to generate images
