Qwen2.5-coder

Daestrum

Expert
Licensed User
Longtime User
Been playing with Qwen for the past week. Quite a fun AI and relatively easy to set up.
I have been using the 3B (3 billion) parameter model, and the code it produces is quite reasonable. (It's not too hot on B4X - it keeps using VB.NET in the output.)

It's odd that, for an AI meant for coding, there is an awful lot of other data in there too.

I have it running under Python, and from a B4J UI app I use WebSockets to send queries and get answers back.

The queries were taking around 78 seconds, but I realised I had not enabled CUDA; once I changed that, they take around 3.5 seconds.

Overall a fun experience. (The reason I use the 3B model is that my GPU only has 8 GB of memory, and the next size up, 7B, forces the GPU to use system memory, which is slow.)
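
If you want to check that PyTorch can actually see the GPU, and how much memory it has before picking a model size, a quick check like this (just a sketch) does the job:

Python:
import torch

# True only if PyTorch was installed with CUDA support and a compatible GPU is present
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"Total GPU memory: {props.total_memory / 1024**3:.1f} GB")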

If anyone wants to try it, I can give you the requirements.txt file to set up Python and/or conda.
 

Daestrum

Expert
Licensed User
Longtime User
This is the entire Python code needed
Python:
import asyncio
import websockets
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import sys

# Define available models
models = {
    "0.5B": {"model_name": "Qwen/Qwen2.5-Coder-0.5B-Instruct"},
    "1.5B": {"model_name": "Qwen/Qwen2.5-Coder-1.5B-Instruct"},
    "3B": {"model_name": "Qwen/Qwen2.5-Coder-3B-Instruct"}
}

# Default model size
default_model_size = "1.5B"

# Set cache directory
cache_dir = "d:/.cache/huggingface/hub"

# Load model function
def load_model(size):
    model_info = models[size]
    model = AutoModelForCausalLM.from_pretrained(model_info["model_name"], cache_dir=cache_dir).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_info["model_name"], cache_dir=cache_dir)
    return model, tokenizer

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Using {device}")

model, tokenizer = load_model(default_model_size)

# WebSocket handler
async def handler(websocket):
    global model, tokenizer, default_model_size

    try:
        async for message in websocket:
            if message.startswith("qwen use"):
                size = message.split()[-1]
                if size in models:
                    default_model_size = size
                    model, tokenizer = load_model(size)
                    response = f"Model switched to {size} parameters. Using {device}"
                else:
                    response = "Unknown model size."
                await websocket.send(response)
            elif message == "qwen shutdown":
                response = "Server is shutting down..."
                await websocket.send(response)
                await websocket.close()
                await asyncio.sleep(1)  # Allow time for messages to be sent
                sys.exit(0)  # Terminate the Python process
            else:
                text = tokenizer.apply_chat_template(
                    [{"role": "user", "content": message}],
                    tokenize=False,
                    add_generation_prompt=True
                )
                model_inputs = tokenizer([text], return_tensors="pt").to(device)
                with torch.no_grad():
                    generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
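                # Note: generated_ids also contains the prompt tokens, so the decoded reply below includes the original question text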
                generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
                response = generated_texts[0]
                await websocket.send(response)
    except websockets.ConnectionClosed as e:
        print(f"Connection closed: {e}")

# Main function to start the server
async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # Run forever

if __name__ == "__main__":
    asyncio.run(main())
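
For testing the server without the B4J app, a throwaway Python client along these lines should work (a sketch - it assumes the server above is already running on localhost:8765):

Python:
import asyncio
import websockets

async def test():
    # Connect to the local server started by the script above
    async with websockets.connect("ws://localhost:8765") as ws:
        await ws.send("Write a function that reverses a string")
        print(await ws.recv())

asyncio.run(test())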

The B4J UI code

B4X:
Sub Process_Globals
    Private fx As JFX
    Private MainForm As Form
    Private xui As XUI
    Private Button1 As B4XView
    Private TextArea1 As B4XView
    Private Button2 As B4XView
    Private ws As WebSocketClient
    Private wsConnected As Boolean = False
    Private TextArea2 As B4XView
End Sub

Sub AppStart (Form1 As Form, Args() As String)
    MainForm = Form1
    MainForm.RootPane.LoadLayout("Layout1")
    MainForm.Show
    TextArea1.As(JavaObject).RunMethod("setWrapText",Array(True))
    ConnectWebSocket
End Sub

Sub ConnectWebSocket
    ws.Initialize("ws")
    ws.Connect("ws://localhost:8765")
End Sub

Sub ws_Connected
    AddOutput("Connected to server")
    wsConnected = True
    'ws.SendText("qwen use 3B")
End Sub

Sub ws_TextMessage (Text As String)
    AddOutput("Received from server: " & Text)
End Sub

Sub Button1_Click
    TextArea1.Text = ""
    Dim message As String = TextArea2.Text
    Log(wsConnected)
    If message.Length > 0 And wsConnected Then
        ws.SendText(message)
        AddOutput("Sent to server: " & message)
        TextArea2.Text = ""
    Else If Not(wsConnected) Then
        AddOutput("Connection is closed. Trying to reconnect...")
        ConnectWebSocket
    End If
End Sub

Sub Button2_Click
    ws.SendText("qwen shutdown")
    ws.Close
    AddOutput("Sent shutdown command and closed WebSocket connection")
    'MainForm.Close
End Sub

Sub AddOutput(msg As String)
    TextArea1.Text = TextArea1.Text & msg & CRLF
End Sub

Sub ws_Disconnected
    AddOutput("Disconnected from server. Attempting to reconnect in 5 seconds...")
    wsConnected = False
    Sleep(5000) ' Wait for 5 seconds before reconnecting
    ConnectWebSocket
End Sub
 

Daestrum

Expert
Licensed User
Longtime User
Slight change to the load_model function (it wasn't clearing the previous model from GPU memory, which caused GPU out-of-memory errors).
Python:
# Load model function
# Note: this version also needs "import gc" added to the imports at the top of the script
def load_model(size):
    global model, tokenizer

    # Clear the previous model from GPU memory before loading the new one
    if 'model' in globals():
        del model

    if 'tokenizer' in globals():
        del tokenizer

    torch.cuda.empty_cache()
    gc.collect()
 
    model_info = models[size]
    model = AutoModelForCausalLM.from_pretrained(model_info["model_name"], cache_dir=cache_dir).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_info["model_name"], cache_dir=cache_dir)
    return model, tokenizer
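
To confirm the old model is actually released when switching, you can compare what PyTorch reports before and after - roughly like this (sketch):

Python:
import torch

def report_gpu_memory(label):
    # Print how much GPU memory PyTorch currently holds
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1024**3
        reserved = torch.cuda.memory_reserved(0) / 1024**3
        print(f"{label}: allocated {allocated:.2f} GB, reserved {reserved:.2f} GB")

report_gpu_memory("Before switch")
load_model("0.5B")  # example: switch to the smallest model
report_gpu_memory("After switch")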

Also attached: requirements.txt (for setting up the environment).
 

Attachments

  • requirements.txt (4.5 KB)

Daestrum

Expert
Licensed User
Longtime User
For best performance you need a CUDA GPU (RTX etc.). It will run on just a CPU, but it will be slow (as I found: 78 seconds compared to 3.5 seconds on the GPU).

Memory - 16GB minimum to be usable.

The instruct model files - don't be over-eager when picking one. (The larger ones (72B) need 160+ GB of GPU memory, only found on server setups.)
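
As a very rough rule of thumb (my own back-of-the-envelope numbers, not official figures), the weights alone need about parameter count x bytes per parameter, plus headroom for activations and the KV cache:

Python:
# Rough estimate of the memory needed just to hold the weights
def weight_memory_gb(params_billions, bytes_per_param=2):  # 2 for fp16/bf16, 4 for fp32
    return params_billions * 1e9 * bytes_per_param / 1024**3

for size in (0.5, 1.5, 3, 7, 72):
    print(f"{size}B: ~{weight_memory_gb(size):.0f} GB of weights")

Which lines up with what I saw: 3B fits comfortably on an 8 GB card, 7B does not.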
 