B4A Library LlamaEngine - Local LLMs

LlamaEngine - Run LLM Models Locally on Android

Run large language models (LLMs) directly on your Android device - no internet, no API keys, no cloud services required!

LlamaEngine wraps llama.cpp into a native B4A library, allowing you to load and run GGUF model files with streaming token generation on-device.

If you like my work, please consider supporting me.

Features
  • GGUF Support: Load any compatible GGUF model (Qwen, Llama, Gemma, Phi, Mistral, etc.)
  • Streaming Generation: Receive events as each token is generated in real-time
  • Chat Templates: Built-in support for ChatML, Llama 3, Gemma, Alpaca, and Raw modes
  • Multi-turn Chat: History management (System, User, Assistant messages)
  • Configurable Sampling: Fine-tune Temperature, Top-P, Top-K, Min-P, and Repeat Penalty
  • Low Memory Mode: Dedicated toggle (LowMemoryMode) for devices with limited RAM (e.g., 2GB or 32-bit devices)
  • Debugging: Enable DebugMode for detailed performance metrics and logging via _DebugLog event
  • Completely Offline: Runs entirely on the device CPU
  • Architecture Support: Supports both modern 64-bit (arm64-v8a) and older 32-bit (armeabi-v7a) Android devices

Requirements
  • Android 7.0+ (API 24)
  • Supported Architectures: arm64-v8a, armeabi-v7a
  • A GGUF model file downloaded to your device storage

Installation
The library is split into a Java wrapper and an AAR containing the compiled C++ binaries.
  1. Download the release archive containing:
    • LlamaEngine.jar
    • LlamaEngine.xml
    • LlamaEngine_Native.aar
  2. Copy all three files to your B4A Additional Libraries folder.
  3. In the B4A IDE: right-click in the Libraries Manager tab and choose Refresh.
  4. Check "LlamaEngine".

Basic Example
B4X:
Sub Globals
    Dim llm As LlamaEngine
End Sub

Sub Activity_Create(FirstTime As Boolean)
    llm.Initialize(Me, "LLM")

    ' Optional: Enable debug logs and low memory optimizations
    llm.DebugMode = True
    ' llm.LowMemoryMode = True ' Uncomment for low RAM devices

    ' Load a GGUF model (adjust path to your model file)
    ' Params: Path, Threads, ContextSize
    llm.LoadModel("/sdcard/Download/Qwen3-0.6B-Q4_K_M.gguf", 4, 512)
End Sub

Sub LLM_LoadProgress(Progress As Float)
    Log("Loading: " & Round(Progress * 100) & "%")
End Sub

Sub LLM_ModelLoaded(Success As Boolean)
    Log("Model loaded: " & Success)
    If Success Then
        llm.AddSystemMessage("You are a helpful assistant.")
        llm.AddUserMessage("What is the capital of France?")
        llm.GenerateChat(256, 0.7) ' Generate max 256 tokens with 0.7 temperature
    End If
End Sub

Sub LLM_TokenGenerated(Token As String)
    ' Each token is streamed as it's generated
    Log(Token)
End Sub

Sub LLM_Complete(Response As String)
    Log("Full response: " & Response)
End Sub

Sub LLM_Error(Message As String)
    Log("Error: " & Message)
End Sub

Sub LLM_DebugLog(Message As String)
    ' Triggered if llm.DebugMode = True
    Log("[LLM DEBUG] " & Message)
End Sub

Chat Templates
Different models use different prompt formats. Set the template before generating:
B4X:
' For Qwen, Yi, Mistral-Instruct (default)
llm.ChatTemplate = "chatml"

' For Llama 3 / 3.1
llm.ChatTemplate = "llama3"

' For Gemma / Gemma 2
llm.ChatTemplate = "gemma"

' For Alpaca-style fine-tuned models
llm.ChatTemplate = "alpaca"

' No template - you build the prompt yourself
llm.ChatTemplate = "raw"

Sampling Parameters
Fine-tune the text generation:
B4X:
llm.TopP = 0.9            ' Nucleus sampling (0.0-1.0)
llm.TopK = 40             ' Top-K sampling (0 = disabled)
llm.RepeatPenalty = 1.1   ' Penalize repetitions (1.0 = off)
llm.MinP = 0.05           ' Min-P filtering (0.0-1.0)

Multi-Turn Conversation
B4X:
' Build a conversation
llm.ClearHistory
llm.AddSystemMessage("You are a cooking assistant.")
llm.AddUserMessage("How do I make pasta?")
llm.GenerateChat(512, 0.7)

' After getting the response, continue the conversation:
Sub LLM_Complete(Response As String)
    llm.AddAssistantMessage(Response)  ' Save the AI's response
    llm.AddUserMessage("What sauce goes best with it?")
    llm.GenerateChat(512, 0.7)         ' Generate next response
End Sub

Raw Prompt Mode
For full control over the prompt format without modifying chat history:
B4X:
llm.GenerateRaw("Once upon a time,", 256, 0.8)
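With the "raw" template you format the prompt yourself. A minimal sketch for a ChatML-style model (the <|im_start|>/<|im_end|> tags follow the ChatML convention used by models like Qwen; they are not part of the library API):

B4X:
llm.ChatTemplate = "raw"
Dim prompt As String
prompt = "<|im_start|>system" & CRLF & "You are a poet.<|im_end|>" & CRLF & _
    "<|im_start|>user" & CRLF & "Write a haiku about rain.<|im_end|>" & CRLF & _
    "<|im_start|>assistant" & CRLF
llm.GenerateRaw(prompt, 128, 0.8)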

API Reference

Methods:
Method | Description
Initialize(CallBack, EventName) | Initialize the engine
LoadModel(Path, Threads, ContextSize) | Load a GGUF model file asynchronously
AddSystemMessage(Text) | Add a system instruction to the chat history
AddUserMessage(Text) | Add a user message to the chat history
AddAssistantMessage(Text) | Add an AI response to the chat history
ClearHistory | Clear all chat messages
GenerateChat(MaxTokens, Temperature) | Generate from the chat history
Generate(SystemPrompt, UserPrompt, MaxTokens, Temp) | Single-turn generation without history
GenerateRaw(Prompt, MaxTokens, Temperature) | Generate from a raw, unformatted prompt
StopGeneration | Safely stop the current generation
Unload | Free the model and its RAM
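A sketch of the single-turn and control methods in use (btnStop is an assumed Button name, not part of the library):

B4X:
' Single-turn question without touching the chat history
llm.Generate("You are a translator.", "Translate 'good morning' to French.", 128, 0.7)

' Stop a running generation, e.g. from a Cancel button
Sub btnStop_Click
    If llm.IsGenerating Then llm.StopGeneration
End Sub

' Free the model and its RAM when you no longer need it
llm.Unload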

Properties:
Property | Type | Default | Description
IsLoaded | Boolean | - | True if a model is currently loaded
IsGenerating | Boolean | - | True while text generation is running
DebugMode | Boolean | False | Enables detailed logging and _DebugLog events
LowMemoryMode | Boolean | False | Reduces compute buffer allocation for low-RAM devices
ChatTemplate | String | "chatml" | Prompt template format
TopP | Float | 0.9 | Nucleus sampling
TopK | Int | 40 | Top-K sampling
RepeatPenalty | Float | 1.1 | Repetition penalty
MinP | Float | 0.05 | Min-P filtering
HistorySize | Int | - | Number of chat messages currently in history
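The status properties are useful as guards before starting a generation. A sketch (txtInput is an assumed EditText, not part of the library):

B4X:
Sub btnAsk_Click
    If llm.IsLoaded = False Then
        Log("Load a model first")
        Return
    End If
    If llm.IsGenerating Then Return ' avoid overlapping generations
    llm.AddUserMessage(txtInput.Text)
    llm.GenerateChat(256, 0.7)
End Sub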

Events:
Event | Description
_ModelLoaded(Success As Boolean) | Model loading finished
_LoadProgress(Progress As Float) | Loading progress (0.0-1.0)
_TokenGenerated(Token As String) | Raised for each token as it is generated
_Complete(Response As String) | The fully combined response string
_Error(Message As String) | An error occurred
_DebugLog(Message As String) | Performance and debug info (requires DebugMode = True)

Recommended Models
You can download .gguf files directly from HuggingFace. Models must be stored on your device storage (e.g. /sdcard/Download/).

Model | Size | Quality | Speed
Qwen3.5-0.8B-Instruct-Q4_K_M | ~600 MB | Great for simple tasks | Very Fast
Qwen3.5-2B-Instruct-Q4_K_M | ~1.3 GB | Excellent daily assistant | Fast
Phi-4-mini-instruct-Q4_K_M | ~2.3 GB | Best for logic & coding | Medium
Gemma-3-1B-It-Q4_K_M | ~800 MB | Good general purpose | Fast

Performance Tips:
  • Start with a tiny model like Qwen 3.5 0.8B or Gemma-3-1B to test compatibility.
  • Context size 512 or 1024 is usually sufficient for mobile scenarios.
  • Use 4 threads on most devices. Going above 4 threads might actually slow down generation.
  • Keep an eye on available RAM. The required RAM is roughly the file size of the GGUF model plus ~200MB overhead.
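The RAM rule of thumb above can be checked in code before loading. A sketch (the path and file name are examples, not requirements):

B4X:
Dim sizeBytes As Long = File.Size("/sdcard/Download", "Qwen3-0.6B-Q4_K_M.gguf")
Dim estimatedMB As Long = sizeBytes / 1048576 + 200 ' model file size + ~200MB overhead
Log("Estimated RAM needed: ~" & estimatedMB & " MB")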

Version History
  • v1.3:
    • Re-architected as a combined Jar + Native AAR setup to support multiple CPU architectures properly (arm64-v8a & armeabi-v7a).
    • Added LowMemoryMode for devices with 2GB RAM or 32-bit systems.
    • Added DebugMode property and _DebugLog event for performance tracking (tokens/sec).
  • v1.1: Added Chat templates, extended sampling parameters, GenerateRaw.
  • v1.0: Initial release.

Attachments

  • LlamaExample.zip
    13.8 KB