LlamaEngine - Run LLM Models Locally on Android
Run large language models (LLMs) directly on your Android device - no internet, no API keys, no cloud services required!
LlamaEngine wraps llama.cpp into a native B4A library, allowing you to load and run GGUF model files with streaming token generation on-device.
If you like my work, please consider supporting me.
Features
- GGUF Support: Load any compatible GGUF model (Qwen, Llama, Gemma, Phi, Mistral, etc.)
- Streaming Generation: Receive events as each token is generated in real-time
- Chat Templates: Built-in support for ChatML, Llama 3, Gemma, Alpaca, and Raw modes
- Multi-turn Chat: History management (System, User, Assistant messages)
- Configurable Sampling: Fine-tune Temperature, Top-P, Top-K, Min-P, and Repeat Penalty
- Low Memory Mode: Dedicated toggle (LowMemoryMode) for devices with limited RAM (e.g., 2GB or 32-bit devices)
- Debugging: Enable DebugMode for detailed performance metrics and logging via _DebugLog event
- Completely Offline: Runs entirely on the device CPU
- Architecture Support: Supports both modern 64-bit (arm64-v8a) and older 32-bit (armeabi-v7a) Android devices
Requirements
- Android 7.0+ (API 24)
- Supported Architectures: arm64-v8a, armeabi-v7a
- A GGUF model file downloaded to your device storage
Installation
The library is split into a Java wrapper and an AAR containing the compiled C++ binaries.
- Download the release archive containing:
- LlamaEngine.jar
- LlamaEngine.xml
- LlamaEngine_Native.aar
- Copy all three files to your B4A Additional Libraries folder.
- In the B4A IDE, right-click inside the Libraries Manager tab and choose Refresh.
- Check "LlamaEngine".
Basic Example
B4X:
Sub Globals
    Dim llm As LlamaEngine
End Sub

Sub Activity_Create(FirstTime As Boolean)
    llm.Initialize(Me, "LLM")

    ' Optional: Enable debug logs and low memory optimizations
    llm.DebugMode = True
    ' llm.LowMemoryMode = True ' Uncomment for low RAM devices

    ' Load a GGUF model (adjust path to your model file)
    ' Params: Path, Threads, ContextSize
    llm.LoadModel("/sdcard/Download/Qwen3-0.6B-Q4_K_M.gguf", 4, 512)
End Sub

Sub LLM_LoadProgress(Progress As Float)
    Log("Loading: " & Round(Progress * 100) & "%")
End Sub

Sub LLM_ModelLoaded(Success As Boolean)
    Log("Model loaded: " & Success)
    If Success Then
        llm.AddSystemMessage("You are a helpful assistant.")
        llm.AddUserMessage("What is the capital of France?")
        llm.GenerateChat(256, 0.7) ' Generate max 256 tokens with 0.7 temperature
    End If
End Sub

Sub LLM_TokenGenerated(Token As String)
    ' Each token is streamed as it's generated
    Log(Token)
End Sub

Sub LLM_Complete(Response As String)
    Log("Full response: " & Response)
End Sub

Sub LLM_Error(Message As String)
    Log("Error: " & Message)
End Sub

Sub LLM_DebugLog(Message As String)
    ' Triggered if llm.DebugMode = True
    Log("[LLM DEBUG] " & Message)
End Sub
Chat Templates
Different models use different prompt formats. Set the template before generating:
B4X:
' For Qwen, Yi, Mistral-Instruct (default)
llm.ChatTemplate = "chatml"
' For Llama 3 / 3.1
llm.ChatTemplate = "llama3"
' For Gemma / Gemma 2
llm.ChatTemplate = "gemma"
' For Alpaca-style fine-tuned models
llm.ChatTemplate = "alpaca"
' No template - you build the prompt yourself
llm.ChatTemplate = "raw"
Sampling Parameters
Fine-tune the text generation:
B4X:
llm.TopP = 0.9 ' Nucleus sampling (0.0-1.0)
llm.TopK = 40 ' Top-K sampling (0 = disabled)
llm.RepeatPenalty = 1.1 ' Penalize repetitions (1.0 = off)
llm.MinP = 0.05 ' Min-P filtering (0.0-1.0)
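These knobs combine with the Temperature argument passed to GenerateChat or GenerateRaw. As an illustrative starting point for factual, low-randomness answers (the values are common conventions, not library defaults):

```b4x
' Conservative preset for factual answers (illustrative values)
llm.TopP = 0.8           ' keep only the most probable mass
llm.TopK = 20            ' narrow the candidate pool
llm.RepeatPenalty = 1.15 ' discourage loops
llm.GenerateChat(256, 0.3) ' low temperature = less randomness
```

For creative writing, raise Temperature and TopP toward 1.0 instead.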
Multi-Turn Conversation
B4X:
' Build a conversation
llm.ClearHistory
llm.AddSystemMessage("You are a cooking assistant.")
llm.AddUserMessage("How do I make pasta?")
llm.GenerateChat(512, 0.7)

' After getting the response, continue the conversation:
Sub LLM_Complete(Response As String)
    llm.AddAssistantMessage(Response) ' Save the AI's response
    ' In real code, guard this with a flag or counter - otherwise
    ' every completion queues another generation in an endless loop.
    llm.AddUserMessage("What sauce goes best with it?")
    llm.GenerateChat(512, 0.7) ' Generate next response
End Sub
Raw Prompt Mode
For full control over the prompt format without modifying chat history:
B4X:
llm.GenerateRaw("Once upon a time,", 256, 0.8)
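With the template set to "raw", you can also reproduce a model's chat format by hand. A sketch that builds a ChatML-style prompt manually (the `<|im_start|>`/`<|im_end|>` markers follow the standard ChatML convention; adjust them for your model):

```b4x
llm.ChatTemplate = "raw"
Dim sb As StringBuilder
sb.Initialize
sb.Append("<|im_start|>system").Append(CRLF)
sb.Append("You are a helpful assistant.<|im_end|>").Append(CRLF)
sb.Append("<|im_start|>user").Append(CRLF)
sb.Append("What is the capital of France?<|im_end|>").Append(CRLF)
sb.Append("<|im_start|>assistant").Append(CRLF)
llm.GenerateRaw(sb.ToString, 256, 0.7)
```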
API Reference
Methods:
| Method | Description |
|---|---|
| Initialize(CallBack, EventName) | Initialize the engine |
| LoadModel(Path, Threads, ContextSize) | Load a GGUF model file asynchronously |
| AddSystemMessage(Text) | Add system instruction to chat history |
| AddUserMessage(Text) | Add user message to chat history |
| AddAssistantMessage(Text) | Add AI response to chat history |
| ClearHistory | Clear all chat messages |
| GenerateChat(MaxTokens, Temperature) | Generate from chat history |
| Generate(SystemPrompt, UserPrompt, MaxTokens, Temp) | Single-turn generation without history |
| GenerateRaw(Prompt, MaxTokens, Temperature) | Generate from raw unformatted prompt |
| StopGeneration | Stop current generation safely |
| Unload | Free the model and release its RAM |
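StopGeneration and Unload are typically wired to UI and lifecycle events. A minimal sketch, assuming a button named btnStop (the names are illustrative):

```b4x
Sub btnStop_Click
    If llm.IsGenerating Then llm.StopGeneration
End Sub

Sub Activity_Pause(UserClosed As Boolean)
    ' Free the model's RAM when the user leaves the app for good
    If UserClosed And llm.IsLoaded Then llm.Unload
End Sub
```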
Properties:
| Property | Type | Default | Description |
|---|---|---|---|
| IsLoaded | Boolean | - | True if a model is currently loaded |
| IsGenerating | Boolean | - | True if text generation is running |
| DebugMode | Boolean | False | Enables detailed logging and _DebugLog events |
| LowMemoryMode | Boolean | False | Reduces compute buffer allocation for low RAM devices |
| ChatTemplate | String | "chatml" | Prompt template format |
| TopP | Float | 0.9 | Nucleus sampling |
| TopK | Int | 40 | Top-K sampling |
| RepeatPenalty | Float | 1.1 | Repetition penalty |
| MinP | Float | 0.05 | Min-P filtering |
| HistorySize | Int | - | Number of chat messages currently in history |
Events:
| Event | Description |
|---|---|
| _ModelLoaded(Success As Boolean) | Model loading finished |
| _LoadProgress(Progress As Float) | Loading progress (0.0-1.0) |
| _TokenGenerated(Token As String) | Emitted for each token as it's generated |
| _Complete(Response As String) | The fully combined response string |
| _Error(Message As String) | An error occurred |
| _DebugLog(Message As String) | Performance and debug info (requires DebugMode = True) |
Recommended Models
You can download .gguf files directly from HuggingFace. Models must be stored on your device storage (e.g. /sdcard/Download/).
| Model | Size | Quality | Speed | Link |
|---|---|---|---|---|
| Qwen3.5-0.8B-Instruct-Q4_K_M | ~600 MB | Great for simple tasks | Very Fast | Download |
| Qwen3.5-2B-Instruct-Q4_K_M | ~1.3 GB | Excellent daily assistant | Fast | Download |
| Phi-4-mini-instruct-Q4_K_M | ~2.3 GB | Best for logic & coding | Medium | Download |
| Gemma-3-1B-It-Q4_K_M | ~800 MB | Good general purpose | Fast | Download |
Performance Tips:
- Start with a tiny model like Qwen 3.5 0.8B or Gemma-3-1B to test compatibility.
- Context size 512 or 1024 is usually sufficient for mobile scenarios.
- Use 4 threads on most devices. Going above 4 threads might actually slow down generation.
- Keep an eye on available RAM. The required RAM is roughly the file size of the GGUF model plus ~200MB overhead.
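The RAM rule of thumb above can be checked before loading. A sketch that sizes the model file and enables LowMemoryMode when the estimate is tight (the 1.5 GB threshold is an illustrative assumption, not part of the library):

```b4x
Dim ModelDir As String = "/sdcard/Download"
Dim ModelName As String = "Qwen3-0.6B-Q4_K_M.gguf"
' Rough estimate: model file size plus ~200 MB runtime overhead
Dim NeededBytes As Long = File.Size(ModelDir, ModelName) + 200 * 1024 * 1024
If NeededBytes > 1500 * 1024 * 1024 Then
    llm.LowMemoryMode = True ' tight fit: reduce compute buffers
End If
llm.LoadModel(ModelDir & "/" & ModelName, 4, 512)
```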
Version History
- v1.3:
- Re-architected as a combined Jar + Native AAR setup to support multiple CPU architectures properly (arm64-v8a & armeabi-v7a).
- Added LowMemoryMode for devices with 2GB RAM or 32-bit systems.
- Added DebugMode property and _DebugLog event for performance tracking (tokens/sec).
- v1.1: Added Chat templates, extended sampling parameters, GenerateRaw.
- v1.0: Initial release.
Credits
- Core engine: llama.cpp by Georgi Gerganov
Download
Filebin | oq0mgcpl4vl0328b