B4A Library LlamaEngine - Local LLMs

LlamaEngine - Run LLM Models Locally on Android

Run large language models (LLMs) directly on your Android device - no internet, no API keys, no cloud services required!

LlamaEngine wraps llama.cpp into a native B4A library, allowing you to load and run GGUF model files with streaming token generation on-device.

If you like my work, please consider supporting me.

Features
  • GGUF Support: Load any compatible GGUF model (Qwen, Llama, Gemma, Phi, Mistral, etc.)
  • Streaming Generation: Receive events as each token is generated in real-time
  • Chat Templates: Built-in support for ChatML, Llama 3, Gemma, Alpaca, and Raw modes
  • Multi-turn Chat: History management (System, User, Assistant messages)
  • Configurable Sampling: Fine-tune Temperature, Top-P, Top-K, Min-P, and Repeat Penalty
  • Low Memory Mode: Dedicated toggle (LowMemoryMode) for devices with limited RAM (e.g., 2GB or 32-bit devices)
  • Debugging: Enable DebugMode for detailed performance metrics and logging via _DebugLog event
  • Completely Offline: Runs entirely on the device CPU
  • Architecture Support: Supports both modern 64-bit (arm64-v8a) and older 32-bit (armeabi-v7a) Android devices

Requirements
  • Android 7.0+ (API 24)
  • Supported Architectures: arm64-v8a, armeabi-v7a
  • A GGUF model file downloaded to your device storage

Installation
The library is split into a Java wrapper and an AAR containing the compiled C++ binaries.
  1. Download the release archive containing:
    • LlamaEngine.jar
    • LlamaEngine.xml
    • LlamaEngine_Native.aar
  2. Copy all three files to your B4A Additional Libraries folder.
  3. In the B4A IDE: right-click in the Libraries Manager tab and choose Refresh.
  4. Check "LlamaEngine".

Basic Example
B4X:
Sub Globals
    Dim llm As LlamaEngine
End Sub

Sub Activity_Create(FirstTime As Boolean)
    llm.Initialize(Me, "LLM")

    ' Optional: Enable debug logs and low memory optimizations
    llm.DebugMode = True
    ' llm.LowMemoryMode = True ' Uncomment for low RAM devices

    ' Load a GGUF model (adjust path to your model file)
    ' Params: Path, Threads, ContextSize
    llm.LoadModel("/sdcard/Download/Qwen3-0.6B-Q4_K_M.gguf", 4, 512)
End Sub

Sub LLM_LoadProgress(Progress As Float)
    Log("Loading: " & Round(Progress * 100) & "%")
End Sub

Sub LLM_ModelLoaded(Success As Boolean)
    Log("Model loaded: " & Success)
    If Success Then
        llm.AddSystemMessage("You are a helpful assistant.")
        llm.AddUserMessage("What is the capital of France?")
        llm.GenerateChat(256, 0.7) ' Generate max 256 tokens with 0.7 temperature
    End If
End Sub

Sub LLM_TokenGenerated(Token As String)
    ' Each token is streamed as it's generated
    Log(Token)
End Sub

Sub LLM_Complete(Response As String)
    Log("Full response: " & Response)
End Sub

Sub LLM_Error(Message As String)
    Log("Error: " & Message)
End Sub

Sub LLM_DebugLog(Message As String)
    ' Triggered if llm.DebugMode = True
    Log("[LLM DEBUG] " & Message)
End Sub

Chat Templates
Different models use different prompt formats. Set the template before generating:
B4X:
' For Qwen, Yi, Mistral-Instruct (default)
llm.ChatTemplate = "chatml"

' For Llama 3 / 3.1
llm.ChatTemplate = "llama3"

' For Gemma / Gemma 2
llm.ChatTemplate = "gemma"

' For Alpaca-style fine-tuned models
llm.ChatTemplate = "alpaca"

' No template - you build the prompt yourself
llm.ChatTemplate = "raw"

Sampling Parameters
Fine-tune the text generation:
B4X:
llm.TopP = 0.9            ' Nucleus sampling (0.0-1.0)
llm.TopK = 40             ' Top-K sampling (0 = disabled)
llm.RepeatPenalty = 1.1   ' Penalize repetitions (1.0 = off)
llm.MinP = 0.05           ' Min-P filtering (0.0-1.0)

Multi-Turn Conversation
B4X:
' Build a conversation
llm.ClearHistory
llm.AddSystemMessage("You are a cooking assistant.")
llm.AddUserMessage("How do I make pasta?")
llm.GenerateChat(512, 0.7)

' After getting the response, continue the conversation:
Sub LLM_Complete(Response As String)
    llm.AddAssistantMessage(Response)  ' Save the AI's response
    llm.AddUserMessage("What sauce goes best with it?")
    llm.GenerateChat(512, 0.7)         ' Generate next response
End Sub

Raw Prompt Mode
For full control over the prompt format without modifying chat history:
B4X:
llm.GenerateRaw("Once upon a time,", 256, 0.8)
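With the "raw" template you format the prompt yourself. A minimal sketch for a ChatML-style model (the <|im_start|>/<|im_end|> tags follow the ChatML convention used by models like Qwen; they are not part of the library API):

B4X:
llm.ChatTemplate = "raw"
Dim prompt As String
prompt = "<|im_start|>system" & CRLF & "You are a poet.<|im_end|>" & CRLF & _
    "<|im_start|>user" & CRLF & "Write a haiku about rain.<|im_end|>" & CRLF & _
    "<|im_start|>assistant" & CRLF
llm.GenerateRaw(prompt, 128, 0.8)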

API Reference

Methods:
Method | Description
Initialize(CallBack, EventName) | Initialize the engine
LoadModel(Path, Threads, ContextSize) | Load a GGUF model file asynchronously
AddSystemMessage(Text) | Add a system instruction to the chat history
AddUserMessage(Text) | Add a user message to the chat history
AddAssistantMessage(Text) | Add an AI response to the chat history
ClearHistory | Clear all chat messages
GenerateChat(MaxTokens, Temperature) | Generate from the chat history
Generate(SystemPrompt, UserPrompt, MaxTokens, Temp) | Single-turn generation without history
GenerateRaw(Prompt, MaxTokens, Temperature) | Generate from a raw, unformatted prompt
StopGeneration | Safely stop the current generation
Unload | Free the model and its RAM
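A sketch of the single-turn and control methods in use (btnStop is an assumed Button name, not part of the library):

B4X:
' Single-turn question without touching the chat history
llm.Generate("You are a translator.", "Translate 'good morning' to French.", 128, 0.7)

' Stop a running generation, e.g. from a Cancel button
Sub btnStop_Click
    If llm.IsGenerating Then llm.StopGeneration
End Sub

' Free the model and its RAM when you no longer need it
llm.Unload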

Properties:
Property | Type | Default | Description
IsLoaded | Boolean | - | True if a model is currently loaded
IsGenerating | Boolean | - | True while text generation is running
DebugMode | Boolean | False | Enables detailed logging and _DebugLog events
LowMemoryMode | Boolean | False | Reduces compute buffer allocation for low-RAM devices
ChatTemplate | String | "chatml" | Prompt template format
TopP | Float | 0.9 | Nucleus sampling
TopK | Int | 40 | Top-K sampling
RepeatPenalty | Float | 1.1 | Repetition penalty
MinP | Float | 0.05 | Min-P filtering
HistorySize | Int | - | Number of chat messages currently in history
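The status properties are useful as guards before starting a generation. A sketch (txtInput is an assumed EditText, not part of the library):

B4X:
Sub btnAsk_Click
    If llm.IsLoaded = False Then
        Log("Load a model first")
        Return
    End If
    If llm.IsGenerating Then Return ' avoid overlapping generations
    llm.AddUserMessage(txtInput.Text)
    llm.GenerateChat(256, 0.7)
End Sub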

Events:
Event | Description
_ModelLoaded(Success As Boolean) | Model loading finished
_LoadProgress(Progress As Float) | Loading progress (0.0-1.0)
_TokenGenerated(Token As String) | Raised for each token as it is generated
_Complete(Response As String) | The fully combined response string
_Error(Message As String) | An error occurred
_DebugLog(Message As String) | Performance and debug info (requires DebugMode = True)

Recommended Models
You can download .gguf files directly from HuggingFace. Models must be stored on your device storage (e.g. /sdcard/Download/).

Model | Size | Quality | Speed
Qwen3.5-0.8B-Instruct-Q4_K_M | ~600 MB | Great for simple tasks | Very Fast
Qwen3.5-2B-Instruct-Q4_K_M | ~1.3 GB | Excellent daily assistant | Fast
Phi-4-mini-instruct-Q4_K_M | ~2.3 GB | Best for logic & coding | Medium
Gemma-3-1B-It-Q4_K_M | ~800 MB | Good general purpose | Fast

Performance Tips:
  • Start with a tiny model like Qwen 3.5 0.8B or Gemma-3-1B to test compatibility.
  • Context size 512 or 1024 is usually sufficient for mobile scenarios.
  • Use 4 threads on most devices. Going above 4 threads might actually slow down generation.
  • Keep an eye on available RAM. The required RAM is roughly the file size of the GGUF model plus ~200MB overhead.
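The RAM rule of thumb above can be checked in code before loading. A sketch (the path and file name are examples, not requirements):

B4X:
Dim sizeBytes As Long = File.Size("/sdcard/Download", "Qwen3-0.6B-Q4_K_M.gguf")
Dim estimatedMB As Long = sizeBytes / 1048576 + 200 ' model file size + ~200MB overhead
Log("Estimated RAM needed: ~" & estimatedMB & " MB")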

Version History
  • v1.3:
    • Re-architected as a combined Jar + Native AAR setup to support multiple CPU architectures properly (arm64-v8a & armeabi-v7a).
    • Added LowMemoryMode for devices with 2GB RAM or 32-bit systems.
    • Added DebugMode property and _DebugLog event for performance tracking (tokens/sec).
  • v1.1: Added Chat templates, extended sampling parameters, GenerateRaw.
  • v1.0: Initial release.

Attachments

  • LlamaExample.zip
    13.8 KB