Local "chatGPT-like" ,llama.cpp, need help :)

magicmars

I have managed to easily run a local ChatGPT-like model based on llama.cpp (https://github.com/ggerganov/llama.cpp), after reading this article:

The results and my impressions are very good: on a PC with only 4 GB of RAM it responds at about 4-5 words per second, and the replies are worth it!




If you want to easily try running your own language model locally on your PC (no GPU needed, just the CPU!), you only need:

- 4 GB of RAM
- 4.2 GB of drive space
- Some CPU

You can download the Windows version here:

Link : Download (rar, 3.5 GB)

- Just unrar it and run chat.exe.

The available options are (an example command follows the list):

-i, --interactive run in interactive mode
--interactive-start run in interactive mode and poll user input at startup
-r PROMPT, --reverse-prompt PROMPT In interactive mode, poll user input upon seeing PROMPT
--color colorise output to distinguish prompt and user input from generations
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT Prompt to start generation with (default: random)
-f FNAME, --file FNAME Prompt file to start generation.
-n N, --n_predict N number of tokens to predict (default: 128)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--repeat_last_n N last n tokens to consider for penalize (default: 64)
--repeat_penalty N penalize repeat sequence of tokens (default: 1.3)
-c N, --ctx_size N size of the prompt context (default: 2048)
--temp N temperature (default: 0.1)
-b N, --batch_size N batch size for prompt processing (default: 8)
-m FNAME, --model FNAME Model path (default: ggml-alpaca-7b-q4.bin)

  • temperature (optional): Controls the randomness of the generated text. Higher values produce more diverse results, while lower values produce more deterministic results.
  • top_p (optional): The cumulative probability threshold for token sampling. The model samples only from the smallest set of highest-probability tokens whose cumulative probability reaches this threshold; for example, with top_p 0.9 and token probabilities 0.5, 0.3, 0.15, 0.05, only the first three tokens (cumulative 0.95) are kept.
  • top_k (optional): The number of top tokens to consider when sampling. The model will only consider the top_k highest-probability tokens.
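
For example, assuming the model file ggml-alpaca-7b-q4.bin sits next to the executable (the values below are just a sketch, not a recommendation), an interactive session could be started from a command prompt like this:

chat.exe -m ggml-alpaca-7b-q4.bin -t 4 -n 256 --temp 0.7 --top_k 40 --top_p 0.9 --repeat_penalty 1.3 --color -i

This starts an interactive, colorised session that uses 4 threads and predicts up to 256 tokens per reply.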


Sources :
https://github.com/antimatter15/alpaca.cpp

I've seen someone try to port it to iOS:

Anyone interested in wrapping it for B4X? :D

Android

You can easily run llama.cpp on an Android device with Termux. First, obtain the Android NDK and then build with CMake:

$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make

Install Termux on your device and run termux-setup-storage to get access to your SD card. Finally, copy the llama binary and the model files to your device storage. Here is a demo of an interactive session running on a Pixel 5 phone:
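
Once the binary and a model file are on the device, launching it from a Termux shell might look roughly like this (the folder and file names are only assumptions; adapt them to wherever you copied the files):

$ cd ~/llama
$ chmod +x llama
$ ./llama -m ggml-model-q4_0.bin -t 4 -n 128 --color -i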
 

magicmars

Sorry, I had to remove the download link.

I don't want any problems 😅

PM me for more info.
 