LASTBOX
offline survival assistant in a pelican case.
A Raspberry Pi 5, a LoRa 868 MHz radio HAT, a camera and a fine-tuned Gemma 4 E2B, trained with Unsloth on a single GB10 and served by llama.cpp on an ARM CPU. Answers survival questions, identifies plants and wounds from its lens, relays terse messages over a Meshtastic mesh — no internet, no cloud, no phone-home. When the network is the first thing to fail, you still have the box.
WHAT IT IS
LastBox is what you reach for when the grid stops working. The hardware is deliberately boring — a Raspberry Pi 5 with 8 GB of RAM in a sealed Pelican case, a 12 V LiFePO₄ battery, a camera module, a LoRa radio HAT — but the software does three things that are normally cloud features:
Survival Q&A
Touch or type a question. A fine-tuned Gemma 4 E2B replies in 1–2 sentences with a numbered procedure if appropriate. Hard byte caps so every reply fits.
FIRST AID · BUSHCRAFT · NAVIGATION · POWER · HAZARDS
Optical triage
Aim the camera at a plant, a wound or a piece of gear. The same model answers via a 940 MB SigLIP vision encoder. Defaults to conservative replies — "unknown plant, do not eat" beats a wrong identification.
SIGLIP MMPROJ · CC-LICENSED EVAL IMAGES
Mesh radio relay
Reply payloads are hard-capped at 150 bytes UTF-8 so they fit in a single LoRa packet at legal duty cycle. The box becomes a thinking router in a Meshtastic mesh of handhelds.
868 MHz · MESHTASTIC USB OR SX1262 SPI
ARCHITECTURE
┌──────────────── Touchscreen / Web UI / LoRa packet in ────────────────┐
│                                                                       │
│  webapp/server.py  ──or──  demo.py orchestrator                       │
│         │                           │                                 │
│         ▼ HTTP /v1/chat/completions ▼                                 │
│  ┌──────────────────────────────────────────────────────┐             │
│  │ llama-server (Docker on RPi 5)                       │             │
│  │ ghcr.io/ggml-org/llama.cpp:server                    │             │
│  │ ────────────────────────────────                     │             │
│  │ -m lastbox-gemma4-e2b-q4_k_m.gguf           (3.4 GB) │             │
│  │ --mmproj mmproj-F16.gguf                    (940 MB) │             │
│  │ --threads 4 --ctx 2048 --parallel 1                  │             │
│  │ port 11436 → 8080                                    │             │
│  └──────────────────────────────────────────────────────┘             │
│                             │                                         │
│                             ▼ text + optional <tool_call> blocks      │
│  Tool dispatcher (in demo.py):                                        │
│    - search_knowledge  → local SQLite / dict of survival manuals     │
│    - capture_image     → RPi camera + multimodal model                │
│    - analyze_signal    → LoRa HAT RSSI / SNR stats                    │
│    - send_lora_message → Meshtastic firmware via serial               │
│    - get_system_status → psutil + /sys/class/thermal                  │
│    - listen_lora       → channel scan w/ pattern filter               │
│    - update_memory     → atomic toml write                            │
│                                                                       │
│  Tool result → injected as next user turn → final answer              │
└───────────────────────────────────────────────────────────────────────┘
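The dispatch step at the bottom of the diagram can be sketched as follows. This is only an illustration: it assumes each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys, which the page does not confirm, and the names `extract_tool_calls`, `dispatch` and the handler registry are hypothetical rather than taken from demo.py.

```python
import json
import re

# Assumed payload shape: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return (name, arguments) pairs for every well-formed tool call in text."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: skip instead of crashing the orchestrator
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            calls.append((obj["name"], obj.get("arguments") or {}))
    return calls


def dispatch(text, registry):
    """Run each requested tool through a name -> callable registry."""
    results = []
    for name, args in extract_tool_calls(text):
        handler = registry.get(name)
        if handler is None:
            results.append({"tool": name, "error": "unknown tool"})
            continue
        try:
            results.append({"tool": name, "result": handler(**args)})
        except TypeError as err:
            # Wrong or missing keyword arguments, or arguments that are not a mapping.
            results.append({"tool": name, "error": f"bad arguments: {err}"})
    return results
```

Each result dict would then be serialized and injected as the next user turn, as the last line of the diagram describes.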
TRAINING
One GB10 (Grace Blackwell, DGX Spark, 121 GB unified RAM, aarch64 + CUDA 13).
Unsloth's FastModel on Gemma 4 E2B-it, LoRA r=8 α=8, bf16,
cosine schedule. Dataset generated by Kimi K2.5 as teacher
via OpenRouter — 30 inline survival seeds × 5 categories × 8 variants,
JSONL-strict, byte-cap validated, deduped.
The numbers
| STAGE | RESULT |
|---|---|
| DATA GEN | $1.10 · ~30 min · 1151 raw → 1148 kept (99.7%) |
| SFT v2 (3 epochs) | 43 min on GB10 · train loss 0.08 |
| SFT v3 no-think (shipped) | 27 min on GB10 · 2 epochs · template fix kills CoT preamble |
| GGUF EXPORT | ~35 s · bf16 → Q4_K_M |
| DEPLOY | 5–7 min rsync over Tailscale to lastbox |
| END-TO-END CLEAN RUN | ~1.5 h data → deploy → eval |
Loss curve (SFT v2)
| step | train | eval |
|---|---|---|
| 5 | 3.37 | |
| 10 | 2.29 | |
| 15 | 1.38 | |
| 20 | 1.09 | |
| 30 | 0.77 | |
| 50 | 0.26 | 2.62 |
| 100 | 0.08 | 2.64 |
| 150 | 0.07 | 2.65 |
| 195 | 0.08 | 2.64 |
Eval plateau ≈ 2.64 reflects the 114-dialog held-out set; what counts is agent-level eval below.
BENCHMARKS — V2 vs V3 ON THE DEPLOYED BOX
Two trained checkpoints, both benchmarked live against the RPi 5 over Tailscale. v3 shipped because the qualitative gap was decisive even when headline numbers looked similar.
| METRIC | v2 (thinking) | v3 (no-think) — shipped |
|---|---|---|
| Response-quality (25 samples) | 0.518 | 0.506 |
| Of completed dialogs | 13/14 (93%) | 13/14 (93%) |
| format_ok hybrid | 0.52 | 0.52 |
| byte_compliance (≤150 / ≤200 B) | 0.48 | 0.48 |
| persona_ok (no preambles) | 0.56 | 0.52 |
| Median first-token (warm) | ~1.5 s | ~0.7 s |
| Smoke-test response | "Thinking Process: 1. Analyze… 2. Determine…" (516 B) | "1. Apply direct, firm pressure to the wound with a clean cloth." (63 B) |
| Smoke-test end-to-end | 19 s | 4.7 s |
| Sustained generation | 6.4–7 tok/s | 6.4–7 tok/s |
The difference between v2 and v3 is the difference between a thinking-out-loud model with a tighter style and a survival agent that just gives you the answer. v3 is shipped.
Try it now — chat with v6 (Hugging Face Space)
Live demo on Hugging Face Spaces · ZeroGPU backend · three modes: LoRa Radio (≤150 B), Free Chat, RAG Chat with citations. Cold start ~30 s, warm responses 3–10 s.
Post-deadline v6 (SFT warmup) — tool emission solved
After GRPO v4+v5 plateaued at 0% tool emission, we executed roadmap #1 (SFT warmup on tool-only pairs) and discovered the eval itself had a prompt-mismatch bug. Two changes, one breakthrough:
- Filtered train_v2 to 1034 [user, assistant_tool_call] pairs.
- 12-minute Unsloth SFT (r=8, α=8, lr=2e-4, 1 epoch, 65 steps). Loss 0.018.
- Fixed the eval to use the full training-time system prompt (with tool definitions JSON + format hint) — not the shorter SYSTEM_PROMPT_EN alone. Eval had been suppressing emission by omitting the tool defs block.
| Metric | v3 SFT | v4 GRPO | v5 GRPO | v6 (stream eval) | v6 final |
|---|---|---|---|---|---|
| tool_emission_rate | ~0% | 0% | 0% | 48% | 72% |
| tool_accuracy | 0% | 0% | 0% | 44% | 64% |
| arg_validity | 4% | 4% | 4% | 36% | 56% |
| agentic_score | 0.016 | 0.016 | 0.016 | 0.408 | 0.608 |
| byte_compliance | 0.48 | 0.52 | 0.52 | 0.52 | 1.000 |
| format_ok | 0.52 | 0.52 | 0.52 | 0.52 | 1.000 |
| persona_ok | 0.52 | 0.52 | 0.52 | 0.52 | 1.000 |
| response_quality | 0.506 | 0.520 | 0.520 | 0.520 | 1.000 |
| completed / 25 | 14 | 13 | 13 | 13 | 25 |
38× jump in agentic_score (0.016 → 0.608) from one 12-minute SFT pass + two eval-methodology fixes. The 0.52 ceiling on byte_compliance / format_ok / persona_ok turned out to be 13/25 completion rate from streaming-SSE disconnects — never a quality issue. Switching the eval to non-streaming POST + 2-retry on disconnect moved completion to 25/25 and pulled every flag to 1.000.
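The eval fix described above (a non-streaming POST plus two retries on disconnect) might look roughly like this using only the standard library. It is a sketch, not the project's eval harness: the function name `post_chat`, the backoff values and the `_urlopen` injection hook are assumptions made for illustration.

```python
import http.client
import json
import time
import urllib.error
import urllib.request


def post_chat(url, payload, retries=2, timeout=120.0, backoff_s=1.0,
              _urlopen=urllib.request.urlopen):
    """POST a chat request with streaming disabled, retrying on dropped connections.

    _urlopen exists only so the network call can be swapped out in tests.
    """
    body = json.dumps({**payload, "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    last_err = None
    for attempt in range(retries + 1):  # 1 initial try + `retries` retries
        try:
            with _urlopen(req, timeout=timeout) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except (urllib.error.URLError, ConnectionError, TimeoutError,
                http.client.HTTPException) as err:
            last_err = err
            if attempt < retries:
                time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    raise RuntimeError(f"no response after {retries + 1} attempts") from last_err
```

Against the box this would be called with something like `post_chat("http://lastbox.local:11436/v1/chat/completions", {"messages": [...]})`, where the host and port combination is assumed from the architecture diagram.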
Post-deadline GRPO experiment (v4 + v5)
After the hackathon submission we ran two GRPO iterations to lift tool_emission from the v3 baseline. The headline result: RAG is live on the box and answers cite source IDs; GRPO with KL=0.04 cannot move tool_emission from ~0% in 200 steps even with an active −0.5 penalty for skipping the tool when expected.
| Metric | v3 SFT | v4 GRPO (reward v1) | v5 GRPO (reward v2) |
|---|---|---|---|
| tool_emission_rate | ~0% | 0% | 0% |
| byte_compliance | 0.48 | 0.52 | 0.52 |
| format_ok | 0.52 | 0.52 | 0.52 |
| response_quality | 0.506 | 0.520 | 0.520 |
| median first-token | 9 184 ms | 1 735 ms | 8 460 ms |
| completed / 25 | 14 | 13 | 13 |
The lift in byte_compliance + the v4 latency win prove GRPO can move smaller behaviours under this beta. The tool_emission plateau is about policy-shift size, not reward density: the KL term blocks the large move from p(tool)≈0 to p(tool)≈1. Two paths actually fix this, neither is "another GRPO iteration":
- SFT warmup on tool-only pairs — filter train_v2.jsonl to the 993 prompts whose first assistant turn is a <tool_call>, do a quick 50–200-step SFT to set the prior, then GRPO refines without needing to break the KL ceiling.
- GBNF-constrained decoding in llama.cpp — force the first tokens into <tool_call>{…}</tool_call> form, gated by a "should I tool?" classifier in the orchestrator.
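The filtering step in the first path can be sketched as below. It assumes each JSONL record is a chat-style object with a `messages` list of `{"role", "content"}` dicts. That schema is a guess at how train_v2.jsonl is laid out, and both function names are hypothetical.

```python
import json


def first_assistant_is_tool_call(record):
    """True when the first assistant turn opens with a <tool_call> block."""
    for msg in record.get("messages", []):
        if msg.get("role") == "assistant":
            content = str(msg.get("content", ""))
            return content.lstrip().startswith("<tool_call>")
    return False  # no assistant turn at all


def filter_tool_first(lines):
    """Keep only the records whose first assistant turn is a tool call."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        if first_assistant_is_tool_call(record):
            kept.append(record)
    return kept
```

Passing an open file handle works as well as a list of strings, for example `filter_tool_first(open("train_v2.jsonl", encoding="utf-8"))`.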
Sample real interactions
you (touchscreen)> How many watts solar panel for my RPi 5?
lastbox> Min 15 W panel, recommended 20 W with an MPPT regulator for
24/7 operation in cloudy conditions.
(first token 668 ms)
you (lora)> hypothermia signs?
lastbox> Shivering, slurred speech, confusion. Warm gradually, dry
clothes, sweet warm drink. Severe: handle gently, evacuate.
(138 B UTF-8 — under LoRa 150 B cap)
you (touchscreen)> Pi 5 throws undervoltage warning even with the
official 27W PSU. What to check?
lastbox> Check USB-C cable AWG ≤20, no powered USB peripherals draining
current, dmesg | grep undervolt for kernel-side power throttles.
(165 B — under 200 B touchscreen cap)
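The byte counts quoted in these samples refer to UTF-8 length, not character count. A minimal sketch of enforcing the two caps mentioned on this page (150 B for LoRa, 200 B for the touchscreen) follows. The helper names are hypothetical, and truncation is only one possible policy: the project may instead reject over-long replies or regenerate them.

```python
# Caps taken from the text above; the channel keys are illustrative names.
CAPS = {"lora": 150, "touchscreen": 200}


def cap_utf8(text, limit):
    """Truncate text so its UTF-8 encoding is at most `limit` bytes."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    # Cutting at a byte offset can split a multi-byte character;
    # errors="ignore" drops the dangling partial sequence at the end.
    return data[:limit].decode("utf-8", errors="ignore")


def cap_reply(text, channel="lora"):
    """Apply the byte cap for the given output channel."""
    return cap_utf8(text, CAPS[channel])
```

The result is always valid UTF-8, although a cut can still land between a base character and a following combining mark.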
THE BOX // LIVE UI
The same Pip-Boy aesthetic as this page is the lastbox's actual interface. Two columns: live MJPEG stream from the Raspberry Pi camera on the left, operator-chatter radio chat on the right. Snap a frame, ask Gemma about what it sees; type a query "from a remote handheld", get a reply that fits in a 150-byte LoRa packet — both go into the same chat log so the timeline is coherent.
Served by webapp/server.py — stdlib Python only, zero pip
deps. Lives on the SD card, not the NVMe. Reachable on
http://lastbox.local:8080/ from any device on the same LAN.
[node-remote → LASTBOX-A] stop bleeding arm fast
[LASTBOX-A → mesh] Apply direct, firm pressure to the wound immediately.
1. Use a clean cloth or bandage.
2. Keep pressure constant.
[OPTICAL → LASTBOX-A] what do you see?
[LASTBOX-A · vision] The image shows a plain, light-colored, flat surface with a subtle shadow across it. There are no visible plants, wounds, or immediate hazards present.
>_
ROADMAP — WIRED, NOT VAPOR
A handful of features were intentionally left as the next iteration. Each one is wired in the codebase and gated on a clear external signal — a plugged-in device, a freed GPU hour, a register-level fix. They are not vapor, they are switches.
Voice in, voice out
The orchestrator separates intent capture from intent dispatch, so a mic path on the front is one endpoint:
arecord 16 kHz mono 5 s → whisper.cpp tiny.en (~75 MB) → POST /radio-query → Gemma 4 E2B reply → piper | espeak-ng → speaker
Blocker: the ReSpeaker 2-Mic HAT we have ships with a TLV320AIC3104 codec instead of the silkscreened WM8960; the standard overlay fails with -121, the fallback overlay loads but leaves the ADC muted. Two known fixes (custom overlay or an i2cset register sequence) — both short, neither shippable inside the deadline window.
Mesh radio — real packets
demo.py already calls meshtastic --port for
send_lora_message and listen_lora; the Pip-Boy
"RADIO" UI calls those same code paths. /mesh-status
reports the live hardware truth, so the UI degrades to local inference
under the real 150-byte cap.
The moment a working LoRa device shows up on /dev/ttyUSB0
or the SX1262 SPI pins go active, the relay path lights up without a
code change.
Tool-call training (GRPO)
The SFT model rarely emits <tool_call> blocks
(~0% in eval), so the orchestrator currently keyword-routes between
tools. The clean fix is a GRPO pass with
r = +1 if expected_tool_called else 0 against the same
dialog set — ~1 h additional GB10 time.
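The reward rule above can be written as a small scoring function. This sketch assumes the same hypothetical `<tool_call>` JSON payload with a `name` key. The optional `skip_penalty` parameter is an illustrative way to express the −0.5 penalty variant mentioned in the post-deadline GRPO notes, not the project's actual reward code.

```python
import json
import re

_TOOL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def called_tools(completion):
    """Names of all well-formed tool calls found in a completion."""
    names = []
    for raw in _TOOL_RE.findall(completion):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            names.append(obj["name"])
    return names


def tool_reward(completion, expected_tool, skip_penalty=0.0):
    """+1 if the expected tool was called, else 0 (or -skip_penalty if set)."""
    if expected_tool is None:
        return 0.0  # dialog does not expect a tool: neutral reward
    if expected_tool in called_tools(completion):
        return 1.0
    return -skip_penalty if skip_penalty else 0.0
```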
RAG over offline survival manuals
v1 (Polish) shipped a working RAG pipeline — nomic-embed-text
(~180 MB) + libzim ZIM dumps + top-K passage
injection. v2 ships without it on purpose so the baseline numbers
measure what the fine-tune itself knows. The next iteration brings
it back behind a ?rag=true flag.
user → embed (nomic, ~80 ms)
→ sqlite-vss ANN top-K
→ inject passages
→ llama-server (existing path)
→ answer + cited passage IDs
Corpus on the SD card (~2.5 GB total, fits today's budget): US Army
FM 21-76 (public domain), WikiMed
ZIM dump, our own train_v2.jsonl, and a trimmed
Wikipedia survival/first-aid subset. Win is citations —
every answer carries a "FM 21-76, Ch. 4, p. 87" tag so the operator
knows where the advice came from.
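The retrieve-and-inject steps of that pipeline can be illustrated with a brute-force cosine search in plain Python. This is a stand-in for the sqlite-vss ANN lookup, not a use of its API, and it assumes embeddings are already computed. The passage dict layout (`id`, `text`, `vec`) and the prompt wording are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, passages, k=3):
    """Exact top-K by cosine similarity; an ANN index would replace this scan."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Inject retrieved passages, tagged with their IDs so the answer can cite them."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in hits)
    return (
        "Use only these passages and cite their IDs.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

A linear scan like this is only practical for a small corpus; it is shown here to make the data flow concrete.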
Image-paired fine-tune
Today the vision branch is untrained — Gemma's pretraining handles "what is in this picture?" perfectly, but the rowan-vs-yew toxicity distinction (load-bearing for a survival assistant) needs an image-paired SFT. ~500 CC-licensed plant photos × hybrid-format labels would close the gap.
Access-point mode
Today the lastbox joins an existing WiFi network and is reachable on
http://lastbox.local:8080/. With hostapd +
dnsmasq, the same box becomes the network — connect from any
phone to SSID lastbox and the same UI is there.
Out of scope for v1.
NVMe power-saving fix
The on-device NVMe crashed ~2 h before the deadline (classic RPi 5
PCIe power-saving fault — CSTS=0xffffffff). The webapp
was rebuilt stdlib-only and deployed to the SD card so the demo
wouldn't blink. v2 boots with
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
pcie_port_pm=off.