Week 1 of my AI Demo Series · cloudtoailearn.dev · GitHub repo: https://github.com/narenmak17/week01-simple-chat
Every AI tutorial starts with “call the OpenAI API”. That’s fine until you’re paying per token just to experiment with prompt design.
What if you could build a working chat app with zero API costs?
That’s exactly what Week 1 of my demo series tackles, and it didn’t go smoothly. Here’s the honest account.
What I Started With, and Why It Failed
I started with distilgpt2: small, free, runs on CPU. Seemed like the obvious choice.
Every response was my question repeated back in a loop:
User: what is cloud computing
Assistant: what is cloud computing
Assistant: what is cloud computing...
This is not a bug. This is exactly what a base model does.
distilgpt2 was trained to predict the next token on internet text; it has no concept of answering a question. It just continues whatever pattern it sees. When it sees Assistant: followed by the user’s words, it keeps repeating that pattern.
The fix: switch to an instruction-tuned model.
What Actually Works: TinyLlama-1.1B-Chat
TinyLlama-1.1B-Chat is larger than distilgpt2 (1.1B parameters versus roughly 82M) but still small enough to run on CPU, and it is fine-tuned on conversation data. It knows what <|assistant|> means: generate a helpful reply.
Same CPU requirement. Same free download. Completely different behaviour.
User: what is cloud computing
Assistant: Cloud computing is a technology that allows users to access
applications and data on the internet through a computer network called
the cloud. You don't need your own server or storage devices...
Key insight: Model size alone doesn’t determine quality. For chat, what the model was fine-tuned to do matters more. A 1.1B instruction-tuned model will usually be more useful in a chat app than a 1.5B base model.
The Architecture That Works
User → Gradio UI → Prompt Builder → TinyLlama (CPU) → Reply → UI
                         ↓
              TinyLlama chat template
       <|system|> + <|user|> + <|assistant|>
Figure 1: Full flow from user input to model response via Gradio and TinyLlama on CPU
The prompt builder uses TinyLlama’s specific chat template:
<|system|>
You are a helpful assistant.
<|user|>
What is cloud computing?
<|assistant|>
This structure is what tells the model to generate an answer rather than continue text. Without it, you get the loop problem above.
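As a rough sketch of what such a prompt builder could look like, the function below assembles the three-part template from a system prompt and a user message. The name `build_prompt` and the `eos` parameter are hypothetical and not taken from the repo. As far as I know, the template published with TinyLlama-1.1B-Chat also closes each completed turn with `</s>`, so `eos` is exposed as an option rather than hardcoded.

```python
def build_prompt(user_message, system_prompt="You are a helpful assistant.", eos=""):
    """Assemble a TinyLlama-style chat prompt (hypothetical helper).

    eos is appended to each completed turn. Pass "</s>" to add an
    end-of-turn marker, or leave it empty to match the layout shown above.
    """
    return (
        f"<|system|>\n{system_prompt}{eos}\n"
        f"<|user|>\n{user_message}{eos}\n"
        # Leave the assistant turn open so the model continues with a reply.
        "<|assistant|>\n"
    )


prompt = build_prompt("What is cloud computing?")
print(prompt)
```

Ending the prompt on the open `<|assistant|>` turn is the part that matters: the most likely continuation is then an answer, not an echo. If the model is loaded through Transformers, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` builds the same kind of string from the template stored with the tokenizer.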
Other Things That Broke
Gradio 6.x Changed the Chat History Format
Older tutorials use (user, assistant) tuples. Gradio 6.x requires role/content dicts:
# broken in Gradio 6.x
history.append((user_message, reply))
# correct for Gradio 6.x
history.append({"role": "user", "content": user_message})
history.append({"role": "assistant", "content": reply})
This silently breaks the UI: the chat renders blank even though the model is responding correctly. Took longer to debug than it should have.
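If you already have history stored in the old tuple format, a small converter is one way to migrate it. This is a minimal sketch: `tuples_to_messages` is a hypothetical helper, not part of Gradio or the repo.

```python
def tuples_to_messages(history):
    """Convert [(user, assistant), ...] pairs into role/content dicts."""
    messages = []
    for user_message, reply in history:
        messages.append({"role": "user", "content": user_message})
        # A pair may have no reply yet, e.g. while the model is still generating.
        if reply is not None:
            messages.append({"role": "assistant", "content": reply})
    return messages


old_history = [("what is cloud computing", "Cloud computing is ...")]
print(tuples_to_messages(old_history))
```

Each old pair becomes two entries, so a history of N completed turns ends up with 2N messages in the new format.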
server_name="0.0.0.0" Doesn’t Work on Windows
The port was listening but the browser couldn’t connect. Fix:
# broken on Windows
demo.launch(server_name="0.0.0.0")
# correct on Windows
demo.launch(server_name="127.0.0.1")
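To avoid editing the launch call per machine, one option is to choose the address from the operating system. The sketch below assumes that non-Windows hosts should keep the original 0.0.0.0. `pick_server_name` is a hypothetical helper, not something Gradio provides.

```python
import platform


def pick_server_name(system=None):
    """Return a bind address for demo.launch() (hypothetical helper)."""
    system = system or platform.system()
    # On Windows, stay on loopback so the local browser can reach the app.
    if system == "Windows":
        return "127.0.0.1"
    # Assumption: elsewhere (e.g. a Linux container) listen on all interfaces.
    return "0.0.0.0"


print(pick_server_name("Windows"))
print(pick_server_name("Linux"))
```

The launch call then becomes `demo.launch(server_name=pick_server_name())`.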
The Model Is Swappable: No Code Changes
The app reads MODEL_NAME and PROVIDER from environment variables:
rem better local quality
set MODEL_NAME=microsoft/phi-2
rem best local quality via Ollama
set PROVIDER=ollama
set MODEL_NAME=mistral
rem free cloud API (no local GPU needed)
set PROVIDER=groq
set GROQ_API_KEY=your_key_here
Figure 2: Three provider modes: local HuggingFace, local Ollama, and cloud Groq API
This design means the same app code works across all three modes. You pick the provider based on your environment: laptop, server, or cloud.
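Reading that configuration might look roughly like the sketch below. The variable names MODEL_NAME, PROVIDER and GROQ_API_KEY come from the commands above. Everything else is an assumption for illustration: the `load_config` function, the `huggingface` provider label, and the default model id may all differ from what the repo does.

```python
import os

# Assumed defaults; the repo may use different values.
DEFAULT_PROVIDER = "huggingface"  # hypothetical label for the local Transformers mode
DEFAULT_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"


def load_config(env=None):
    """Read provider settings from environment variables (sketch)."""
    env = os.environ if env is None else env
    provider = env.get("PROVIDER", DEFAULT_PROVIDER).lower()

    model_name = env.get("MODEL_NAME")
    if model_name is None and provider == DEFAULT_PROVIDER:
        # Only the local mode gets a default here; other providers would
        # need MODEL_NAME set or their own default (not covered in this sketch).
        model_name = DEFAULT_MODEL

    config = {"provider": provider, "model_name": model_name}
    if provider == "groq":
        # The Groq mode needs a key; fail early with a clear message.
        api_key = env.get("GROQ_API_KEY")
        if not api_key:
            raise ValueError("PROVIDER=groq requires GROQ_API_KEY to be set")
        config["api_key"] = api_key
    return config


print(load_config({"PROVIDER": "ollama", "MODEL_NAME": "mistral"}))
```

Passing a plain dict in place of `os.environ` keeps the function easy to test, and validating the Groq key at startup surfaces a missing variable before the first chat message.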
What I Learned This Week
| Lesson | Detail |
|---|---|
| Base model ≠ chat model | distilgpt2 repeats input; always use an instruction-tuned model for chat |
| Chat template is mandatory | Without <\|system\|>/<\|user\|>/<\|assistant\|> the model doesn’t know its role |
| Gradio 6.x breaking change | History format changed from tuples to role/content dicts |
| Windows networking quirk | With 0.0.0.0 the port listens but the browser can’t connect; use 127.0.0.1 |
| Environment variables > hardcoding | Makes model swapping trivial without touching app code |
Tech Stack
| Component | Tool |
|---|---|
| UI | Gradio 6.x |
| Model (default) | TinyLlama-1.1B-Chat (HuggingFace) |
| Inference | HuggingFace Transformers (CPU) |
| Alternative providers | Ollama (local), Groq API (cloud) |
| Hosting | Hugging Face Spaces (free, public) |
| CI/CD | GitHub Actions |
What’s Next: Week 2
Prompt guardrails and safety patterns: system prompts, refusal handling, and unit tests that prove the guardrails actually work.
Follow along at cloudtoailearn.dev or star the repo.
Have questions about the architecture, or run into similar issues? Connect with me on LinkedIn.
