Merged
21 changes: 17 additions & 4 deletions README.ko.md
@@ -22,20 +22,33 @@

---

## Get started in 3 lines
## Quick Start

**Ollama-style CLI (v0.12.0+):**
```bash
pip install quantcpp

quantcpp pull llama3.2:1b           # download from HuggingFace
quantcpp run llama3.2:1b            # interactive chat
quantcpp serve llama3.2:1b -p 8080  # OpenAI-compatible HTTP server
quantcpp list                       # list cached models
```

Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Models are downloaded automatically on the first `run`/`serve`. `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080.

**One-shot question:**
```bash
quantcpp run llama3.2:1b "중력이란 무엇인가요?"
```

**Python API (3 lines):**
```python
from quantcpp import Model

m = Model.from_pretrained("Llama-3.2-1B") # auto-downloads the model (~750 MB)
m = Model.from_pretrained("Llama-3.2-1B")
print(m.ask("중력이란 무엇인가요?"))
```

No API key. No GPU. No setup. [Try it right in your browser →](https://quantumaikr.github.io/quant.cpp/) · [**How it works guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
No API key. No GPU. No setup. Models are cached in `~/.cache/quantcpp/`. [Try it right in your browser →](https://quantumaikr.github.io/quant.cpp/) · [**How it works guide →**](https://quantumaikr.github.io/quant.cpp/guide/)

---

26 changes: 15 additions & 11 deletions README.md
@@ -37,27 +37,31 @@

## Quick Start

**Terminal (one command):**
**Ollama-style CLI (v0.12.0+):**
```bash
pip install quantcpp
quantcpp "What is gravity?"

quantcpp pull llama3.2:1b # download from HuggingFace
quantcpp run llama3.2:1b # interactive chat
quantcpp serve llama3.2:1b -p 8080 # OpenAI-compatible HTTP server
quantcpp list # show cached models
```

Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080.
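Since the `serve` endpoint speaks the OpenAI chat-completions wire format, any plain HTTP client can call it. A minimal stdlib sketch, assuming `quantcpp serve llama3.2:1b -p 8080` is running locally (the request/response shape follows the OpenAI spec and is not verified against quantcpp itself):

```python
import json
import urllib.request

# Build an OpenAI-style chat-completions request (assumed payload shape).
payload = {
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "What is gravity?"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```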

**One-shot question:**
```bash
quantcpp run llama3.2:1b "What is gravity?"
```

**Python (3 lines):**
**Python API (3 lines):**
```python
from quantcpp import Model
m = Model.from_pretrained("Llama-3.2-1B")
print(m.ask("What is gravity?"))
```

**Interactive chat:**
```bash
quantcpp
# You: What is gravity?
# AI: Gravity is a fundamental force...
```

Downloads Llama-3.2-1B (~750 MB) on first use, cached locally. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**How it works — Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)

---

27 changes: 22 additions & 5 deletions site/index.html
@@ -727,12 +727,25 @@ <h2 class="reveal" data-i18n="glossary.title">Glossary</h2>
<section class="cta" style="background:var(--bg2)">
<div class="container reveal">
<h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
<p style="color:var(--text2);margin-bottom:2rem;max-width:500px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Three lines of Python. No GPU, no API key, no setup.</p>
<pre style="text-align:left;display:inline-block;margin-bottom:2rem"><code>pip install quantcpp
<p style="color:var(--text2);margin-bottom:2rem;max-width:560px;margin-left:auto;margin-right:auto" data-i18n="cta.desc">Ollama-style CLI. No GPU, no API key, no setup.</p>
<div style="display:flex;gap:1.5rem;flex-wrap:wrap;justify-content:center;margin-bottom:2rem;text-align:left">
<div>
<div style="font-size:.75rem;color:var(--text2);margin-bottom:.3rem;font-weight:600" data-i18n="cta.label.cli">CLI (v0.12.0+)</div>
<pre style="margin:0"><code>pip install quantcpp

quantcpp pull llama3.2:1b
quantcpp run llama3.2:1b
quantcpp serve llama3.2:1b -p 8080
quantcpp list</code></pre>
</div>
<div>
<div style="font-size:.75rem;color:var(--text2);margin-bottom:.3rem;font-weight:600" data-i18n="cta.label.python">Python API</div>
<pre style="margin:0"><code>from quantcpp import Model

from quantcpp import Model
m = Model.from_pretrained("Llama-3.2-1B")
print(m.ask("What is gravity?"))</code></pre>
</div>
</div>
<br>
<a href="https://github.com/quantumaikr/quant.cpp" class="cta-btn cta-primary">GitHub</a>
<a href="https://pypi.org/project/quantcpp/" class="cta-btn cta-secondary">PyPI</a>
@@ -896,7 +909,9 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
"glossary.gguf.term": "GGUF",
"glossary.gguf.def": "The standard file format for quantized LLM model weights, created by the llama.cpp project. quant.cpp loads GGUF models directly.",
"cta.title": "Try It Yourself",
"cta.desc": "Three lines of Python. No GPU, no API key, no setup.",
"cta.desc": "Ollama-style CLI. No GPU, no API key, no setup.",
"cta.label.cli": "CLI (v0.12.0+)",
"cta.label.python": "Python API",
"rag.label": "Movement",
"rag.title": "Beyond RAG",
"rag.intro": "Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. <strong>Now they have 128K. The compromise should have started disappearing.</strong>",
@@ -1083,7 +1098,9 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
"glossary.gguf.term": "GGUF",
"glossary.gguf.def": "\uC591\uC790\uD654\uB41C LLM \uBAA8\uB378 \uAC00\uC911\uCE58\uC758 \uD45C\uC900 \uD30C\uC77C \uD615\uC2DD. llama.cpp \uD504\uB85C\uC81D\uD2B8\uC5D0\uC11C \uB9CC\uB4E4\uC5C8\uC2B5\uB2C8\uB2E4. quant.cpp\uB294 GGUF \uBAA8\uB378\uC744 \uC9C1\uC811 \uB85C\uB4DC\uD569\uB2C8\uB2E4.",
"cta.title": "\uC9C1\uC811 \uD574\uBCF4\uAE30",
"cta.desc": "Python 3\uC904. GPU\uB3C4, API \uD0A4\uB3C4, \uC124\uCE58\uB3C4 \uD544\uC694 \uC5C6\uC2B5\uB2C8\uB2E4.",
"cta.desc": "Ollama \uC2A4\uD0C0\uC77C CLI. GPU\uB3C4, API \uD0A4\uB3C4, \uC124\uCE58\uB3C4 \uD544\uC694 \uC5C6\uC2B5\uB2C8\uB2E4.",
"cta.label.cli": "CLI (v0.12.0+)",
"cta.label.python": "Python API",
"rag.label": "운동",
"rag.title": "Beyond RAG",
"rag.intro": "전통적인 RAG는 문서를 512토큰 청크로 나누고, 벡터 DB에 임베딩하고, 조각을 검색합니다. 이것은 LLM이 2K 컨텍스트만 가졌을 때 합리적인 엔지니어링 타협이었습니다. <strong>지금은 128K입니다. 그 타협은 사라지기 시작했어야 합니다.</strong>",