Not to be confused with Grok (xAI’s chatbot, different company, different product). This page is Groq, the LPU inference provider.
One of the fastest LLM providers on the market in 2026. Custom silicon called the Language Processing Unit (LPU) is optimized for low-latency model serving, and Groq’s API exposes supported open models through an OpenAI-compatible developer surface.
System Verdict
Pick Groq if your workload is latency-sensitive. Real-time voice agents, streaming chat interfaces, interactive AI applications all feel qualitatively different at 500+ tokens/second. You notice the speed the first time you try it.
Skip Groq if you need frontier proprietary models. Groq serves supported open and open-compatible model routes. For the newest closed frontier ChatGPT, Claude, or Gemini models, go to the source provider.
The 2026 context: Open-weight flagships have closed the gap on many tasks, but quality still varies by job. Groq’s edge is not “best model”; it is fast serving, simple API migration, and lower-latency economics for the open models it supports.
Key Facts
| Free tier | 30 requests/min, 6,000 tokens/min, 14,400 requests/day |
| Developer tier | 10x free rate limits, 25 percent discount on tokens |
| Llama 4 Scout 17B | $0.11 input / $0.34 output per M tokens (594 TPS) |
| Llama 3.3 70B Versatile | $0.59 input / $0.79 output per M tokens (394 TPS) |
| Llama 3.1 8B Instant | $0.05 input / $0.08 output per M tokens (840 TPS) |
| Qwen3 32B | $0.29 input / $0.59 output per M tokens (662 TPS) |
| GPT OSS 20B | $0.075 input / $0.30 output per M tokens (1,000 TPS) |
| GPT OSS 120B | $0.15 input / $0.60 output per M tokens (500 TPS) |
| Speed | Up to 1,000 tokens/second on GPT OSS 20B; 394 to 840 TPS on Llama-family models |
| Hardware | Custom LPU (Language Processing Unit) silicon |
| Batch API | 50 percent discount for non-real-time workloads (24h to 7d windows) |
| Prompt caching | 50 percent off cached input tokens, no extra caching fee |
When to pick Groq
- Real-time voice applications. Users feel sub-200ms response times. Groq’s streaming LLM inference makes this achievable with open-weight models.
- Streaming chat interfaces. Token streaming that displays in real time. On Groq, the full response often lands before the user finishes reading the first line.
- Production apps scaling open-weight. Low per-token pricing plus low latency can create strong unit economics for Llama, Qwen, Whisper, DeepSeek, and compatible open-model deployments.
- Agent loops with tight latency budgets. Multi-step agent workflows where each LLM call must return fast to meet overall SLA.
When to pick something else
- Frontier proprietary quality: Go direct to OpenAI, Anthropic, or Google.
- Max model variety: Fal.ai (600+ models) or Fireworks AI (400+ models) for broader catalog.
- Long-context workflows: Groq supports long context on supported models but caps below frontier API offerings.
- Consumer chat UI: Groq is API-first. Use Ollama + a chat UI or ChatGPT for consumer workflows.
Pricing
Pricing is per-token and predictable.
| Model | Input $/M tokens | Output $/M tokens | Speed (TPS) |
|---|---|---|---|
| Llama 3.1 8B Instant | $0.05 | $0.08 | 840 |
| GPT OSS 20B | $0.075 | $0.30 | 1,000 |
| Llama 4 Scout 17B | $0.11 | $0.34 | 594 |
| GPT OSS 120B | $0.15 | $0.60 | 500 |
| Qwen3 32B | $0.29 | $0.59 | 662 |
| Llama 3.3 70B Versatile | $0.59 | $0.79 | 394 |
Rate tiers: Free (30 req/min, 14,400/day). Developer (10x free + 25 percent off). Enterprise (custom). Batch API: 50 percent off for 24-hour to 7-day windows. Prompt caching: 50 percent off cached input tokens with no extra caching fee.
Verified 2026-06-12 via groq.com/pricing and Groq supported models.
Failure modes
- Open-weight only. Groq hosts open-weight models including OpenAI’s GPT OSS 20B and 120B, but no frontier ChatGPT, no Claude, no Gemini. If your product needs a closed frontier model, Groq is complementary, not a replacement.
- Free tier rate limits bite. 30 req/min is enough for prototyping, not production. Plan upgrade.
- Model catalog is narrower than FLUX marketplaces. Curated selection of flagship open-weight models, not every model on Hugging Face.
- Model catalog changes. Groq’s supported-model table includes production and preview routes; check model IDs, deprecations, context limits, and rate limits before pinning a production workload.
- LPU geography is limited. Not globally distributed in 2026 at the level of AWS or GCP. Latency is great near a Groq region, less great far from one.
Against the alternatives
| Groq | Fireworks AI | Together AI | OpenAI | |
|---|---|---|---|---|
| Speed (tok/sec) | 394 to 1,000 | 50-200 | 50-200 | 50-100 |
| Hardware | Custom LPU | Blackwell GPUs | H100/H200 | OpenAI infra |
| Llama 4 Scout input | $0.11/M | ~$0.15/M | ~$0.20/M | N/A |
| Proprietary models | No (open-weight + GPT OSS) | No | No | Yes |
| Best for | Latency-critical open-weight | General open-weight inference | Fine-tuning + hosting | Frontier quality |
Methodology
Produced by the aipedia.wiki editorial pipeline. Last verified 2026-06-12 against Groq pricing, Groq docs, and Groq supported models.
FAQ
Is Groq the same as Grok? No. Groq (this page) is a hardware-accelerated LLM inference provider founded in 2016. Grok is xAI’s chatbot and API platform launched in 2023. Different companies, different products, easy to confuse because of the single-letter spelling.
Is Groq really 10× faster than other providers? On open-weight models, the LPU hardware delivers 3-10× higher tokens/second than GPU-based providers. Real-world advantage depends on model, context length, and region.
What’s an LPU and how is it different from a GPU?. Unlike GPUs (which are general-purpose matrix-math chips), LPUs are optimized for the specific compute, and lower cost per token on supported models.
Was Groq acquired by Nvidia? AiPedia is not treating acquisition rumors as current buyer facts. Use Groq’s official site, pricing page, and docs for purchase decisions unless Groq or Nvidia publish a primary-source announcement.
Can I run Llama 4 Scout’s 10M context on Groq? Groq supports long context on some models but not always the full 10M. Check current model specs on Groq’s docs; the effective context window varies.
Related
- Category: AI Chatbots
- See also: Fireworks AI · Together AI · Fal.ai · Llama