Groq's free API is the one developers reach for when latency matters. It runs
models on custom LPU hardware that streams tokens noticeably faster than typical GPU inference, and the free
tier exposes strong open models: Llama 3.3 70B, Llama 4 Scout, and OpenAI's gpt-oss 20B/120B. Keys are free at
console.groq.com/keys, and requests go to an OpenAI-compatible
endpoint at https://api.groq.com/openai/v1. The main constraint is the free tier's per-minute and per-day rate
limits, which is where pooling Groq with other tiers via
freellmpool helps.
Because generation is so fast, Groq is a great default for anything interactive: chat UIs, autocomplete, streaming agents, and tool-use loops where round-trip latency compounds. It's less suited to extremely long context windows or multimodal input — for big documents or vision, Gemini is a better free pick. Treat Groq as your low-latency workhorse and keep a higher-context provider in reserve.
| Model | Use it for |
|---|---|
| llama-3.3-70b-versatile | Best general quality on the free tier |
| llama-3.1-8b-instant | Fastest, cheapest tokens: classification, routing, drafts |
| openai/gpt-oss-120b | Strong reasoning when you can spare the budget |
| meta-llama/llama-4-scout-17b-16e-instruct | Newer Llama 4, long-ish context |
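As a rough illustration of how the table above might be used in code, the sketch below maps a task label to a model ID. The model IDs come from the table; the task labels and the `pick_groq_model` helper are hypothetical names invented for this example and are not part of Groq or freellmpool.

```python
# Model IDs are copied from the table above. The task labels
# ("general", "fast", ...) are hypothetical names chosen for this sketch.
GROQ_MODEL_FOR_TASK = {
    "general": "llama-3.3-70b-versatile",
    "fast": "llama-3.1-8b-instant",
    "reasoning": "openai/gpt-oss-120b",
    "long_context": "meta-llama/llama-4-scout-17b-16e-instruct",
}


def pick_groq_model(task: str = "general") -> str:
    """Return the model ID for a task label, falling back to the general model."""
    return GROQ_MODEL_FOR_TASK.get(task, GROQ_MODEL_FOR_TASK["general"])
```

Falling back to the general-purpose model for unknown labels is one possible design choice; raising an error instead would surface typos earlier.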
Sign in at console.groq.com/keys, create a key, and use the OpenAI-compatible route — most OpenAI SDKs work by only changing the base URL:
curl https://api.groq.com/openai/v1/chat/completions \
-H "Authorization: Bearer $GROQ_API_KEY" -H "Content-Type: application/json" \
-d '{"model":"llama-3.3-70b-versatile","messages":[{"role":"user","content":"Hi"}]}'
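The same request can be sketched in Python using only the standard library. This is a minimal example under a few assumptions: the helper names (`build_groq_request`, `was_truncated`, `ask_groq`), the `max_tokens` default of 512, and the `User-Agent` string are all invented for illustration. The truncation check assumes the response follows the OpenAI chat-completions shape, where `finish_reason` is `"length"` when the token limit cut the answer short.

```python
import json
import urllib.request

GROQ_CHAT_URL = "https://api.groq.com/openai/v1/chat/completions"


def build_groq_request(prompt, api_key, model="llama-3.3-70b-versatile", max_tokens=512):
    """Build (but do not send) a chat-completions request like the curl call above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hypothetical default; raise it for long answers
    }
    return urllib.request.Request(
        GROQ_CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Assumption: an explicit user agent is a harmless precaution,
            # since some gateways reject the default urllib one.
            "User-Agent": "groq-example/0.1",
        },
        method="POST",
    )


def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    choices = response.get("choices") or []
    return bool(choices) and choices[0].get("finish_reason") == "length"


def ask_groq(prompt, api_key):
    """Send the request and return the parsed JSON response (makes a network call)."""
    with urllib.request.urlopen(build_groq_request(prompt, api_key), timeout=30) as resp:
        return json.load(resp)
```

To call the API for real, pass the key from the environment, for example `ask_groq("Hi", os.environ["GROQ_API_KEY"])`, and check `was_truncated(...)` on the result before trusting a long answer.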
max_tokens or you'll get truncated answers.

Groq's speed is great until you hit the minute/day cap mid-task. freellmpool keeps Groq as a preferred provider and transparently fails over to Cerebras, Gemini, NVIDIA and others on a 429, so a single limit doesn't stall you:
pip install freellmpool
export GROQ_API_KEY=... # plus any other free keys
freellmpool ask -p groq "..." # pin Groq when you want its speed
freellmpool ask "..." # or pool + fail over automatically
FREELLMPOOL_ROUTING=fast freellmpool ask "..." # prefer the lowest-latency tier
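The failover idea described above can be pictured with a short sketch. This is not freellmpool's actual implementation, which the page does not show. It is a generic illustration in which each provider is a plain callable that raises a hypothetical `RateLimited` exception on HTTP 429, and `ask_with_failover` is a name invented for this example.

```python
from typing import Callable, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical error a provider callable raises when it receives HTTP 429."""


def ask_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in preference order and return (provider_name, answer).

    A provider that is rate limited is skipped; any other exception propagates.
    """
    skipped = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            skipped.append(name)  # remember it and move on to the next tier
    raise RuntimeError("all providers rate limited: " + ", ".join(skipped))
```

Listing Groq first keeps it as the preferred provider, and the order of the remaining entries decides where a rate-limited request goes next.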
See also Cerebras (the other very-fast tier), best free LLM API gateway, and using multiple free LLM APIs together.
Yes, Groq's API has a free tier. Create a key at console.groq.com/keys and call the OpenAI-compatible endpoint at api.groq.com/openai/v1. Free usage is rate-limited per minute and per day; verify the current limits in the console.
Groq runs inference on its own LPU hardware designed for sequential token generation, which gives much higher tokens-per-second than typical GPU serving — useful for interactive and streaming workloads.
For general quality, use Llama 3.3 70B Versatile; for speed and cheap tokens, use Llama 3.1 8B Instant; for harder reasoning, use gpt-oss-120b.