
Free Groq API: the fast one

Groq's free API is the one developers reach for when latency matters. It runs models on custom LPU hardware that streams tokens noticeably faster than typical GPU inference, and the free tier exposes strong open models, including Llama 3.3 70B, Llama 4 Scout, and OpenAI's gpt-oss 20B/120B. You can get a free key at console.groq.com/keys and call an OpenAI-compatible endpoint at https://api.groq.com/openai/v1. The main constraint is the free tier's per-minute and per-day rate limits, which is where pooling Groq with other tiers via freellmpool helps.

What Groq's free tier is good for

Because generation is so fast, Groq is a great default for anything interactive: chat UIs, autocomplete, streaming agents, and tool-use loops where round-trip latency compounds. It's less suited to extremely long context windows or multimodal input — for big documents or vision, Gemini is a better free pick. Treat Groq as your low-latency workhorse and keep a higher-context provider in reserve.

Which free model to pick

Model	Use it for
`llama-3.3-70b-versatile`	Best general quality on the free tier
`llama-3.1-8b-instant`	Fastest, cheapest tokens — classification, routing, drafts
`openai/gpt-oss-120b`	Strong reasoning when you can spare the budget
`meta-llama/llama-4-scout-17b-16e-instruct`	Newer Llama 4, long-ish context
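
One way to apply the table in code is a small lookup that keeps model IDs in a single place, so a catalog change means editing one dictionary. This is a minimal sketch: the model IDs come from the table above, while the task labels and the `pick_model` helper are hypothetical names chosen for illustration.

```python
# Model IDs copied from the table above; the task labels are hypothetical.
MODEL_FOR_TASK = {
    "general": "llama-3.3-70b-versatile",
    "classification": "llama-3.1-8b-instant",
    "routing": "llama-3.1-8b-instant",
    "draft": "llama-3.1-8b-instant",
    "reasoning": "openai/gpt-oss-120b",
    "long_context": "meta-llama/llama-4-scout-17b-16e-instruct",
}


def pick_model(task):
    """Return the model ID for a task label, defaulting to the general pick."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["general"])
```

Unknown labels fall back to the general-purpose model rather than raising, which is a design choice that suits interactive use; a stricter service might prefer to fail loudly.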

Get a key and call it

Sign in at console.groq.com/keys, create a key, and use the OpenAI-compatible route. Most OpenAI SDKs work once you point the base URL at https://api.groq.com/openai/v1 and supply your Groq key:

curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"llama-3.3-70b-versatile","messages":[{"role":"user","content":"Hi"}]}'
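
The same request can be sketched in Python with only the standard library, which may help if you don't want an SDK dependency. The helper names `build_chat_request` and `ask` are made up for this example, and the custom `User-Agent` value is an assumption (some API gateways reject urllib's default one); the URL, headers, and body mirror the curl call above.

```python
import json
import os
import urllib.request

GROQ_BASE_URL = "https://api.groq.com/openai/v1"


def build_chat_request(prompt, model="llama-3.3-70b-versatile", api_key=None):
    """Build (but do not send) a chat-completions request for Groq."""
    # Fall back to the GROQ_API_KEY environment variable, as in the curl example.
    api_key = api_key or os.environ.get("GROQ_API_KEY", "")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{GROQ_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent; the value is arbitrary.
            "User-Agent": "groq-stdlib-example/0.1",
        },
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as resp:
        payload = json.load(resp)
    # OpenAI-compatible responses put the reply in choices[0].message.content.
    return payload["choices"][0]["message"]["content"]
```

With `GROQ_API_KEY` exported, `print(ask("Hi"))` should print the model's reply. Splitting request construction from sending keeps the network call out of unit tests.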

Limits and gotchas

Free usage is capped by requests/minute, requests/day, and tokens/minute per model — the per-minute token cap is often what you hit first on long outputs. Check current numbers in the Groq console; they change.
Model IDs change as Groq rotates its catalog; a model can be deprecated with little notice.
Some models are reasoning/“thinking” models and burn extra output tokens — set a generous max_tokens or you'll get truncated answers.
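
To catch the truncation case in code, you can inspect `finish_reason` in the response: in the OpenAI-compatible format, the value `"length"` means generation stopped at the token cap. The sketch below assumes a response already parsed into a dict; `extract_reply` and `next_max_tokens` are hypothetical helpers, and the doubling strategy and ceiling are illustrative choices, not Groq recommendations.

```python
def extract_reply(response):
    """Return (text, truncated) from an OpenAI-style chat completion dict."""
    choice = response["choices"][0]
    text = choice["message"].get("content") or ""
    # "length" means the model hit the token cap instead of finishing naturally.
    truncated = choice.get("finish_reason") == "length"
    return text, truncated


def next_max_tokens(current, ceiling=8192):
    """Double the token budget for a retry, without exceeding the ceiling."""
    # The ceiling is an arbitrary example value; check the model's real limit.
    return min(current * 2, ceiling)
```

A caller can retry with `next_max_tokens(current)` when `truncated` is true, keeping in mind that each retry also counts against the per-minute token cap.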

Pool Groq with other free tiers

Groq's speed is great until you hit the per-minute or per-day cap mid-task. freellmpool keeps Groq as a preferred provider and transparently fails over to Cerebras, Gemini, NVIDIA, and others when Groq returns an HTTP 429 (rate limited), so a single limit doesn't stall you:

pip install freellmpool
export GROQ_API_KEY=...                 # plus any other free keys
freellmpool ask -p groq "..."           # pin Groq when you want its speed
freellmpool ask "..."                   # or pool + fail over automatically
FREELLMPOOL_ROUTING=fast freellmpool ask "..."   # prefer the lowest-latency tier
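
The failover idea itself can be illustrated in a few lines of Python. This is a conceptual sketch only, not freellmpool's implementation or API: `RateLimited`, `ask_with_failover`, and the provider callables are hypothetical, and each callable is assumed to raise `RateLimited` when its upstream returns HTTP 429.

```python
class RateLimited(Exception):
    """Raised by a provider callable when its upstream returns HTTP 429."""


def ask_with_failover(prompt, providers):
    """Try each (name, call) pair in order; move on when one is rate-limited.

    Returns (provider_name, reply) from the first provider that answers.
    """
    skipped = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            # Remember who was rate-limited and try the next provider.
            skipped.append(name)
    raise RuntimeError(f"all providers rate-limited: {', '.join(skipped)}")
```

Listing Groq first keeps it as the preferred provider; the order of the list is the whole routing policy in this sketch, whereas a real pool would also track cooldowns and other error types.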

See also Cerebras (the other very-fast tier), best free LLM API gateway, and using multiple free LLM APIs together.

FAQ

Is the Groq API free?

Yes. Create a key at console.groq.com/keys and call the OpenAI-compatible endpoint at api.groq.com/openai/v1. Free usage is rate-limited per minute and per day; verify current numbers in the console.

Why is Groq so fast?

Groq runs inference on its own LPU hardware designed for sequential token generation, which gives much higher tokens-per-second than typical GPU serving — useful for interactive and streaming workloads.

What's the best free Groq model?

It depends on the job: Llama 3.3 70B Versatile for general quality, Llama 3.1 8B Instant for speed and cheap tokens, and gpt-oss-120b for harder reasoning.

Part of freellmpool (MIT, open source). Limits and model IDs change — check Groq's docs. Updated 2026-06-03.