TixelJobs

Senior Engineer - AI Evaluator
G2I · via Ashby

Miami · Posted 1w ago
Other · Senior · Full-time


About the Role

SENIOR AI INTERACTION EVALUATOR (CODEX / CLAUDE CODE)

Contract | $100–$200/hour | 10–20 hrs/week | Start ASAP (through early May)

Check out this Loom video for more details! https://www.loom.com/share/b0d1b0bf24c44ae8b95dca84b9db60e5

We’re looking for a highly experienced software engineer (SR+) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.

This is not a traditional engineering role.

You won’t be writing production code.
You’ll be evaluating something harder: whether the model thinks like a great engineer.


WHAT THIS ROLE ACTUALLY IS

You will assess how AI coding agents behave in real-world scenarios — focusing on:

- Whether the response makes sense

- Whether the preamble and reasoning are useful

- Whether the output reflects strong engineering judgment

- Whether the interaction feels right to an experienced developer

This role is about engineering taste — not syntax correctness.


WHAT YOU’LL BE DOING

- Evaluate AI-generated coding interactions end-to-end

- Judge whether outputs are:

  - Useful

  - Correct (at a high level)

  - Aligned with how a strong engineer would think

- Assess the quality of explanations and reasoning, not just code

- Distinguish between different levels of response quality (e.g. what makes something a 2 vs 4)

- Provide clear, opinionated feedback on:

  - What worked

  - What didn’t

  - What felt “off” or misleading

- Help define what great looks like when interacting with tools like Cursor
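To make the "2 vs 4" distinction above concrete, here is a minimal, hypothetical rubric sketch in Python. The dimensions, 1–5 scale, and min-based aggregation are illustrative assumptions for this posting, not G2I's actual evaluation rubric:

```python
from dataclasses import dataclass

# Hypothetical rubric: score an AI coding interaction on three
# dimensions the posting calls out (correctness, reasoning quality,
# engineering judgment). All names and weights are assumptions.

@dataclass
class InteractionScore:
    correctness: int  # 1-5: is the output right at a high level?
    reasoning: int    # 1-5: are the preamble and explanation useful?
    judgment: int     # 1-5: does it reflect strong engineering taste?

    def overall(self) -> int:
        # One weak dimension caps the score: a technically correct
        # answer with misleading reasoning still erodes trust.
        return min(self.correctness, self.reasoning, self.judgment)

# Correct output, but it "dumps output" with unhelpful reasoning
mediocre = InteractionScore(correctness=4, reasoning=2, judgment=2)
# An interaction a strong engineer would actually produce
strong = InteractionScore(correctness=4, reasoning=4, judgment=5)

print(mediocre.overall())  # 2
print(strong.overall())    # 4
```

The min aggregation (rather than an average) is one way to encode the posting's emphasis that usefulness and trust matter as much as correctness.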


WHAT WE MEAN BY “TASTE”

We’re specifically looking for engineers who can answer questions like:

- Does this feel like something a strong engineer would actually say?

- Is this explanation helpful, or just technically correct?

- Is the model guiding the user well, or just dumping output?

- Would this interaction build or erode trust?

You should be comfortable making subjective but rigorous judgments.


WHO YOU ARE

- Staff / Principal-level engineer (or equivalent experience)

- Strong background in at least one of the following:

  - TypeScript / JavaScript

  - Python

- Hands-on experience using:

  - OpenAI Codex

  - Claude Code

  - Cursor

- Deep familiarity with modern AI-assisted dev workflows

- Able to evaluate code without needing to fully execute or deeply review every line

- Comfortable giving direct, opinionated feedback

- High bar for what “good engineering” looks like


NICE TO HAVE

- Experience with tools like Cursor or similar AI-first IDEs

- Prior exposure to prompt design or evaluation workflows

- Experience mentoring senior engineers or defining engineering standards


ENGAGEMENT DETAILS

- Rate: $100–$200/hour

- Hours: ~10–20 hours/week

- Duration: Through early May (with possible extension)

- Start: ASAP

- Process:

  - Take-home evaluation exercise

  - One behavioral interview