Most research time is not spent thinking about ideas. It is spent checking prior work, testing hypotheses, debugging silent bugs, organizing results, making slides. These are the parts where weeks of work disappear without being noticed.

You run an experiment for three days, only to realize the loss was computed on the wrong dimension. You add a new module that improves results, then discover the baseline was undertrained. You spend a Friday afternoon making slides for a Monday meeting, copying screenshots from terminal outputs into PowerPoint.

Eventually, you internalize a set of checks. You learn what to look for when training curves behave unexpectedly. You run baselines with equal budget before claiming improvement. You learn that a 0.65 mean AUC might hide three classes at 0.9 and five at 0.5.

These lessons are expensive: GPU hours, paper rejections, and months of misdirected effort.

That is where agents can help. Not by driving the research itself, but by acting as guardrails for parts that are mechanical, error-prone, and tedious.

I want to be precise about the distinction. When someone asks an LLM “can you come up with a method that beats X?”, the output is usually unreliable. But when someone says “my method should in principle beat X, but it does not—what is going wrong?”, the problem becomes structured.

Structured diagnosis is exactly what an agent can do well. It can read the code, check the training loop, search for known bugs, and report back.

This is the difference between asking an agent to think and asking it to investigate. The first produces hallucination. The second produces useful work.

· · ·

I recently built a set of simple agent skills for this purpose. They are markdown files that define behavior—no complex infrastructure. Two skills, each targeting a specific source of wasted time.

The first is /research-collaborator. It enforces a simple rule: the agent does the investigative work—reading code, analyzing logs, searching literature, diagnosing failures—and the researcher makes the decisions.

The rules are opinionated, because research methodology is. Every hypothesis needs a kill criterion; otherwise it fails slowly. Predictions must be recorded before running experiments—otherwise it is too easy to rationalize results after the fact, the problem known in empirical research as HARKing (Hypothesizing After the Results are Known). Novelty assessment without a literature search is treated as unreliable. The cheapest test that could kill the hypothesis runs first—a fast proxy at 5,000 steps instead of a full run at 200,000.
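To make the pre-registration rule concrete, here is a minimal sketch of what a hypothesis record could look like. The field names and example strings are illustrative assumptions, not the skill's actual format:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    claim: str           # what you believe, stated before running anything
    prediction: str      # what the experiment should show if the claim holds
    kill_criterion: str  # the concrete result that would falsify the claim
    cheapest_test: str   # the fastest experiment that could trigger the kill

# A hypothetical record, written down before launching the run.
h = Hypothesis(
    claim="Adding module M improves convergence on task T",
    prediction="Validation loss at step 5,000 is below the baseline's",
    kill_criterion="No improvement over the baseline at equal budget",
    cheapest_test="5,000-step proxy run before committing to 200,000 steps",
)
```

The point is not the data structure; it is that the prediction and the kill criterion exist in writing before any results do.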

When results confirm the hypothesis, the agent looks for bugs with the same scrutiny it would apply to a negative result. This is discipline that experienced researchers apply instinctively, but that is easy to skip under pressure.

The skill also defines a diagnostic order: bugs first, then hyperparameters, then data, then metrics, then core mechanism. You do not move to step four without clearing steps one through three.
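The ordering above is essentially an early-exit checklist. A minimal sketch, assuming hypothetical check functions (the stage names mirror the skill's order; the callables are placeholders):

```python
DIAGNOSTIC_ORDER = ["bugs", "hyperparameters", "data", "metrics", "core mechanism"]

def diagnose(checks):
    """Run stages in order; stop at the first one that is not clear.

    `checks` maps each stage name to a callable returning True if that
    stage has been ruled out as the cause of the problem.
    """
    for stage in DIAGNOSTIC_ORDER:
        if not checks[stage]():
            return stage  # investigate here before touching later stages
    return None  # everything cleared: the core mechanism itself is suspect

# A fabricated example where the data stage is the culprit.
result = diagnose({
    "bugs": lambda: True,
    "hyperparameters": lambda: True,
    "data": lambda: False,
    "metrics": lambda: True,
    "core mechanism": lambda: True,
})
```

The early exit is the discipline: you never conclude "the method doesn't work" while a cheaper explanation is still unexamined.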

The bug-checking step draws on a curated list of roughly 190 silent bugs across more than thirteen model families—transformers, diffusion models, GANs, RL, GNNs, contrastive learning, NeRFs, flow matching, and others. These are not hypothetical. Each entry includes the symptom, the detection method, and the fix. Many link to specific GitHub issues and papers where the bug was first identified. The universal tier alone covers classics that have cost the community thousands of GPU hours: nn.ModuleList versus plain Python lists, NumPy seeds not propagating to PyTorch workers, the difference between view and permute, softmax applied to the wrong dimension.
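To show why these bugs are "silent," consider the wrong-dimension softmax. The output still looks like a probability tensor, so nothing crashes; only a deliberate check catches it. A minimal NumPy sketch (my own illustration, not an entry from the skill's list):

```python
import numpy as np

def softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))  # (batch, num_classes)

wrong = softmax(logits, axis=0)   # silently normalizes across the batch
right = softmax(logits, axis=-1)  # normalizes across classes, as intended

# Detection: each example's class probabilities must sum to 1.
assert np.allclose(right.sum(axis=-1), 1.0)
assert not np.allclose(wrong.sum(axis=-1), 1.0)
```

Both tensors have the right shape and values in (0, 1); only the row-sum check distinguishes them, which is why each entry in the list pairs a symptom with a detection method.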

The second source of wasted time is more mundane but just as persistent. Every week, I need to scan experiment folders, collect plots, extract metrics, and assemble slides. It is not intellectually demanding, but it takes time, and it is usually postponed until Sunday evening.

The /results-to-slides skill automates this. It scans git history and outputs, organizes results chronologically, and generates a slide-by-slide script. After review, it produces Marp-compatible markdown that converts to PowerPoint. The researcher reviews; the agent assembles.
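For readers unfamiliar with Marp: a deck is plain markdown with front matter enabling Marp and `---` separating slides, which `marp-cli` can convert to PowerPoint (`marp deck.md --pptx`). A sketch of the kind of output this produces—the filenames and numbers are placeholders, not the skill's actual template:

```markdown
---
marp: true
---

# Week 12: Ablation results

![](plots/ablation_loss.png)

---

# Baseline comparison (equal budget)

| Method   | AUC  |
|----------|------|
| Baseline | 0.68 |
| Ours     | 0.74 |
```

Because the deck stays in markdown until the final conversion, the review step is a text diff rather than clicking through slides.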

· · ·

The broader point is not about these specific skills. It is about where agents are actually useful. The temptation is to aim high—generate ideas, write papers, discover methods. But that is where they are weakest.

The highest-leverage use is in preventing avoidable mistakes: catching a bug before a multi-day run, verifying that comparisons are fair, assembling results without wasting attention.

There is a pattern here. As researchers gain experience, their edge comes less from technical ability and more from knowing what usually goes wrong. A senior PhD student does not write fundamentally different code—they just recognize failure patterns earlier.

This accumulated intuition is exactly what can be encoded and applied consistently.

I do not think agents will make the hard parts of research easier. Formulating a good research question, choosing the right problem, knowing when to abandon a direction—these require judgment that no agent currently has.

But the mechanical parts, where the checks are already known, are well-suited for delegation. Not to save effort, but to preserve attention for the work that actually requires it.

The skills are open-source. They are markdown files. You can read them, modify them, disagree with them. They are a starting place, not a finished product.