
Abhijit Nayak, Senior Data Scientist at Cognizant and IEEE conference speaker, discusses building production-grade information extraction systems for cancer research and why domain expertise matters more than model size.
A July survey in Artificial Intelligence Review analyzed 156 NLP studies in oncology and identified a pattern: transformer models perform impressively on research benchmarks, then collapse when deployed in clinical workflows. ClinicalBERT extracts cancer diagnoses accurately from curated pathology reports. The same architecture fails when hospital documentation varies by physician, institution, and department. The technical foundations are stronger than ever. The systems still don't work in production.
The pattern is familiar across healthcare AI: impressive benchmarks on curated datasets, followed by friction when the same systems meet real-world conditions. In oncology, where 80% of the data needed for treatment decisions and research sits in unstructured clinical notes, this gap has consequences. Cancer registries fall behind. Clinical trial matching slows. Treatment insights that could inform care remain buried in millions of documents that no one has time to read manually.
Abhijit Nayak, Senior Data Scientist (NLP) at Cognizant, builds extraction pipelines that actually survive contact with messy hospital data. His systems process millions of oncology records—extracting diagnoses, biomarker results, treatment timelines—with the validation logic and audit trails clinical environments demand. This year, he's presenting research on LLM reproducibility and prompt optimization at IEEE conferences in Vienna and Singapore. We discussed what kills NLP systems when they move from paper to production, how domain expertise catches edge cases that larger models miss, and why understanding oncology documentation patterns matters more than foundation model parameter counts.
— A July survey in Artificial Intelligence Review analyzed 156 NLP studies in oncology and found a consistent pattern — models that perform well in research rarely survive contact with clinical workflows. You build extraction pipelines that process millions of clinical notes. What actually kills these systems when they move from paper to production?
— Honestly, it starts with something boring — the data just looks completely different. When you read a research paper, the model was trained on a dataset where everything is nicely formatted, sentences are complete, and terminology is consistent. And then you get a real pathology report, and it's a mess. One physician writes tumor staging in a table, while another places it somewhere in the middle of a paragraph with abbreviations I've never seen before. Clinical notes often include phrases like "see prior results" without actually repeating the values. You're extracting the same type of information, but the way it's written varies significantly across institutions, departments, and sometimes even among individual doctors.
And then there's all the infrastructure that nobody writes papers about, because it's not novel, it's just work. You need ingestion, pre-processing, extraction, normalization to standard terminologies, validation logic, and audit trails. Academic benchmarks focus on F1 scores for entity recognition. But in production, if your normalization step silently fails on an unusual input, the whole downstream analysis is wrong — and in oncology, that can mean a missed biomarker or an incorrect treatment timeline.
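*As a concrete illustration of that failure mode, here is a minimal sketch of a normalization step that refuses to fail silently; the terminology map and field names are hypothetical, not Cognizant's pipeline.*

```python
# Minimal sketch: normalization to a standard terminology that never fails silently.
# The terminology map and record fields are illustrative, not a real ontology.
from dataclasses import dataclass, field

BIOMARKER_MAP = {
    "her2 positive": "HER2+",
    "her-2/neu pos": "HER2+",
    "er negative": "ER-",
}

@dataclass
class NormalizedValue:
    raw_text: str
    normalized: str | None
    needs_review: bool = False
    notes: list[str] = field(default_factory=list)

def normalize_biomarker(raw_text: str) -> NormalizedValue:
    key = raw_text.strip().lower()
    if key in BIOMARKER_MAP:
        return NormalizedValue(raw_text, BIOMARKER_MAP[key])
    # Unknown input: flag for human review instead of guessing or dropping the value,
    # so one odd report cannot corrupt the downstream analysis unnoticed.
    return NormalizedValue(raw_text, None, needs_review=True,
                           notes=[f"unmapped biomarker term: {raw_text!r}"])

if __name__ == "__main__":
    for text in ["HER-2/neu pos", "PD-L1 22C3 CPS 10"]:
        print(normalize_biomarker(text))
```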
But I think the hardest part is actually earning trust from the clinical side. These are people who have been doing manual abstraction for years. They know every edge case, every exception. If your system hallucinates once, if it misses something obvious, you've lost them. So you end up building all this explainability infrastructure, showing source sentences, confidence scores, and flagging ambiguous cases. None of that gets published because it's engineering, not research. But without it, nothing deploys.
— Your pipelines extract diagnoses, tumor characteristics, treatment regimens, biomarker results, therapy timelines — all from unstructured text. A pathology report from one physician may look entirely different from a clinical note from another. How do you build systems that handle that variability and still hit the accuracy that clinicians will actually trust?
— You don't solve it with one model. That's the first misconception — people think you train a big transformer, throw documents at it, and it figures everything out. Doesn't work that way in oncology. The variability is too high, and the cost of errors is too high.
What actually works is breaking the problem into smaller pieces. Pathology reports need different handling than radiology summaries. Progress notes are their own beast. So you build specialized components — one module focuses on tumor staging, another on treatment regimens, another on biomarker extraction. Each one is tuned for its specific document type, its specific terminology patterns.
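*A bare-bones sketch of that decomposition, with placeholder extractors standing in for the specialized modules he describes:*

```python
# Minimal sketch: route each document to specialized extractors by type,
# rather than pushing everything through one general-purpose model.
from typing import Callable

def extract_staging(text: str) -> dict:
    # Placeholder for a module tuned to pathology-report staging language.
    return {"task": "staging", "chars_seen": len(text)}

def extract_regimen(text: str) -> dict:
    # Placeholder for a module tuned to treatment-regimen phrasing in progress notes.
    return {"task": "regimen", "chars_seen": len(text)}

EXTRACTORS: dict[str, list[Callable[[str], dict]]] = {
    "pathology_report": [extract_staging],
    "progress_note": [extract_regimen],
}

def run_pipeline(doc_type: str, text: str) -> list[dict]:
    extractors = EXTRACTORS.get(doc_type)
    if extractors is None:
        # Unknown document types are surfaced, not silently skipped.
        return [{"task": "unrouted", "doc_type": doc_type}]
    return [extract(text) for extract in extractors]

print(run_pipeline("pathology_report", "Invasive ductal carcinoma, pT2 N1 M0..."))
```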
And then you layer validation on top. Medical logic checks — does this staging make sense for this cancer type? Does this treatment timeline align with what we extracted about the diagnosis date? If something looks off, it gets flagged. Not rejected automatically, just flagged for review. Because sometimes the weird case is actually correct, and sometimes your model made a mistake. You want a human making that call, not the system silently picking one interpretation.
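*A minimal, hypothetical example of that flag-rather-than-reject logic (the staging rules and dates are illustrative only):*

```python
# Minimal sketch: medical-logic checks that flag suspicious extractions for review
# instead of rejecting them outright. Rule sets and dates are illustrative only.
from datetime import date

VALID_STAGES = {
    "small cell lung cancer": {"limited", "extensive"},
    "breast cancer": {"0", "I", "II", "III", "IV"},
}

def review_flags(cancer_type: str, stage: str,
                 diagnosis_date: date, treatment_start: date) -> list[str]:
    flags = []
    allowed = VALID_STAGES.get(cancer_type.lower())
    if allowed is not None and stage not in allowed:
        flags.append(f"stage {stage!r} unusual for {cancer_type}")
    if treatment_start < diagnosis_date:
        flags.append("treatment start precedes extracted diagnosis date")
    return flags  # an empty list means nothing looked off

print(review_flags("breast cancer", "IIIC",
                   date(2024, 3, 1), date(2024, 2, 15)))
```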
The trust piece comes from transparency. When we surface an extracted value, we show exactly where it came from — the sentence, the document, the date. Clinicians can click through and verify. They're not being asked to trust a black box. And over time, when they see the system getting it right consistently, when they see it catching things they might have missed in a 50-page record — that's when adoption actually happens.
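*The provenance can be carried as plain metadata on each surfaced value; a hypothetical record shape:*

```python
# Minimal sketch: every surfaced value carries its provenance so a clinician
# can click through to the exact sentence and document. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ExtractedValue:
    field: str            # e.g. "tumor_stage"
    value: str            # e.g. "pT2 N1 M0"
    source_sentence: str  # the sentence the value came from
    document_id: str      # which report in the record
    document_date: str    # when that report was written
    confidence: float     # model confidence surfaced to the reviewer

ev = ExtractedValue("tumor_stage", "pT2 N1 M0",
                    "Final staging: pT2 N1 M0.", "path-00042", "2024-03-01", 0.93)
print(ev)
```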
— You've described your systems as production-grade pipelines with MLOps, monitoring, and evaluation standards. Since 2022, you've been leading AI/ML strategy for healthcare projects at Cognizant — deciding which use cases to prioritize, which architectures to standardize. What does it actually take to move an oncology NLP system from prototype to something a research team relies on daily?
— Versioning, monitoring, and a correction pipeline that actually closes the loop. Every extraction needs to be reproducible months later — using the same model version, configuration, and preprocessing. In regulated environments, "we updated the model" isn't an answer. Monitoring catches drift before users do — new report templates, different documentation styles, accuracy drops on specific cancer types. We had tumor staging extraction degrade after one site changed its pathology format. Caught it in dashboards within days.
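*A minimal sketch of that kind of version pinning, with illustrative names rather than the production schema:*

```python
# Minimal sketch: stamp every extraction with the model version, preprocessing version,
# and a config hash so the result can be reproduced months later.
import hashlib
import json
from datetime import datetime, timezone

MODEL_VERSION = "staging-extractor-1.4.2"
PREPROCESS_VERSION = "preproc-0.9.1"
CONFIG = {"max_length": 512, "section_filter": ["FINAL DIAGNOSIS"]}

def config_hash(config: dict) -> str:
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]

def stamp(result: dict) -> dict:
    return {
        **result,
        "model_version": MODEL_VERSION,
        "preprocess_version": PREPROCESS_VERSION,
        "config_hash": config_hash(CONFIG),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

print(stamp({"field": "tumor_stage", "value": "pT2 N1 M0"}))
```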
The feedback loop is often what teams overlook. Clinicians flag errors, those corrections feed back into training data, models get retrained, and performance improves. Sounds obvious, but operationalizing it requires tooling — annotation interfaces, data pipelines, retraining schedules. We spent months building that infrastructure before it began to pay off.
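*One hypothetical shape for closing that loop: each clinician correction becomes a labeled example queued for retraining.*

```python
# Minimal sketch: persist a clinician correction as a labeled example for the
# next retraining run. File name and fields are illustrative.
import json

def record_correction(doc_id: str, field: str, predicted: str, corrected: str,
                      path: str = "corrections.jsonl") -> None:
    example = {"doc_id": doc_id, "field": field,
               "predicted": predicted, "label": corrected}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(example) + "\n")

record_correction("path-00042", "tumor_stage", "pT1 N0 M0", "pT2 N1 M0")
```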
The actual prioritization decisions come down to clinical impact versus technical feasibility. Some extractions are high-value but extremely hard, like parsing free-text treatment modifications. Others are easier wins. You sequence the roadmap so early deployments build credibility while you tackle the more complex problems in parallel.
— Later this year, you're presenting at two IEEE conferences — FMLDS in Vienna on LLM reproducibility through three-way caching, ICNGN in Singapore on prompt optimization for sentiment analysis. How do these connect to your oncology work, or are they parallel tracks?
— They're directly connected, just abstracted. The reproducibility paper emerged from a real-world production problem — LLM outputs aren't deterministic, as the same prompt yields slightly different results across runs. In research, that's noise. In clinical pipelines where audit trails and reproducible extractions are required, it's a blocker. The caching architecture we developed solves that at the infrastructure level.
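*He doesn't walk through the paper's three-way design here, but the basic idea of making repeated calls reproducible can be sketched as a cache keyed on model, parameters, and prompt; this is a generic illustration, not the published architecture.*

```python
# Minimal sketch: cache LLM outputs keyed by model identifier, generation parameters,
# and prompt, so a repeated extraction replays the exact output from the first run.
# Generic illustration only; not the three-way architecture from the paper.
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(model: str, params: dict, prompt: str) -> str:
    payload = json.dumps({"model": model, "params": params, "prompt": prompt},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(model: str, params: dict, prompt: str, generate) -> str:
    key = cache_key(model, params, prompt)
    if key not in _cache:
        _cache[key] = generate(prompt)  # only the first call reaches the model
    return _cache[key]

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; a deployment would hit the inference API here.
    return f"extracted from {len(prompt)} characters"

first = cached_generate("llm-x", {"temperature": 0}, "Extract staging from ...", fake_llm)
replay = cached_generate("llm-x", {"temperature": 0}, "Extract staging from ...", fake_llm)
assert first == replay  # identical on replay, which is what an audit trail needs
```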
The prompt optimization work is about getting consistent performance without fine-tuning. In healthcare, you often can't ship patient data to external APIs for model training. So you need prompting strategies that work reliably out of the box. The emoji research sounds playful, but the underlying question is serious — how do you engineer prompts that produce stable, predictable outputs across different input distributions?
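*A generic illustration (not the paper's method) of pinning the output contract so responses stay parseable across varied inputs:*

```python
# Minimal sketch: a prompt template that fixes the output format, plus a parser that
# rejects anything drifting from the contract. Schema and wording are illustrative.
import json

PROMPT_TEMPLATE = """You are labeling sentiment.
Return ONLY a JSON object with exactly these keys:
  "label": one of ["positive", "negative", "neutral"]
  "evidence": a short quote from the input

Input: {text}
"""

def build_prompt(text: str) -> str:
    return PROMPT_TEMPLATE.format(text=text)

def parse_response(raw: str) -> dict:
    obj = json.loads(raw)
    if set(obj) != {"label", "evidence"}:
        raise ValueError(f"unexpected keys: {sorted(obj)}")
    return obj

print(build_prompt("Great results, minimal side effects so far."))
```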
Both papers address problems I hit in production first. The academic framing came later.
— You've served as a judge at Devpost AI hackathons alongside panelists from Netflix, Meta, and Google. When you're evaluating projects from younger teams, what separates a solution that looks impressive in a demo from one that could actually be deployed?
— The first thing I look at is what happens when inputs break. Demo projects always show the happy path — clean data, expected behavior, impressive results. However, deployable systems need to fail gracefully and recognize when they are uncertain. In healthcare submissions, I specifically watch for edge case thinking — a 95% accurate classifier means nothing if failures cluster around rare conditions where misclassification actually kills someone. Strong teams establish confidence thresholds and human review triggers from the outset. And you can always tell when a team talked to real users versus just built for the demo. The architecture decisions are completely different.
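*A hypothetical example of the confidence thresholds and review triggers he looks for:*

```python
# Minimal sketch: route low-confidence predictions to human review instead of
# acting on them automatically. The threshold value is illustrative.
REVIEW_THRESHOLD = 0.85

def triage(prediction: str, confidence: float) -> dict:
    if confidence < REVIEW_THRESHOLD:
        return {"action": "human_review", "prediction": prediction,
                "confidence": confidence}
    return {"action": "auto_accept", "prediction": prediction,
            "confidence": confidence}

print(triage("HER2+", 0.62))
```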
— Beyond healthcare, you've built foundational AI models for startups in the US philanthropic sector. That's a sharp contrast — oncology is life-or-death, philanthropy is social impact. How transferable are the methods?
— More transferable than you'd expect. Philanthropic organizations sit on massive amounts of unstructured data — grant applications, impact reports, program narratives. The same core problem: critical information is buried in text that nobody has time to read manually. The extraction pipelines I built for oncology — document classification, entity recognition, normalization — adapt directly. What changes is the ontology, not the architecture. In oncology, you're extracting tumor staging and biomarker values. In philanthropy, you're extracting funding amounts, program outcomes, and geographic focus. The validation logic differs, the domain dictionaries are distinct, but the engineering patterns remain the same. And honestly, working across domains makes you better at both. You stop over-fitting your thinking to one problem space.
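*A hypothetical illustration of "the ontology changes, not the architecture": the same pipeline skeleton driven by different domain configurations.*

```python
# Minimal sketch: one domain-agnostic pipeline skeleton, configured per domain.
# Entity lists and dictionaries are placeholders, not real ontologies.
ONCOLOGY_CONFIG = {
    "entities": ["tumor_stage", "biomarker", "treatment_regimen"],
    "dictionary": {"her2+": "HER2 positive"},
}
PHILANTHROPY_CONFIG = {
    "entities": ["funding_amount", "program_outcome", "geographic_focus"],
    "dictionary": {"sf bay area": "San Francisco Bay Area"},
}

def build_pipeline(config: dict):
    def extract(text: str) -> dict:
        # A real system would run entity recognition per configured type;
        # here we only show that the loop itself is domain-agnostic.
        return {entity: None for entity in config["entities"]}
    return extract

oncology_extract = build_pipeline(ONCOLOGY_CONFIG)
grants_extract = build_pipeline(PHILANTHROPY_CONFIG)
print(oncology_extract("..."), grants_extract("..."))
```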
— The subheadline of this interview is *"why domain expertise matters more than model size."* In a field where every month brings a new LLM with more parameters, that's a contrarian position. For someone building a career in healthcare AI, should they focus on the latest foundation models or invest in understanding the medical domain itself?
— Domain expertise, without question. I've seen teams use GPT-4+ on clinical notes and achieve mediocre results because they don't fully understand what they're extracting. They can't tell when the model hallucinates a biomarker value that makes no clinical sense. They don't know which errors are catastrophic and which are tolerable. Meanwhile, someone who understands oncology documentation patterns, knows how tumor staging works, and can read a pathology report — that person builds better systems with smaller models. The foundation model is a tool. Knowing what to make with it, knowing how to validate outputs, knowing where the edge cases hide — that's the hard part, and it comes from domain knowledge. Chase the models, and you're always behind. Invest in the domain, and you're always valuable.




