Research

Schema Lineage Extraction at Scale

Fine-grained lineage extraction and evaluation for complex enterprise data pipelines.

What Is It?

This project introduced a framework for extracting fine-grained schema lineage from enterprise data pipelines, where multilingual scripts and semantic drift make downstream reasoning difficult. It combined a new annotated benchmark with evaluation methods designed to measure extraction quality more faithfully.

Why It Matters

The work turned a messy, high-friction infrastructure problem into something measurable and comparable, making it easier to evaluate model quality and reason about lineage extraction at scale.

Tools & Technologies

Python, Multilingual NLP, Benchmarking, LLM Evaluation

Highlights

Built a benchmark with 1,700 manually annotated lineage examples.
Introduced a composite evaluation metric to better reflect real extraction quality.
Compared 12 language models on a difficult enterprise reasoning task.
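
To give a sense of what a composite metric for lineage extraction can look like, here is a minimal Python sketch. It is an illustration only, not the metric from the paper: the ColumnLineage structure, the set_f1 and composite_score helpers, and the 0.5 weighting are all assumptions made for this example. It blends how well the predicted source columns overlap with the annotated ones with whether the transformation text matches after light normalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnLineage:
    """Lineage for one output column (hypothetical structure for this sketch)."""
    sources: frozenset[str]  # upstream columns that feed the output column
    transform: str           # transformation expression, kept as plain text


def set_f1(predicted: frozenset[str], gold: frozenset[str]) -> float:
    """F1 overlap between two sets of source columns."""
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def normalize(expr: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(expr.lower().split())


def composite_score(
    predicted: dict[str, ColumnLineage],
    gold: dict[str, ColumnLineage],
    source_weight: float = 0.5,  # assumed weighting, chosen only for illustration
) -> float:
    """Average, over annotated output columns, of a weighted source/transform score."""
    if not gold:
        return 1.0 if not predicted else 0.0
    total = 0.0
    for column, gold_entry in gold.items():
        pred_entry = predicted.get(column)
        if pred_entry is None:
            continue  # an output column the model missed contributes 0
        source_score = set_f1(pred_entry.sources, gold_entry.sources)
        transform_score = float(
            normalize(pred_entry.transform) == normalize(gold_entry.transform)
        )
        total += source_weight * source_score + (1 - source_weight) * transform_score
    return total / len(gold)


gold = {
    "revenue": ColumnLineage(frozenset({"orders.price", "orders.qty"}), "price * qty"),
}
predicted = {
    "revenue": ColumnLineage(frozenset({"orders.price"}), "PRICE  *  QTY"),
}
score = composite_score(predicted, gold)
print(round(score, 4))
```

In this toy case the prediction recovers one of two source columns and the right transformation, so it earns partial credit rather than a flat miss, which is the general motivation for combining several signals into one score.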

Outcome

The project was accepted to the NeurIPS 2025 LLM Evaluation Workshop and the DL4C Workshop.