3. Data Governance, Management, and Operations
An operational asset for agents and a strategic asset for society as a whole.
How It (Doesn’t) Work Today
Managing information has long been a core function of government. Accurate, timely and well-governed data is essential for effective public service delivery and for government decision-making that produces better policies.
Data is also one of the government's core platform offerings. The public sector plays a critical societal role in organising, certifying and making accessible the information that modern societies and economies need. This is exemplified by institutions such as public libraries, statistics bureaus, property and business registries, copyright offices and open data repositories.
Governments have struggled to keep pace with the exponential growth and complexity of Big Data: the vast, diverse and rapidly flowing streams of information that underpin modern AI. While most governments have focused on managing structured register data (i.e. small data), the immense potential of Big Data has largely remained untapped.
Even when the technical foundations are in place, many governments fail to make meaningful use of their data assets, held back by gaps in leadership, skills and mindset. In the UK, for instance, 45 percent of public sector organisations still lacked a formal data strategy in 2023, while 73 percent kept their most valuable data on-premises, thereby limiting access and integration.
A Vision for a Data Layer Supporting Agentic Government
Over the past decade, AI has reshaped our understanding of what is considered ‘high-value’ data. With AI agents, all data becomes potentially high-value: information hidden in unstructured data becomes analysable and actionable. Viewed in aggregate, a nation’s data merits treatment as a core strategic asset and national infrastructure on par with health, education, transport, or energy.
Enabling agentic government requires a fundamental update in how the public sector governs and uses data:
Storage, computation and interoperable sharing: For both training and operational purposes, public administrations must secure real-time machine access to vast, diverse data pools, including records, documents, images, sensor streams, and logs. Governance measures, including privacy protections and purpose restrictions, must be fine-grained: applied not to entire datasets but to individual fields and rows, with access and usage governed by policies that evaluate each specific query and use case (a minimal policy sketch follows this list).
Ecosystem-level data fabric: This allows AI agents to integrate public data with consented private sector data. For instance, combining real-time log data across government entities and critical infrastructure could enable earlier detection of coordinated cyber attacks. Real-time data flows are the lifeblood of agentic workflows, enabling live insights and responsive services from both public and private actors.
Identity: Unique, standardised identifiers for individuals, locations, legal entities, and physical assets help AI agents develop a holistic and accurate understanding of the entities they interact with. Such an identity framework also ensures the ‘once-only’ principle can be effectively extended to AI-driven interactions.
Metadata: Rich, machine-readable metadata provides essential context for each data asset. This extends beyond basic descriptions to include clear usage policies, data quality indicators, and a complete audit trail (so-called lineage) of how data has been generated and transformed. This makes AI-driven government decisions explainable, auditable, and trustworthy (a sample metadata record follows this list).
Agent infrastructure: Beyond individual data elements like identity and metadata, enabling an effective agentic government requires so-called Agent Infrastructure: the technical backbone governing how AI agents interact with their environment, each other, and human institutions. Such infrastructure must support ‘attribution’, i.e. linking agent actions to responsible entities, shape agent interactions to be safe and efficient, and provide mechanisms to detect and remedy harms (a minimal attribution record is sketched after this list).
Openness: An ‘open by default, closed by exception’ stance should guide data policy. Public datasets, metadata, logs, source code, models, training data, and even weights should be made available wherever possible. This allows citizens, businesses, and oversight bodies to understand how AI systems inform government decisions, fostering public trust and democratic legitimacy. Widespread openness further accelerates innovation by enabling academia, private companies, and civic technologists to build upon government AI work.
Data commons: Governments can convene and steward high-value, curated (multi)national training datasets to support responsible AI development. These include collections of local-language text (e.g. public documents, service interactions, or educational materials), annotated public health images, depersonalised legal or administrative texts, and satellite imagery. These should be permissively licensed for use by academia, domestic AI vendors, and public sector innovators, with guardrails in place to protect privacy and prevent misuse.
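To make field- and row-level governance concrete, the sketch below shows how a policy might be evaluated against each individual query rather than an entire dataset. It is a minimal Python illustration; the Policy class, the field names and the benefits-eligibility scenario are hypothetical assumptions, not a reference to any existing government system.

    # Minimal sketch: field- and row-level access policies evaluated per query.
    # All names and data are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Policy:
        purpose: str                          # e.g. "benefits_eligibility"
        allowed_fields: set[str]              # field-level restriction
        row_filter: Callable[[dict], bool]    # row-level restriction

    def evaluate_query(policy: Policy, purpose: str, fields: list[str], rows: list[dict]) -> list[dict]:
        """Return only the fields and rows that this purpose is allowed to see."""
        if purpose != policy.purpose:
            raise PermissionError(f"purpose '{purpose}' is not covered by this policy")
        disallowed = set(fields) - policy.allowed_fields
        if disallowed:
            raise PermissionError(f"fields not permitted: {sorted(disallowed)}")
        return [{f: row[f] for f in fields} for row in rows if policy.row_filter(row)]

    # Hypothetical example: an eligibility agent may read income band and region,
    # but only for citizens who consented to automated processing.
    policy = Policy(
        purpose="benefits_eligibility",
        allowed_fields={"income_band", "region"},
        row_filter=lambda row: row.get("consent_automated", False),
    )
    rows = [
        {"citizen_id": "A1", "income_band": "low", "region": "North", "consent_automated": True},
        {"citizen_id": "B2", "income_band": "high", "region": "South", "consent_automated": False},
    ]
    print(evaluate_query(policy, "benefits_eligibility", ["income_band", "region"], rows))
    # [{'income_band': 'low', 'region': 'North'}]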
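The following is a deliberately simplified shape for such a metadata record, combining a usage policy, quality indicators and lineage in one machine-readable structure. The schema and field names are assumptions for the sake of the example, not an existing metadata standard.

    # Illustrative machine-readable metadata record with usage policy,
    # quality indicators, and lineage. The schema is an assumption.
    from datetime import date

    dataset_metadata = {
        "id": "registry.business.active-companies",
        "description": "Active companies extracted from the business registry",
        "usage_policy": {
            "licence": "open-government-licence",
            "allowed_purposes": ["statistics", "service_delivery"],
            "prohibited_purposes": ["direct_marketing"],
        },
        "quality": {
            "freshness_sla_hours": 24,
            "last_updated": date.today().isoformat(),
            "completeness_pct": 99.2,
            "known_bias_notes": "Under-represents newly registered sole traders",
        },
        "lineage": [
            {"step": "extract", "source": "business_registry.companies", "run_id": "2025-01-14T02:00Z"},
            {"step": "transform", "operation": "filter status == 'active'", "run_id": "2025-01-14T02:05Z"},
        ],
    }

    def explain(meta: dict) -> str:
        """Produce a human-readable audit summary an oversight body could inspect."""
        steps = " -> ".join(s["step"] for s in meta["lineage"])
        purposes = ", ".join(meta["usage_policy"]["allowed_purposes"])
        return f"{meta['id']}: lineage {steps}; allowed purposes: {purposes}"

    print(explain(dataset_metadata))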
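As a minimal sketch of attribution, the snippet below records each agent action together with the responsible entity and an integrity tag, so that auditors can later verify the record has not been altered. The record format and the shared signing key are illustrative assumptions; a real deployment would rely on proper key management and digital signatures.

    # Minimal attribution sketch: link each agent action to a responsible entity
    # and protect the record with an integrity tag. Key handling is illustrative.
    import hashlib, hmac, json, time

    SIGNING_KEY = b"shared-audit-key"  # assumption: in practice, per-agency keys in secure storage

    def record_action(agent_id: str, responsible_entity: str, action: str, payload: dict) -> dict:
        record = {
            "agent_id": agent_id,
            "responsible_entity": responsible_entity,  # the accountable organisation
            "action": action,
            "payload": payload,
            "timestamp": time.time(),
        }
        body = json.dumps(record, sort_keys=True).encode()
        record["integrity_tag"] = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
        return record

    def verify(record: dict) -> bool:
        tag = record.pop("integrity_tag")
        body = json.dumps(record, sort_keys=True).encode()
        record["integrity_tag"] = tag
        return hmac.compare_digest(tag, hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest())

    entry = record_action("permit-agent-07", "Ministry of Housing", "issue_permit", {"permit_id": "P-123"})
    print(verify(entry))  # True; tampering with the record would make this False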
Two cross-cutting practices are essential to make all this work:
1) Data product and service management: Every major public dataset (such as core registries, key operational data streams, or critical unstructured information) should be treated as a distinct data product or service, with a designated public steward, public quality standards (e.g. for freshness, accuracy, and bias), and automated reliability monitoring. This ensures government agents operate on trusted, well-governed inputs; a minimal monitoring sketch follows this list.
2) Agents as data scientists: AI agents can augment the government's limited data science capacity by autonomously analysing patterns, generating insights, and surfacing anomalies.
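The sketch below shows what automated reliability monitoring for such a data product could look like, with an agent-style check that also surfaces simple anomalies. The thresholds, function names and sample figures are assumptions chosen for illustration only.

    # Illustrative reliability check for a public data product, against
    # published (hypothetical) standards for freshness, completeness and volume.
    from datetime import datetime, timedelta, timezone

    QUALITY_STANDARD = {"max_age_hours": 24, "max_null_rate": 0.02, "max_daily_change_pct": 15.0}

    def check_data_product(last_updated: datetime, null_rate: float,
                           row_count_today: int, row_count_yesterday: int) -> list[str]:
        """Return a list of issues; an empty list means the product meets its standards."""
        issues = []
        age = datetime.now(timezone.utc) - last_updated
        if age > timedelta(hours=QUALITY_STANDARD["max_age_hours"]):
            issues.append(f"stale: last updated {age} ago")
        if null_rate > QUALITY_STANDARD["max_null_rate"]:
            issues.append(f"completeness below standard: {null_rate:.1%} nulls")
        change = abs(row_count_today - row_count_yesterday) / max(row_count_yesterday, 1) * 100
        if change > QUALITY_STANDARD["max_daily_change_pct"]:
            issues.append(f"anomalous volume change: {change:.1f}% day-on-day")
        return issues

    # Example run against yesterday's and today's load of a registry extract.
    print(check_data_product(
        last_updated=datetime.now(timezone.utc) - timedelta(hours=30),
        null_rate=0.01,
        row_count_today=52_000,
        row_count_yesterday=40_000,
    ))
    # ['stale: last updated ... ago', 'anomalous volume change: 30.0% day-on-day']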
One frontier role for public sector agents is acting as data fiduciaries for individuals — not merely querying data but maintaining dynamic personal models that help citizens interact with the state. Just as banks maintain financial models for creditworthiness, public agents could maintain administrative models of eligibility, preference, and risk, continually updated with consent, and used to streamline interactions with public systems. This shifts the burden of data navigation away from citizens and toward agentic intermediaries operating under fiduciary obligation.
What the Private Sector is Doing with Data
Embed computation next to data rather than shipping data out.
Data governance and use in enterprise settings are rapidly growing more sophisticated and complex. Here is a snapshot of tools and practices currently gaining broad traction:
Data as a product and data contracts: Beyond just being a resource, data assets are managed like commercial products. Each ‘data product’ (e.g. a specific dataset or real-time stream) has a designated owner, a publicly defined schema, and service level agreements (SLAs) detailing its quality, freshness, and reliability. ‘Data contracts’ formalise these terms between data producers and consumers, with automated tests ensuring compliance.
Real-time API fabrics and event streaming: The paradigm has shifted from slow, periodic batch data transfers to live ‘event streams’ and API-first architectures. This enables AI agents, applications, and analytical systems to access and react to data instantly. Technologies like zero-copy sharing allow secure, live access to data across organisational boundaries without costly and risky physical duplication.
In-platform governance with automated lineage and policy: Modern data platforms embed governance directly into their architecture. Data lineage (tracking the origin, transformations, and usage of data at a granular level) is captured automatically. Access controls, privacy rules and usage policies are enforced programmatically and continuously, making compliance auditable by design rather than an external check.
Privacy-enhancing technologies (PETs) for secure collaboration: To collaborate on sensitive data without compromising privacy or commercial confidentiality, enterprises use PETs. Secure data clean rooms, for example, provide controlled environments where multiple parties can pool and analyse their data (or run AI models on it) without any participant seeing another's raw data. One example is researchers running analytical code where sensitive census microdata lives, receiving only aggregate, anonymised results. Federated learning allows AI models to be trained on decentralised datasets (e.g. across different hospitals or company branches) by sending the model to the data, learning locally, and then aggregating insights centrally without exposing the source data. Other PETs, such as differential privacy and homomorphic encryption, are also gaining traction; a toy differential-privacy sketch follows below.
Curated corpora and synthetic data for generative AI: Enterprises strategically invest in creating high-quality, specialised datasets, or corpora (e.g. internal documents, customer interactions, research notes), to fine-tune foundation models for their specific industry and tasks. They also increasingly use generative AI itself to create realistic synthetic data when real-world data is scarce, sensitive, or imbalanced.
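As a toy illustration of the privacy-enhancing techniques above, the sketch below returns a differentially private count: the kind of noisy aggregate a clean room might release instead of raw microdata. The epsilon value, the query and the sample records are assumptions made for the example; this is not a production-grade implementation.

    # Toy differential privacy sketch: a counting query with Laplace noise.
    # Epsilon and the sample microdata are illustrative assumptions.
    import random

    def dp_count(records: list[dict], predicate, epsilon: float = 1.0) -> float:
        """Counting query with Laplace noise; the sensitivity of a count is 1."""
        true_count = sum(1 for r in records if predicate(r))
        # A Laplace(scale = 1/epsilon) sample, built as a signed exponential draw.
        noise = random.choice([-1, 1]) * random.expovariate(epsilon)
        return true_count + noise

    microdata = [
        {"region": "North", "employed": True},
        {"region": "North", "employed": False},
        {"region": "South", "employed": True},
    ]
    # The analyst sees only the noisy aggregate, never the underlying rows.
    print(dp_count(microdata, lambda r: r["region"] == "North" and r["employed"]))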
Key Questions
Where do we draw the privacy–utility line in an agentic world? As AI agents learn and act in real time, how should we govern access to sensitive data? What determines whether a dataset should be open, restricted to clean-room environments, or closed altogether, and who defines the rules of use at the speed of automation?
When do you build on what you have vs. when do you start over? Should governments incrementally adapt legacy data infrastructure, or is now the moment for deeper re-engineering? What is the tipping point between patching what exists and building what is truly needed for AI agents at scale?
Why have so many previous data strategies failed to stick politically? Despite countless initiatives, data often remains an afterthought in digital transformation. What makes it so hard to mobilise leadership around foundational data work?
How do we prevent openness from becoming a vulnerability? ‘Open by default’ is a powerful norm — but how do we prevent malicious use of openly accessible logs, models, or weights?
Which types of data sharing maximise third-party innovation? What is needed to enable them, in terms of content, format, and delivery infrastructure?