No shortcuts to research information citizenship
Over the past decade, the research community has become increasingly comfortable using the language of openness. We talk about open science, open infrastructure, open metadata and open knowledge graphs. But openness, like citizenship, is not something that can be claimed by declaration alone. It is something that must be earned, maintained, and continually renewed.
Research information citizenship is not just about whether data is easy to access. It is also about ensuring that our data can be used responsibly, lawfully, and sustainably — without hidden dependencies, unacknowledged permissions, or fragile assumptions. In other words, data must have clear context. As we move into the latter part of the 2020s, it is worth asking an uncomfortable question: Have we finished building the open research information infrastructure we rely on, or have we merely convinced ourselves that we have?
The infrastructure decades
The 2010s were, by any reasonable measure, a remarkable decade for research infrastructure. During that period the community converged on a shared vision of what good infrastructure should look like. Persistent identifiers became the backbone of scholarly communication. ORCID provided a global solution for researcher identification. DataCite joined Crossref to embed DOIs at the centre of the scholarly record. Later, ROR gave us a long-overdue open identifier for organisations.
This was not just technical progress; it was social progress. These initiatives succeeded not only because they were community-governed, interoperable, and explicitly designed to serve the public good, but because the community was the infrastructure. By the end of the decade, we had something rare in scholarly communication: broad agreement on the rails we wanted to run on.
The 2020s, by contrast, have been a decade of implementation. The challenge has not been inventing new infrastructure, but persuading the community to actually use what already exists — to connect ORCIDs to workflows, to deposit richer metadata, to treat affiliations, funding information, and relationships as first-class citizens rather than optional extras.
As we look towards 2030, the focus will change again. Research information is no longer just supporting discovery and reporting. It is becoming the substrate for assessment reform, large-scale analytics, AI review systems, and even AI co-researchers. The next decade will not be about experimentation. It will be about our collective dependence on sources of information that we implicitly accept as truth. And that interdependence demands a far higher standard of robustness.
Open principles and the temptation to claim victory early
In this context, it is entirely understandable that the community is drawn to initiatives that appear to offer complete open metadata today. Large-scale secondary-source aggregation projects can be transformative. They demonstrate what is possible when fragmented systems are connected, and they lower barriers for institutions that lack the resources to assemble such views themselves.
OpenAlex is a good example of this. It is an impressive achievement, and it has unquestionably shifted expectations about what an open catalogue of the research system might look like. The risk, however, is that initiatives like this can create a sense that the hard work is largely behind us — that the open research information problem has, for practical purposes, been solved or is at least as good as it is ever going to get.
That sense is misleading.
From the perspective of Digital Science’s Open Principles, continuing to strengthen the primary sources of our core metadata matters deeply. We explicitly welcome a strong, shared open core for research infrastructure: common identifiers, open metadata, and interoperable standards that everyone can build upon. Indeed, Dimensions would not have been possible without that infrastructure. Dimensions shows that this open core is not a threat to innovation; rather, it is what makes innovation at the open data horizon possible. It is why, when given the opportunity, Digital Science chose to make the GRID database CC0, a move that allowed it to become the founding dataset for ROR.
Aggregated secondary sources that include shared infrastructure, however, can disguise as much as they reveal. A clean API does not guarantee clean provenance. A comprehensive-looking record does not mean that the underlying components are open in the same way, or under the same conditions. When we treat aggregated completeness as equivalent to the completeness of our open infrastructure, we risk building the next decade of services on a platform that is far shakier than it appears.
Harvested information is a shaky foundation for open infrastructure
Open infrastructure creates the conditions for competition, differentiation, and sustainability. Innovation can only thrive if the open layer it is based on is genuinely open, well-governed, trustworthy, and stable.
A recurring problem in research information infrastructure is that key metadata elements, particularly affiliations, are not consistently available from primary sources. Crossref records, while foundational, do not always include affiliation data for the authors of research articles. PubMed does, but only for a relatively small subset of the literature. For large swathes of non-medical research, affiliation information exists primarily on publisher websites in unstructured form, and has not been deposited openly as part of DOI registration.
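To make the gap concrete, the short sketch below checks whether the open Crossref record for a given article carries any deposited affiliations at all. It is a minimal example in Python against the public Crossref REST API, and the DOI shown is a placeholder; for a large share of the literature, the list it returns is simply empty.

```python
# Minimal sketch: does the open Crossref record for a DOI include any
# author affiliations? The DOI below is a placeholder; substitute any
# registered DOI you care about.
import requests

CROSSREF_WORKS = "https://api.crossref.org/works/"

def deposited_affiliations(doi: str) -> list[str]:
    """Return the affiliation names deposited with a DOI (often an empty list)."""
    response = requests.get(CROSSREF_WORKS + doi, timeout=30)
    response.raise_for_status()
    authors = response.json()["message"].get("author", [])
    return [
        affiliation.get("name", "")
        for author in authors
        for affiliation in author.get("affiliation", [])
    ]

if __name__ == "__main__":
    doi = "10.1234/placeholder-doi"  # placeholder, not a real registration
    names = deposited_affiliations(doi)
    print(names or "No affiliations deposited in the Crossref record.")
```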
The consequence is subtle but significant. To create records that look complete, secondary-source aggregators that do not have explicit agreements with publishers are pushed towards harvesting metadata directly from publisher landing pages, often against the terms and conditions of those websites.
The result is that an “open” secondary-source provider cannot guarantee the continued collection or retention of information gathered by website harvesting. This creates an unstable data source: records may be removed at a publisher’s request, as recently happened with abstracts, or may become unretrievable after the website provider changes its technology. That instability in turn creates operational risk for vendors, institutions, and even assessment systems, because the metadata they rely on today may not be there tomorrow.
The AI era made the cracks visible
For a long time, these issues remained largely theoretical, or at least invisible to the community. That changed abruptly with the rise of large language models.
In the age of AI, harvesting websites for content is no longer a benign activity. Modern AI agents are designed to ingest entire corpora in order to train, fine-tune, and refresh models. From the perspective of a website operator, this behaviour is indistinguishable from a sustained denial-of-service attack: high request concurrency, aggressive pagination, and repeated full-corpus downloads that overwhelm application servers, databases, and network links, all adding to the costs of providing the service.
For open research infrastructures, the increased burden of AI harvesting is existential. These services operate on constrained, often grant-funded budgets where bandwidth, compute, and storage costs are very real. When a swarm of AI scrapers mirrors entire collections, the cost of being open rises sharply: increased cloud egress fees, degraded performance for legitimate users, polluted usage metrics, and staff time diverted from stewardship to firefighting. These costs fall on commercial entities too.
In response to widespread, unlicensed data harvesting, many publishers and institutions introduced aggressive bot-detection and blocking measures. These were not aimed at research metadata harvesting projects in particular, but they had immediate consequences for them. At the end of 2025, for example, a combination of bot-detection blocking measures and problems with a new code base caused affiliation data in the OpenAlex corpus to drop precipitously. With affiliation data and other fields suddenly unavailable at scale, the community was given a glimpse of how fragile parts of our supposedly open infrastructure actually are. The problem was not just a temporary outage. It became clear that, outside of the PubMed corpus, approximately 40% of OpenAlex journal articles from 2025 rely on affiliation data scraped from publisher websites: a timely reminder that some parts of the metadata ecosystem were only ever open by tolerance, not by design.
Openness with hidden dependencies
With new anti-AI-bot harvesting protections in place, commercial web-scraping offerings are now designed to avoid detection through a range of methods, including rotating residential IP addresses and intelligent agents that masquerade as human visitors. These services, when effective, remove a content provider’s ability to decide whom it serves content to, or to protect itself against the service and traffic costs associated with large-scale web crawling.
Significantly, in order to restore coverage previously gained through website scraping, metadata harvesting projects must now employ the same bot-evasion techniques used by commercial AI agents. OpenAlex, for example, uses an external commercial company called Zyte, and has said that affiliation data will be restored later this month. It should also be noted that OpenAlex does not just harvest the landing pages for which it is missing affiliation data; it harvests the landing page of every work for which it has a DOI.
As well as making coverage dependent on who is winning the AI-harvesting-bot wars at any given moment, collecting data in this way also changes the ethical character of the system. Using such data implicitly endorses the companies that profit by enabling large-scale harvesting of website content against the explicit wishes of those content providers, something that many institutions would be uncomfortable supporting if asked directly.
There is a deeper structural issue here. When open datasets depend on reservoirs of proprietary information to achieve complete records, the intention of openness is inverted. Other open providers cannot compete on equal terms without adopting the same practices. The result is an uneven landscape in which openness becomes performative rather than principled: free to use, but not fully reproducible; transparent at the surface, opaque underneath.
Of course, it need not be this way. Through Crossref, DataCite, ORCID, and ROR, together with established standards like OAI-PMH and a dedication to collective research information citizenship, we have the technical and social infrastructure to ensure that website harvesting plays no part in our core open infrastructure. Indeed, initiatives such as OpenAIRE demonstrate what is possible today based on open standards.
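As an illustration of what standards-based collection looks like, the sketch below walks an OAI-PMH ListIdentifiers harvest in Python, following resumption tokens rather than scraping landing pages. The repository base URL is hypothetical; any OAI-PMH-compliant endpoint exposes the same verbs and token flow.

```python
# Minimal sketch of standards-based harvesting via OAI-PMH, with no
# scraping of publisher landing pages. The base URL is hypothetical;
# any OAI-PMH-compliant repository exposes the same verbs.
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_identifiers(metadata_prefix: str = "oai_dc"):
    """Yield record identifiers, following OAI-PMH resumption tokens."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        response = requests.get(BASE_URL, params=params, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        if not token:
            break
        # On follow-up requests, the resumption token replaces the other arguments.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}

if __name__ == "__main__":
    for count, identifier in enumerate(list_identifiers(), start=1):
        print(identifier)
        if count >= 10:  # only the first ten, for the example
            break
```

Nothing in this flow depends on evading bot detection; the repository explicitly offers its records for harvesting, which is precisely the point.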
Finishing the job we already agreed to do
An open core is not the end of innovation — it is the beginning of it. When the foundations are solid, organisations can compete responsibly at the frontier, building new services, insights, and tools without needing to re-litigate the basics of data provenance and permission. The most valuable work of the coming years will not be building and curating ever-larger mirrors of the research system, but finishing the foundational tasks we already know are necessary.
That means complete, genuinely open metadata in Crossref and DataCite, including affiliations with ROR IDs. It means treating organisational identifiers as mandatory infrastructure, not optional enhancements. It means near-universal ORCID adoption embedded directly into research workflows at institutions as well as publishers. It means continuing to invest in expanding the horizons of our open infrastructure (for instance, expanding our use of CRediT and embracing the possibilities of persistent identifiers for projects with RAiD). And it means resisting the temptation to paper over gaps with techniques that compromise the very principles openness is meant to serve.
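To make the affiliation point concrete, the fragment below is an illustrative sketch of a creator entry using DataCite schema 4.x style field names, with the affiliation carried as a resolvable ROR identifier rather than a free-text string; the ORCID iD and ROR ID shown are placeholders, not real registrations.

```python
# Illustrative fragment only: a DataCite-style (schema 4.x) creator entry in
# which the affiliation is a resolvable ROR identifier rather than free text.
# The ORCID iD and ROR ID below are placeholders, not real registrations.
creator = {
    "name": "Example, Researcher",
    "nameType": "Personal",
    "nameIdentifiers": [
        {
            "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",  # placeholder
            "nameIdentifierScheme": "ORCID",
            "schemeUri": "https://orcid.org",
        }
    ],
    "affiliation": [
        {
            "name": "Example University",
            "affiliationIdentifier": "https://ror.org/00xxxxxx0",  # placeholder
            "affiliationIdentifierScheme": "ROR",
            "schemeUri": "https://ror.org",
        }
    ],
}
```

When metadata is deposited at source in this form, no downstream scraping is needed to recover it.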
Research information citizenship, like civic citizenship, is rarely glamorous. It is slow, sometimes frustrating, and often invisible when done well. Proposed solutions that work around members of our community rather than working with them are inherently unstable. Shortcuts here do not save time in the long run; they merely defer the cost, with interest.
As we build the platforms on which the next decade of research evaluation, discovery, and automation will depend, now is the moment for institutions, funders, publishers, researchers, and research service providers to accept their shared responsibility to the open core — and to finish the foundational work properly.
