How Can We Effectively Preserve and Archive Public Data?

February 20, 2025

Rupert Marais is an in-house Security specialist with expertise in endpoint and device security, cybersecurity strategies, and network management. In this interview, Rupert will provide insights into the preservation of government web pages and data, the impact of administration changes on digital archiving, and the ongoing challenges faced by digital preservation initiatives.

Can you start by explaining the importance of preserving government web pages and data?

Preserving government web pages and data is crucial because it ensures that vital public information remains accessible and transparent. Government data is not only a resource for citizens to stay informed but is also used by researchers, historians, and policymakers to analyze and understand societal trends and make informed decisions.

What specific changes took place when the Trump administration came into office regarding government websites and databases?

When the Trump administration assumed office, more than 8,000 pages across several government websites and databases were taken down. References to gender and diversity initiatives were purged, and some sites, such as that of the U.S. Agency for International Development (USAID), remained offline entirely.

Why were certain references, such as those to gender and diversity initiatives, removed from these web pages?

The removal of references to gender and diversity initiatives was likely a reflection of the new administration’s policy priorities and ideological stance. These changes illustrate how government websites can reflect the current administration’s values and objectives, leading to the erasure of previously emphasized information.

How did the federal judge’s ruling on February 11 impact the restoration of public data?

The federal judge’s ruling on February 11 mandated that government agencies restore public access to pages and datasets maintained by the Centers for Disease Control and Prevention (CDC) and the Food and Drug Administration (FDA). This ruling reinforced the importance of keeping public data accessible and countered the administration’s efforts to withhold information.

Why was the Justice Department’s argument that the Internet Archive’s Wayback Machine could make the information accessible not persuasive to the judge?

The judge was not persuaded by the Justice Department’s argument because the Wayback Machine requires users to know the original URL of an archived page to access it. This limitation makes the Wayback Machine an impractical solution for someone seeking information without prior knowledge of specific URLs.

Can you tell us more about the role of the Internet Archive and how it operates to preserve web content?

The Internet Archive is a non-profit organization dedicated to providing universal access to knowledge. It records more than a billion URLs every day and maintains the End of Term Web Archive, which documents changes to federal government sites during administration transitions. By capturing vast amounts of web content, the Internet Archive plays a critical role in preserving digital history.

What is the End of Term Web Archive, and how does it relate to the Internet Archive’s efforts?

The End of Term Web Archive is a collaborative project that captures and preserves changes to U.S. federal government websites at the end of presidential terms. This initiative, hosted by the Internet Archive, ensures that historical snapshots of government web content are accessible, providing a record of digital information shifts across different administrations.

How does the Internet Archive manage to record more than a billion URLs every day?

The Internet Archive uses automated web crawlers to systematically browse and record web pages. These crawlers work around the clock, capturing snapshots of websites to build a comprehensive digital archive. The sheer scale of this operation allows the Archive to preserve enormous amounts of information daily.
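At its core, a crawler fetches a page, extracts its links, and queues them for further crawling. The link-extraction step can be sketched with the Python standard library; the HTML snippet and URLs below are illustrative, not the Internet Archive's actual crawler code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets, resolved against the page's own URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so they can be queued directly.
                    self.links.append(urljoin(self.base_url, value))

# Illustrative page content; a real crawler would fetch this over HTTP.
page = '<a href="/data">Data</a> <a href="https://example.gov/about">About</a>'
extractor = LinkExtractor("https://example.gov/")
extractor.feed(page)
# extractor.links now holds the next URLs for the crawl frontier
```

A production crawler adds politeness delays, robots.txt handling, and deduplication on top of this loop, but the fetch-extract-queue cycle is the same.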

Why is it important to have collaborations like the Environmental Data and Governance Initiative and the Association of Health Care Journalists in these archiving efforts?

Collaborations with groups like the Environmental Data and Governance Initiative and the Association of Health Care Journalists enhance the archival process by providing expert analysis and documentation of changes. These partnerships help identify crucial data that might otherwise be overlooked and improve the accuracy and comprehensiveness of the archives.

How does the Library Innovation Lab’s project, particularly the data.gov archive, complement existing web crawls?

The Library Innovation Lab’s project complements existing web crawls by focusing on data sets served through interactive web services. By querying the underlying APIs directly, the Lab can fetch and archive data straight from the source, ensuring that even complex, interactive content that traditional web crawls might miss is preserved in a usable format.

What challenges arise in archiving interactive web services and databases, and how does the Library Innovation Lab approach these challenges?

Archiving interactive web services and databases presents challenges because such content often requires user interaction, JavaScript execution, or form submissions. The Library Innovation Lab overcomes these obstacles by directly accessing APIs, which lets them retrieve data without rendering the web pages themselves and circumvent the barriers that interactivity poses.

Could you explain the process of querying APIs to access and archive data sets?

Accessing and archiving data sets through APIs involves writing scripts that send queries to the APIs to retrieve data. For example, the Library Innovation Lab’s project with data.gov involved sending 300 paginated queries, each fetching 1,000 items. These queries provided direct access to comprehensive data sets that were then archived efficiently.
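The mechanics of such a paginated harvest can be sketched as follows. The endpoint, parameter names, and page size here are illustrative assumptions, not the Lab's actual code; most catalog APIs expose some equivalent of a limit/offset pair:

```python
import urllib.parse

# Hypothetical API endpoint -- illustrative only.
BASE_URL = "https://example.gov/api/datasets"
PAGE_SIZE = 1000        # items per query
TOTAL_ITEMS = 300_000   # 300 queries x 1,000 items each

def build_queries(base_url, total_items, page_size):
    """Build one URL per page so the full collection is covered."""
    queries = []
    for offset in range(0, total_items, page_size):
        params = urllib.parse.urlencode({"limit": page_size, "offset": offset})
        queries.append(f"{base_url}?{params}")
    return queries

queries = build_queries(BASE_URL, TOTAL_ITEMS, PAGE_SIZE)
# 300 URLs, one per 1,000-item page; a harvester would fetch each in turn
```

Each URL is then fetched and its response stored, so the archive holds the catalog's full contents rather than whatever a crawler happened to render.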

What methods does the Library Innovation Lab use to ensure the data is captured in a usable format?

To ensure data is captured in a usable format, the Library Innovation Lab employs automated scripts and tools that convert the retrieved data into formats that can be easily analyzed, such as CSV or Excel files. This process ensures that the archived data remains accessible and functional for researchers and users.

What is the principle of LOCKSS and how is it applied in the preservation of digital data?

LOCKSS, or Lots Of Copies Keep Stuff Safe, is the principle of maintaining multiple copies of data in various formats and locations to ensure its longevity and security. This approach reduces the risk of data loss by diversifying storage media, ownership, and geographic distribution.
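In practice, LOCKSS-style systems periodically audit their replicas by comparing content fingerprints and repairing any copy that disagrees with the majority. A toy sketch of that audit, not any particular system's implementation:

```python
import hashlib
from collections import Counter

def digest(data: bytes) -> str:
    """Content fingerprint used to compare copies."""
    return hashlib.sha256(data).hexdigest()

def audit_copies(copies):
    """LOCKSS-style audit: the digest held by the majority of copies
    is treated as authoritative; dissenting copies need repair."""
    digests = [digest(c) for c in copies]
    majority, _ = Counter(digests).most_common(1)[0]
    damaged = [i for i, d in enumerate(digests) if d != majority]
    return majority, damaged

# Three replicas; the third has silently corrupted (bit rot).
copies = [b"dataset-v1", b"dataset-v1", b"dataset-v1 (bit rot)"]
good_digest, to_repair = audit_copies(copies)
# to_repair identifies the replica to restore from the healthy copies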

Can you elaborate on the measures taken by the Internet Archive to secure and replicate the data they preserve?

The Internet Archive employs several measures to secure and replicate data, including maintaining multiple copies in different physical locations, both domestically and internationally. These copies are stored on various types of media and controlled by different organizations, enhancing the resilience and security of the archived data.

How does the preservation of US government data benefit people around the world?

US government data benefits people globally by providing valuable information on topics such as health, energy, agriculture, and security. Researchers, policymakers, and individuals worldwide can access this data to gain insights, make informed decisions, and contribute to their fields of study or interest.

Why is it beneficial for data copies to be diverse across different metrics?

Diversity in data copies enhances security and reliability. By storing data on different media types, controlled by different entities, with various funding sources, the risk of simultaneous loss or corruption decreases. This approach ensures that at least some data copies will survive potential threats or failures.

How are cryptographic signatures and timestamps used in archiving to verify the validity of the data?

Cryptographic signatures and timestamps provide proof of the origin and creation time of archived data. Each time an archive is created, it is signed with a cryptographic proof tied to the creator’s email address and a timestamp, which helps verify the authenticity and integrity of the archived information.
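One way to realize such a provenance record, offered as a sketch rather than the actual scheme any archive uses, is to hash the archived payload and sign a manifest containing the creator identity and timestamp. HMAC from the standard library keeps the example short; real systems would use public-key signatures so anyone can verify without holding the secret:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-secret"  # illustrative; real systems use asymmetric keys

def sign_archive(payload: bytes, creator: str, timestamp: str):
    """Produce a signed manifest binding content, creator, and time."""
    manifest = json.dumps({
        "sha256": hashlib.sha256(payload).hexdigest(),
        "creator": creator,
        "timestamp": timestamp,
    }, sort_keys=True)
    sig = hmac.new(SIGNING_KEY, manifest.encode(), hashlib.sha256).hexdigest()
    return manifest, sig

def verify(manifest: str, sig: str) -> bool:
    """Recompute the MAC and compare in constant time."""
    expected = hmac.new(SIGNING_KEY, manifest.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

manifest, sig = sign_archive(
    b"page snapshot", "archivist@example.org", "2025-02-11T00:00:00Z"
)
```

Any later change to the payload or the manifest invalidates the signature, which is what lets a reader trust that the archived snapshot is what the archivist captured, when they captured it.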

How has the removal of material from federal websites under the Trump administration differed from previous administrations?

The removal of material from federal websites under the Trump administration has been more extensive and chaotic compared to previous transitions. While changes have occurred with each new administration since Bill Clinton’s presidency, the scale and scope of removals under the Trump administration have been significantly greater.

What ongoing challenges do digital archivists face, especially with older websites and evolving internet standards?

Digital archivists face numerous challenges, including dealing with the backlog of older websites, adapting to evolving internet standards, and ensuring compatibility across different formats. Additionally, archivists must navigate the complexities of interactive and dynamic web content, which requires innovative solutions like API access to capture.

How has your work on data preservation projects like this shaped your perspective on the value of public data?

Working on data preservation projects has underscored the immense value of public data for me. Government data acts as a critical navigational tool, providing the information necessary for informed decision-making. Engaging with this data has helped me appreciate its significance and the need to protect and maintain its accessibility.

What do you see as the future of digital archiving and the preservation of government data?

The future of digital archiving and government data preservation will likely involve more sophisticated technology and greater collaboration among various organizations. As digital content becomes increasingly complex, archivists will need to develop new methods to capture, preserve, and verify data. Additionally, the importance of maintaining multiple, diverse copies of data will become even more critical to ensure its longevity and security.
