Source:
Library of Congress at http://www.digitalpreservation.gov/news/2009/20090604news_article_warc.html
June 4, 2009 -- The WARC file format is now approved as an
international standard: ISO 28500:2009.
For years, heritage organizations have tried to find the most appropriate
ways to collect and monitor World Wide Web material using web-scale tools. At
the same time, these organizations were concerned with the requirements for
archiving large numbers of born-digital and digitized files. They needed a
container format that enabled one file to carry a large and varied number of
data objects for storage, management and exchange.
The WARC format meets this need and is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It extends
the ARC format, which has been used since 1996 to store files harvested from the web. The WARC
format offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for every
contained file, the management of duplicates and of migrated records, and the
segmentation of records. WARC files are intended to store every type of
digital content, whether retrieved by HTTP or by another protocol.
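To make the container idea concrete, here is a minimal sketch of how a single record carrying an HTTP request might be serialized. It assumes the general WARC 1.0 record layout (a version line, named header fields, a blank line, the content block, then two line breaks); the helper name `build_warc_record` and the example URI are hypothetical, and this is an illustration rather than a conforming writer.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(warc_type, target_uri, payload, content_type):
    """Serialize one WARC-style record (hypothetical helper, not a library API)."""
    headers = [
        ("WARC-Type", warc_type),
        # Every record gets its own identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # Blank line ends the header; two line breaks end the record.
    return head + CRLF + payload + CRLF + CRLF


# Assumed example: store the HTTP request headers sent by a crawler.
request = b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"
record = build_warc_record(
    "request", "http://example.org/", request, "application/http; msgtype=request"
)
print(record.decode("utf-8"))
```

Records built this way can be appended one after another in the same file, which is what lets one WARC file carry many different objects.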
The motivation to extend the ARC format arose from the discussions and
experiences of the International Internet Preservation Consortium (IIPC),
whose core mission is to acquire, preserve and make accessible knowledge and
information from the Internet for future generations. The IIPC formed a
Standards Working Group to develop a document for the International
Organization for Standardization (ISO) to approve.
Over a period of four years, the working group, with the Bibliothèque
nationale de France as convener, collaborated closely with IIPC experts to
improve the original draft. The group will continue to maintain the standard
and prepare its future revisions.
Standardization offers a guarantee of durability and evolution for the WARC
format. It will help web archiving enter the mainstream activities of
heritage institutions and other sectors by fostering the development of new
tools and ensuring the interoperability of collections.
Several applications are already WARC-compliant, such as the Heritrix web
crawler, the ARC tools for data management and exchange, the Wayback Machine,
NutchWAX and other search tools for access. The international recognition of
the WARC format and its applicability to every kind of digital object will
provide strong incentives to use it within and beyond the web-archiving
community.