From ff212b45613e62b7e806e1caf761855a483b2c3b Mon Sep 17 00:00:00 2001
From: Henry Wilkinson
Date: Fri, 29 Nov 2024 12:50:38 -0500
Subject: [PATCH 1/2] Updates Webrecorder tools

- Expands WARC section to include WACZ files
- Adds ArchiveWeb.page
- Adds Browsertrix
- Adds other WARC related command line utilities
---
 README.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 767dc12..9fcab4b 100644
--- a/README.md
+++ b/README.md
@@ -1760,16 +1760,19 @@ Google lens is not too user friendly for investigations. But this tool will help
-### [](#warc)Tools for working with WARC (WebARChive) files
+### [](#warc)Tools for working with WARC (WebARChive) and WACZ (Web Archive Collection Zipped) files
 | Link | Description |
 | --- | --- |
-| [Warcat](https://github.com/chfoo/warcat) | My favorite (because it's the easiest) tool for working with Warc files. It allows you to see the list of files in the archive (command "list") and unpack it (command "extract"). |
-| [Replayweb](https://github.com/webrecorder/replayweb.page) | If the warc file is small, you can view its contents with this extreme simple online tool. Also it's possible to deploy ReplayWeb on your own server |
-| [Metawarc](https://github.com/datacoon/metawarc) | Allows you to quickly analyze the structure of the warc file and collect metadata from all the files in the archive |
-| [Webrecorder tools](https://webrecorder.net/tools) | Archiving various interesting sites is a noble and useful activity for society. To make it easier for posterity to analyze your web archives, save them in Warc format with an online tool|
-| [GRAB SITE](https://github.com/ArchiveTeam/grab-site) | Af you need to make a Warc archive out of a huge site with a lot of different content, then it is better to use this #python script with dozens of different settings that will optimize the process as much as possible.|
-| [har2warc](https://github.com/webrecorder/har2warc) | Convert HTTP Archive (HAR) -> Web Archive (WARC) format|
+| [Warcat](https://github.com/chfoo/warcat) | My favorite (because it's the easiest) tool for working with WARC files. It allows you to see the list of files in the archive (command "list") and unpack it (command "extract"). |
+| [Browsertrix](https://webrecorder.net/browsertrix/) | Browser-based crawling service that saves websites as WACZ files (containing WARCs). Hosted as SaaS by Webrecorder but can alternatively be self-deployed on your own infrastructure |
+| [ArchiveWeb.page](https://webrecorder.net/archivewebpage/) | Create WARC and WACZ files interactively as you navigate sites in your web browser. Good for saving high-fidelity |
+| [ReplayWeb.page](https://webrecorder.net/replaywebpage/) | If the WARC file is small, you can view its contents with this extremely simple online tool / desktop app. WACZ files of any size will load much faster due to their built-in index. Also it's possible to deploy ReplayWeb.page on your own server |
+| [Metawarc](https://github.com/datacoon/metawarc) | Allows you to quickly analyze the structure of the WARC file and collect metadata from all the files in the archive |
+| [warcit](https://github.com/webrecorder/warcit) | Command line utility to convert a local directory containing website files into a WARC file |
+| [unwarcit](https://github.com/emmadickson/unwarcit) | Command line utility to convert a WARC or WACZ file to a local directory containing website files |
+| [GRAB SITE](https://github.com/ArchiveTeam/grab-site) | If you need to make a WARC archive out of a huge site with a lot of different content, then it is better to use this #python script with dozens of different settings that will optimize the process as much as possible.|
+| [har2warc](https://github.com/webrecorder/har2warc) | Convert HTTP Archive (HAR) → Web Archive (WARC) format|
 [](#archives-of-documentsnewspapers)Archives of documents/newspapers

From 497aaaa96cd73de00980c8b8df02cdaabdd9ff35 Mon Sep 17 00:00:00 2001
From: Henry Wilkinson
Date: Fri, 29 Nov 2024 12:51:33 -0500
Subject: [PATCH 2/2] Adds Ghostarchive

- Adds description to Arquivo.pt
---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9fcab4b..efcda75 100644
--- a/README.md
+++ b/README.md
@@ -1726,11 +1726,11 @@ Google lens is not too user friendly for investigations. But this tool will help
 | Link | Description |
 | --- | --- |
 | [Quick Cache and Archive search](https://quickcacheandarchivesearch.onrender.com/) | quick search website old versions in different search engines and archives (21 source) |
-| [Trove](http://trove.nla.gov.au/search/category/websites) | australian web archive |
+| [Trove](http://trove.nla.gov.au/search/category/websites) | Australian web archive |
 | [Vandal](https://chrome.google.com/webstore/detail/vandal/knoccgahmcfhngbjhdbcodajdioedgdo/related) | extension that makes working with [http://archive.org](http://archive.org) faster, more comfortable, and more efficient. |
 | [TheOldNet.com](https://theoldnet.com/) | |
 | [Carbon Dating The Web](http://carbondate.cs.odu.edu/) | |
-| [Arquivo.pt](https://arquivo.pt/) | |
+| [Arquivo.pt](https://arquivo.pt/) | Portuguese web archive |
 | [Archive.md](https://archive.md/) | |
 | [Webarchive.loc.gov](http://webarchive.loc.gov/) | |
 | [Swap.stanford.edu](https://swap.stanford.edu/) | |
@@ -1739,6 +1739,7 @@ Google lens is not too user friendly for investigations. But this tool will help
 | [web.archive.bibalex.org](http://web.archive.bibalex.org/) | |
 | [Archive.vn](https://archive.vn/) | |
 | [UKWA](https://www.webarchive.org.uk/) | archive of more than half a billion saved English-language web pages (data from 2013) |
+| [Ghostarchive](https://ghostarchive.org/) | Free web archive that uses ReplayWeb.page for viewing archived sites |
 ### [](#tools-for-working-with-web-archives)Tools for working with web archives