IndieArchive: Difference between revisions

Revision as of 18:27, 10 May 2014

This article is a stub. You can help the IndieWeb wiki by expanding it.

IndieArchive is a project to collaboratively grow an archive of pages replied to (possibly also mentioned) in indie web posts.

GitHub: https://github.com/i-a
Live raw files: http://i-a.github.io/

URL design

Like archive.org URLs, but without the meaningless redundancy, and with dates and times compressed using NewBase60 to make the URLs even shorter.

archive.org example:

i-a design:

3_C = 2005-03-10 converted to epoch days (12852 days since 1970-01-01), written in NewBase60
- e.g. using CASSIS ymd_to_sd() function
3Kj = 03:19:44 = 11984 seconds since midnight, written in NewBase60
- e.g. using CASSIS num_to_sxg() function
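
The two conversions above can be checked with a short sketch. This is an independent Python illustration, not the CASSIS code: the function names `to_newbase60`, `epoch_days`, and `seconds_since_midnight` are hypothetical, and the only thing taken from NewBase60 itself is its 60-character alphabet (digits, uppercase without I and O, underscore, lowercase without l).

```python
from datetime import date

# NewBase60 alphabet: 10 digits, 24 uppercase letters (no I, O),
# underscore, 25 lowercase letters (no l) = 60 symbols.
ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz"


def to_newbase60(n: int) -> str:
    """Encode a non-negative integer as a NewBase60 string (hypothetical helper)."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 60)
        digits.append(ALPHABET[remainder])
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits))


def epoch_days(d: date) -> int:
    """Number of days from 1970-01-01 to the given date."""
    return (d - date(1970, 1, 1)).days


def seconds_since_midnight(hours: int, minutes: int, seconds: int) -> int:
    """Time of day expressed as a count of seconds."""
    return hours * 3600 + minutes * 60 + seconds


print(to_newbase60(epoch_days(date(2005, 3, 10))))       # 3_C
print(to_newbase60(seconds_since_midnight(3, 19, 44)))   # 3Kj
```

Because the base is 60, the three NewBase60 digits of a time of day are simply the hour, minute, and second: 3 → "3", 19 → "K", 44 → "j".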

Notes:

no "web." nor "/web" - we know this is a web archive, duh.
no "http:" - it's the default - optimize for the majority.
- Other schemes may be explicit, but omit the "//" (per [1]), e.g. "https:", "ftp:" as shown in the example
Trailing slash is required on root-level domains
- Whenever a URL has a trailing slash, filename on disk must be `index.html`
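
As a rough illustration of the notes above, the sketch below turns an original URL into the URL portion of an archive path and into a filename on disk. It only covers the rules listed here (drop "http:", keep other schemes without "//", require a trailing slash on root-level domains, use `index.html` for trailing slashes). The function names are hypothetical, how the date and time segments are combined with this portion is not shown, and dropping query strings and fragments is an assumption since the notes do not mention them.

```python
from urllib.parse import urlsplit


def archive_url_part(url: str) -> str:
    """URL portion of an archive path, following the notes above (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL such as http://example.com/")
    # "http:" is the default and is omitted; other schemes stay explicit,
    # but without the "//".
    prefix = "" if parts.scheme == "http" else parts.scheme + ":"
    # Root-level domains must end in a trailing slash.
    path = parts.path or "/"
    # Assumption: query strings and fragments are dropped in this sketch.
    return prefix + parts.netloc + path


def disk_filename(url_part: str) -> str:
    """Filename on disk: a trailing slash maps to index.html."""
    if url_part.endswith("/"):
        return url_part + "index.html"
    return url_part


for original in ("http://example.com",
                 "https://example.com/log/2005/03.html",
                 "ftp://ftp.example.org/pub/"):
    part = archive_url_part(original)
    print(original, "->", part, "->", disk_filename(part))
```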

Origins

The idea came out of an IRC conversation between aaronpk and tantek about link rot and an OSCON session (2011? 2010?) on a WordPress plugin that automatically made personal archives of pages you linked to.

So what would be a good way to make a backup of every page you link to from your site?

We could do it collaboratively for public pages: our publishing systems could automatically archive the pages we link to.

Use-cases:

Repair link rot: when things you've linked to go down or away, (semi-)automatically redirect those links to your archives.
Shared reply/link contexts: see the thoughts from IRC below.

Thoughts

On the design / usage of IndieArchive - notes from IRC:

Inception:

Itch: "holy cow, I've already downloaded 63 megs of HTML from external pages in my reply context code"[2]
Scratch: "maybe we should just shove archive HTML pages into static github pages"[3]

Design:

keep snapshots by datetime index… we won't collide

we place no copyright claims and state it is purely for library/archive purposes only, and anyone is welcome to clone (not modify) and keep the same terms

once we get it going, we could even likely talk archive.org into maintaining a mirror of it - and it's git, so it's easy for anyone to mirror!

performance shouldn't be an issue either - as rarely should we need to reparse the HTML

still locally store parsed JSON bits for speed of retrieval / display on my server, but then keep a URL to the copy of the HTML on i-a.

Web of Social Trust:

we build up the write-access contributors through a social web of trust, starting from the initial seed of IndieWebCamp attendees who have their own indieweb sites set up - whatever they used to sign in to indiewebcamp.com

Aside: IndieWeb vs. FSW methodologies and coming up with this

this is the kind of stuff that you end up figuring out / building when you scratch your own itches and follow them to their logical conclusions
I don't think anyone in FSW circles came up with the idea of a distributed collaborative mirroring of posts that they replied to! (despite it being a really simple idea)
everyone was so focused on building a full stack that did everything
instead of distributed systems that cooperatively grew

Aside: Unexpected uses

wouldn't surprise me if someone asks for a feed of all the URLs that get linked to, as they happen

Collaboration

The Internet Archive (archive.org) is a logical organization to coordinate with for indiearchiving.

2013-146: Brewster Kahle said to email him about how the Internet Archive can help with IndieArchive.

Brainstorming

Some blue-sky speculation on the idea of "memory" in the web, both for remembering (archiving) and for forgetting.

A site that notices certain off-site links are frequently clicked could begin archiving the target page locally - it retains a "memory" of the target content, and so that content has a greater chance of remaining over time. Conversely, a site that feels the need to do so could choose to remove un-clicked links over time - it "forgets" about this portion of the web.

Clients browsing sites could do this directly as well: frequently visited sites are archived locally, while infrequently visited sites are simply not archived ("forgotten").
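
One way to picture this "memory" idea is a simple click counter with a threshold. The sketch below is purely illustrative: the `LinkMemory` class, its method names, and the threshold value are all hypothetical, and a real site or client would probably also weigh how recent the clicks are.

```python
from collections import Counter


class LinkMemory:
    """Tracks clicks on off-site links to decide what to remember or forget."""

    def __init__(self, archive_threshold: int = 5):
        # Hypothetical cut-off: links clicked at least this often get archived.
        self.archive_threshold = archive_threshold
        self.links = set()
        self.clicks = Counter()

    def add_link(self, url: str) -> None:
        """Register an off-site link that appears on the site."""
        self.links.add(url)

    def record_click(self, url: str) -> None:
        """Count one click on a link, registering the link if needed."""
        self.links.add(url)
        self.clicks[url] += 1

    def to_archive(self) -> list:
        """Links clicked often enough to be archived locally ("remembered")."""
        return sorted(u for u in self.links
                      if self.clicks[u] >= self.archive_threshold)

    def to_forget(self) -> list:
        """Links that were never clicked and are candidates for removal."""
        return sorted(u for u in self.links if self.clicks[u] == 0)


memory = LinkMemory(archive_threshold=2)
memory.add_link("http://example.com/popular")
memory.add_link("http://example.com/ignored")
memory.record_click("http://example.com/popular")
memory.record_click("http://example.com/popular")
print(memory.to_archive())  # ['http://example.com/popular']
print(memory.to_forget())   # ['http://example.com/ignored']
```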