Blogmarks

Technology

In 1980, Tim Berners-Lee wrote a personal hypertext notebook at CERN. The web was born from that impulse -- and then became an extraction machine. The protocols now exist to correct that.

Origins

It started as a personal notebook

In 1980, a software consultant at CERN wrote a program called Enquire. It was a private hypertext system -- a way to track connections between people, projects, and documents. Nodes linked to other nodes through typed, bidirectional relationships. There was no network. No server. No audience. Just one person structuring what they knew.

That consultant was Tim Berners-Lee. A decade later, he would propose the World Wide Web -- a system that took Enquire's idea of linked documents and opened it to everyone. The web succeeded beyond anyone's imagination. But in the process, it abandoned the thing that made Enquire work: the person who wrote the data was the person who owned it.

That loss is the root of everything that went wrong.

[Diagram: Conceptual diagram of Tim Berners-Lee's Enquire system -- interconnected knowledge nodes with typed bidirectional links inside a personal boundary]

The drift

What the open web became

The web succeeded because it was open. Anyone could publish. Anyone could link. No gatekeeper decided what was allowed. But openness without ownership created a vacuum. HTTP had no native concept of identity. No access control. No data portability. These were not oversights -- they were deliberate simplifications that made the protocol adoptable. Companies filled that vacuum with user accounts, cookies, and databases they controlled.

Two decades of this produced a specific economic structure. Your reading history, your bookmarks, your search queries, your social graph -- all stored on servers you do not control, governed by terms you did not negotiate, monetized in ways you cannot audit. Shoshana Zuboff named it surveillance capitalism: the unilateral extraction of human experience as raw material for prediction products.

There are two categories of data loss here, and the distinction matters. The first is data you gave willingly -- account signups, saved preferences, reading lists. You consented because there was no alternative architecture that let you own those things yourself. The second is data collected without your knowledge or meaningful consent -- cross-site tracking, browser fingerprinting, behavioral profiles assembled from metadata you never intended to share. Both feed the same machine.

Even the tools that appear to respect users -- Pocket, Notion, Raindrop, Instapaper -- hold your data on their servers, in their format, under their terms. They are better landlords, but they are still landlords. You cannot take your data and walk away without their permission. You cannot grant your own AI agent access to your own reading history without going through their API, if they offer one at all.

The problem is not bad actors. The problem is that the architecture of the web made data extraction the path of least resistance, and data ownership an afterthought that nobody built into the protocols.

[Diagram: Abstract diagram showing one-directional data flows from a user outward to multiple opaque corporate silos with no return path]

The correction

Protocols that put ownership back in the stack

The problem is architectural, so the fix must be architectural too. Not a privacy policy. Not a promise. Not a settings toggle buried in a menu. Protocols that enforce ownership at the byte level -- where data is stored, who controls access, and how it moves between systems. These are the standards Blogmarks is built on.

W3C Solid Community Group / Inrupt

Solid (Social Linked Data)

Solid is a specification developed in the W3C Solid Community Group, initiated by Tim Berners-Lee. At its core is a simple idea: give every person a Pod -- a typed, permission-controlled filesystem on the web. Every resource in a Pod has a URL and a content-type. The owner decides which applications and agents can read or write. Access is controlled at the protocol level, not by platform policy.

This is Enquire reconstructed for the modern web. Where Enquire was a local notebook, a Solid Pod is a web-addressable notebook -- yours to share selectively, yours to revoke access from, yours to migrate between providers without data loss. The applications that read your Pod are decoupled from the storage itself. You can switch clients without switching data.
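
The access model can be made concrete with Web Access Control (WAC), the ACL vocabulary Solid Pods use. A minimal sketch granting one agent read access to one resource -- the WebID and resource URIs here are hypothetical:

```turtle
@prefix acl: <http://www.w3.org/ns/auth/acl#> .

# Hypothetical ACL: Alice's WebID may read one bookmarked resource.
<#reader>
    a acl:Authorization ;
    acl:agent <https://alice.example/profile/card#me> ;          # who is granted access
    acl:accessTo <https://alice.example/bookmarks/article-42.md> ; # which resource
    acl:mode acl:Read .                                           # read-only
```

Revoking an application's access means editing this document, not petitioning a platform.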

How Blogmarks uses this

Blogmarks uses a Solid Pod as the single source of truth for all user data. Raw assets (the original bytes you saved) and their extracted knowledge (Markdown, transcripts) live in your Pod. Blogmarks never holds canonical user content on its servers. This is not a feature -- it is a structural guarantee. Blogmarks cannot monetize your reading history because it never possesses it.

[Diagram: Isometric view of a Solid Pod -- WebDAV interface, typed resources, ACL permission layer, and multiple client apps connecting through the access boundary]

Anthropic (open specification)

MCP (Model Context Protocol)

Knowledge stored in your Pod only has value if it is accessible to the right tools. The Model Context Protocol is an open standard that defines how AI agents request and receive structured context from external sources. It is the interface between your knowledge base and any AI assistant you choose to use.

MCP is designed for local-first operation. A Blogmarks MCP server runs on your machine, queries a local vector index derived from your Pod, and returns semantically relevant context to the agent. The AI never accesses your Pod directly. Your data never leaves your device to reach Blogmarks servers. The query path is entirely local.

How Blogmarks uses this

Blogmarks exposes your knowledge base through a local MCP server compiled as a single Rust binary. It is AI-provider-agnostic -- Claude, ChatGPT, or any future agent that supports MCP can query what you have saved. Zero data egress at query time. The MCP server is a read-only retrieval interface; it never runs extraction logic or writes to your Pod.
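
On the wire, MCP is JSON-RPC 2.0. A sketch of what a retrieval call from an agent to the local server might look like -- the tool name and arguments are hypothetical, not Blogmarks' actual interface:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_bookmarks",
    "arguments": { "query": "surveillance capitalism", "limit": 5 }
  }
}
```

The response carries the matching context back to the agent over the same local connection; nothing in this exchange touches a remote server.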

[Diagram: Sequence of an AI agent querying a local MCP server, which queries a local vector index, all within a User Device boundary with no external data crossings]

W3C Schema.org Community Group (Google, Microsoft, Yahoo, Yandex)

Schema.org

Schema.org is a collaborative vocabulary for structured data on the web. Founded in 2011 by the four largest search engines, it has been stable for more than a decade, is used by more than 30% of web pages, and has never broken backward compatibility. It defines type hierarchies and properties that let any machine -- search engine, feed reader, AI assistant, knowledge graph -- understand content without custom parsing.

Structured data is how your content participates in the open web without depending on any single platform to interpret it. JSON-LD (JavaScript Object Notation for Linked Data) is the recommended serialization format: a script tag in your HTML that describes the content semantically. No special libraries. No build plugins. No runtime cost.

How Blogmarks uses this

Blogmarks maps its data model to three Schema.org types: BlogPosting for enriched bookmarks (headline, abstract, keywords, author, dates), BookmarkAction for the curation event itself (separating the act of saving from the content saved), and DefinedTerm for the tag taxonomy. 17 of 19 data model fields map natively to Schema.org properties. The remaining two are internal operational metadata that no external vocabulary should govern.
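
Rendered to a page, that mapping becomes a JSON-LD block. A minimal sketch for one enriched bookmark -- the values are illustrative, the properties are standard Schema.org:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Why Data Ownership Is an Architecture Problem",
  "abstract": "A short summary extracted at save time.",
  "keywords": ["ownership", "protocols"],
  "author": { "@type": "Person", "name": "Jane Doe" },
  "datePublished": "2024-05-01",
  "dateModified": "2024-06-12"
}
</script>
```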

W3C

RDF, Linked Data & WebID

RDF (Resource Description Framework) is the W3C standard for describing resources as subject-predicate-object triples -- the smallest possible unit of meaning. Linked Data principles extend this: use URIs to name things, make those URIs dereferenceable, and link to other resources so that discovering one piece of data leads to more. WebID provides decentralized identity: a URI that identifies a person, resolvable to a profile document that describes them; access-control rules reference that URI to grant or deny permissions.

These are the foundations that make Solid Pods work. Data in a Pod is not just files -- it is typed, interlinked, and self-describing. A Blogmarks asset stored as RDF triples can be understood by any Linked Data client without prior knowledge of Blogmarks. WebID means your identity is not issued by a platform -- it is a URI you control.

How Blogmarks uses this

Blogmarks defines its own RDF vocabulary in Turtle (Terse RDF Triple Language) for asset metadata. Each bookmarked asset is described with predicates for source URL, canonical URL, content type, extraction confidence, and more. These triples are stored alongside the raw bytes and extracted Markdown in your Solid Pod. WebID governs who can access them.
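
As a sketch, one asset's metadata might serialize like this -- the bm: namespace and predicate names are hypothetical, chosen only to mirror the fields listed above:

```turtle
@prefix bm:  <https://blogmarks.example/vocab#> .      # hypothetical vocabulary namespace
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://alice.example/bookmarks/article-42>
    bm:sourceUrl    <https://example.com/post?utm_source=feed> ;
    bm:canonicalUrl <https://example.com/post> ;
    bm:contentType  "text/html" ;
    bm:extractionConfidence "0.93"^^xsd:decimal .
```

Because this is plain RDF, any Linked Data client can traverse it without knowing anything about Blogmarks.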

Apache Software Foundation (Top-Level Project)

Apache OpenDAL

Apache OpenDAL is a unified data access layer written in Rust. It provides a single Operator API for reading, writing, listing, and deleting across more than 40 storage backends -- including WebDAV (the protocol Solid Pods expose), local filesystem, S3-compatible stores, and in-memory backends for testing.

The value is backend portability. Application code uses one API. Switching from a local filesystem in development to a Solid Pod in production is a configuration change, not a code change. The abstraction does not leak Solid-specific concerns into application logic -- access control, WebID authentication, and RDF handling live in a thin layer above OpenDAL.

How Blogmarks uses this

Blogmarks uses OpenDAL as the storage access layer between all Rust components and the Solid Pod. The MCP server reads assets through OpenDAL. The ingestion pipeline writes through it. In tests, the in-memory backend eliminates network dependencies entirely. OpenDAL is feature-gated to include only the backends Blogmarks needs, keeping the compiled binary small.
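
The single-API claim can be sketched with OpenDAL's Operator. This is not self-contained -- it assumes the opendal and tokio crates -- and the paths are illustrative; in production the Memory builder would be swapped for a WebDAV builder pointing at the user's Pod, with no other code changes:

```rust
use opendal::{services, Operator};

#[tokio::main]
async fn main() -> opendal::Result<()> {
    // In-memory backend, as used in tests. Swapping in services::Webdav
    // (pointing at a Solid Pod) changes only this construction line.
    let op = Operator::new(services::Memory::default())?.finish();

    // Same read/write API regardless of backend.
    op.write("assets/article-42.md", "# Extracted Markdown").await?;
    let _bytes = op.read("assets/article-42.md").await?;
    Ok(())
}
```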

Web platform

The web platform, used correctly

Beyond the ownership-specific protocols, Blogmarks uses the web platform as it was designed to be used. No frameworks on top of frameworks. No proprietary extensions. Standards that have been stable for years, implemented correctly.

Progressive Web App + Share Target API

Installable on any device, works offline through service workers, receives shared URLs from any app via the Web Share Target API. No app store. No gatekeeper.

Service Workers + Workbox

Intelligent caching: CacheFirst for static assets, NetworkFirst for API calls. The app loads instantly from cache and updates in the background.

RSS 2.0

An open syndication feed with no algorithm, no account required, no tracking. Subscribe from any feed reader. The format dates back to 1999 and is still the best way to follow content.

WebDAV

The HTTP extension protocol that Solid Pods expose for resource management. Read, write, list, delete -- HTTP methods (GET, PUT, PROPFIND, DELETE) on web-addressable resources.

JSON-LD

The serialization format for Schema.org structured data. A script tag in HTML that makes content machine-readable. No runtime, no library, no build step.
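
The Share Target wiring from the PWA entry above lives in the web app manifest. A minimal sketch -- the /share action path is hypothetical:

```json
{
  "name": "Blogmarks",
  "display": "standalone",
  "start_url": "/",
  "share_target": {
    "action": "/share",
    "method": "GET",
    "params": { "title": "title", "text": "text", "url": "url" }
  }
}
```

Any app with a native share sheet can then hand a URL to the installed PWA, with no app-store intermediary.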

Security

Security headers are not optional

Every response from Blogmarks includes security headers that restrict what the browser is allowed to do. This is not defense in depth -- it is the baseline. No tracking scripts. No third-party analytics. No fingerprinting.

Content-Security-Policy -- Restricted connect-src and script-src. Only self, authentication endpoints, and explicitly trusted origins.
Strict-Transport-Security -- max-age=31536000; includeSubDomains. All connections over HTTPS. No exceptions.
X-Frame-Options -- DENY. The application cannot be embedded in iframes. Clickjacking is structurally impossible.
X-Content-Type-Options -- nosniff. The browser will not MIME-sniff responses away from the declared content-type.
Permissions-Policy -- camera=(), microphone=(), geolocation=(). Hardware capabilities are denied at the policy level, not just unused.

Architecture

How it fits together

The Blogmarks architecture has four runtime components with strict separation of concerns. The ingestion pipeline writes. The MCP server reads. The Solid Pod stores. The browser extension triggers. They never cross boundaries.

A URL enters the system through the browser extension or mobile share sheet. The ingestion pipeline fetches the bytes, detects the content-type, runs the appropriate extractor (HTML, PDF, and more to come), and stores both the raw bytes and the extracted Markdown in the user's Solid Pod. It then indexes semantic chunks in a local vector store.
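
The content-type dispatch step can be sketched in Rust. This is a hypothetical simplification -- the function and extractor names are invented for illustration, not Blogmarks' actual registry:

```rust
/// Hypothetical sketch: map a fetched resource's Content-Type header
/// to the extractor that should process its bytes.
fn pick_extractor(content_type: &str) -> Option<&'static str> {
    // Strip parameters like "; charset=utf-8" before matching.
    let mime = content_type.split(';').next().unwrap_or("").trim();
    match mime {
        "text/html" | "application/xhtml+xml" => Some("html"),
        "application/pdf" => Some("pdf"),
        _ => None, // unknown types: store the raw bytes, skip extraction
    }
}

fn main() {
    assert_eq!(pick_extractor("text/html; charset=utf-8"), Some("html"));
    assert_eq!(pick_extractor("application/pdf"), Some("pdf"));
    assert_eq!(pick_extractor("image/png"), None);
    println!("dispatch ok");
}
```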

The MCP server -- a separate, compiled Rust binary -- queries that vector index on behalf of AI agents. It reads from the Pod through OpenDAL for full-text retrieval when needed. It never runs extraction logic. The ingestion pipeline never handles queries. The Solid Pod is the only shared state between them.

This boundary is sacred. It means each component can be replaced, upgraded, or audited independently. It means the MCP server has no write path to your data. It means your knowledge base is never one monolithic system -- it is a pipeline with clean interfaces.

[Diagram: System architecture -- Browser Extension sends URLs to Ingestion Pipeline, which writes to a central Solid Pod. The MCP Server reads from the Pod through OpenDAL to serve AI agents.]

Manifesto

Your assets. Your bytes. Your knowledge.

The things you read, watch, and listen to are part of who you are. They should belong to you -- not to the platform that happened to serve them, not to the startup whose database holds your reading history.

They should be legible to your tools, portable across apps, and available to your AI agents without asking anyone's permission.

A bookmark is not a URL. It is a pointer to bytes. Those bytes have a content-type, and the content-type determines how to extract meaning from them. Blogmarks resolves the pointer, fetches the bytes, runs the right extractor, and stores the resulting knowledge -- owned by you, in a Solid Pod, forever rebuildable.

Tim Berners-Lee's first program, Enquire (1980), was a private hypertext notebook -- local, personal, yours. Solid is Enquire reconstructed after watching what the open web became. Blogmarks builds on that correction.

Knowledge you accumulate should compound for you, not for someone else's data business.

[Illustration: Timeline -- 1980s terminal with connected nodes (Enquire) on the left, modern interface with AI-enriched knowledge graph (Blogmarks) on the right, connected by an evolutionary arc]