Brainstorming distributed publishing

Turbocapitalist · 11-02-2021, 03:11 PM

I am looking explore options for distributed publishing. What software is there?

The material to be published is a mixture of text (lots of it), supplemented with images, audio, and video. In other words, it is basic web page material, but heavily text-oriented and numbering many tens of thousands of documents. I have no qualms about leaving HTTP/HTTPS behind if necessary or would Rsync web site mirrors be the best bet? In that case would it help if the documents were static, generated by a static site generator, or dynamic and stored in WordPress? NNTP and IPFS are out for different reasons, IPFS in particular because of its heavy CPU and bandwidth requirements. Gemini looks promising but is not finalized and currently has issues with large files, such as video. How is the redundancy in Ceph and can it tolerate nodes appearing and disappearing?

What else is there? Or, what else could be stitched together?

boughtonp · 11-03-2021, 08:55 AM

I can't tell if you're asking for a variation on a CMS, a Wiki, Wave, or something else.

Without more clarity, the only thing I can say is Git may be a better choice than Rsync, but may not be as good as something else.

Turbocapitalist · 11-03-2021, 09:25 AM

CMS and Wikis are centralized, as far as I know. I'm looking for something where several machines can serve the same documents even when one of the machines is unavailable. The overall availability should continue while one node is down..

boughtonp · 11-03-2021, 09:56 AM

CMSs and Wikis are not required to be centralized - they gain fault tolerance and availability through redundancy and replication.

https://en.wikipedia.org/wiki/Replication_(computing)

Turbocapitalist · 11-03-2021, 10:13 AM

The documents already exist so a Wiki would be out. From what I've seen, a CMS requires a substantial investment in learning and maintaining the software. Since the documents already exist as individual files, I am wondering what file oriented approaches are out there.

wpeckham · 11-03-2021, 10:40 AM

Quote:

Originally Posted by Turbocapitalist

The documents already exist so a Wiki would be out. From what I've seen, a CMS requires a substantial investment in learning and maintaining the software. Since the documents already exist as individual files, I am wondering what file oriented approaches are out there.

Have you considered a torrent node?
The problem would be advertising the files and convincing other nodes to replicate them. Once you accomplish that the files are available with a simple torrent link to anyone with a torrent client, and can come form any or several of the torrent nodes hosting the files. (Including client machines that have downloaded them!).

boughtonp · 11-03-2021, 11:29 AM

Ok, so this is essentially about reliable file download?

Are all the documents already created or will new ones arrive or existing ones be updated or replaced?
And will end-users always want the latest versions, or just the ones at the moment they download?

Torrents might work if they don't change and just want the current file(s), but if you add/remove/change the contents you get a new hash/torrent, and thus fork/divide the seeders each time there's a change.
On the other side of things, a Git-based solution would be better when users want to receive changes without re-downloading everything, however availability through multiple remotes would probably be unwieldy - better to run it atop a distributed file system, which would be what you wanted for non-Git downloads too.

https://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
and
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems#FOSS

I now see the "Ceph" that you mentioned before, but there's no "Gemini" (and a search seems to only bring up a single on-topic result, which is a PDF).

There's at least seven other non-proprietary options with "high availability" marked as yes - I'd start by checking which of those are either already in kernel (e.g. Coda) or supported/available in your distro's repos.

Turbocapitalist · 11-03-2021, 11:58 AM

Yes, it's about reliable download and browsing (in the generic sense) of various files. The caches or repositories would have to be updated with new documents frequently, perhaps several times per day, but once in the system the documents do not change.

Torrent would be great, if it were feasible to keep adding documents to the seed.

At this point I am wondering about disseminating the files within a pool of distributed nodes, that is to say the back end for any access system. Above all, I would like to keep it about two or three orders of magnitude simpler than WordPress. What about the Coda File System? Or can HAMMER2 nodes be far apart?

Gemini might be one possible way to access the documents from the outside: https://gemini.circumlunar.space/ but only once the files are already on the nodes.

computersavvy · 11-03-2021, 04:05 PM

Quote:

Originally Posted by Turbocapitalist

CMS and Wikis are centralized, as far as I know. I'm looking for something where several machines can serve the same documents even when one of the machines is unavailable. The overall availability should continue while one node is down..

So you are thinking of something such as a High Availability (HA) server. The tools are there, but I do not know if they are scaled for smaller usage. Many commercial organizations have their main systems in HA config so that if one fails it automatically fails over to the other and the user likely never even is aware. The data is mirrored between systems and the server is fail safe so when one fails the other automatically picks up.

Searching for high availability servers should give you several possibilities. It is not even really hard to set up for those who know how. It does require a minimum of 2 network paths between the servers and data stores.

The simplest would be 2 machines, each fully configured to be identical, where the active machine has constant monitoring by the standby machine and if connection is lost the standby picks up the network address and data services of the (previously) active one. There has to be continuous communication between the two where any data change on the active machine is mirrored on the standby machine immediately.

chrism01 · 11-03-2021, 07:56 PM

By the sounds of it, you could look at LVS with Direct Routing + round robin weighting http://www.linuxvirtualserver.org/how.html at the front end.

Then at least 2 webservers behind that and a NAS/SAN for actual docs behind them.
Obviously use RAID for max uptimes.

Turbocapitalist · 11-04-2021, 03:05 AM

High Availability but at a relatively small scale sounds about right, but without the heavy overhead of the larger approaches. The burden, especially that of specialized knowledge, required of the system administrator(s) must be as low as possible. It needs to avoid Parkinson's Law and pursue KISS.

The Linux Virtual Sever with Direct Routing looks about right except the method ought to work for nodes on different ISP's networks, sometimes with a noticeable latency between some of them.

I think I might be able to test something with Ceph soon.

wpeckham · 11-04-2021, 09:13 AM

So something like a gemini, httpd, or Archie server as a distributed cluster so that if one site/server died the internet presence would still exist?
Hmmm. That should be doable, but the best way would be a team creating cone servers at different sites with a mutual update mechanism, meaning you would all have to agree on formats, software, and both presentation and communication standards.

Torrents (one per file) and a torrent directory might be the easier option, but if you decide to go with some distributed cluster solution I want in on this! Sound like FUN!

computersavvy · 11-04-2021, 02:34 PM

When I worked for IBM the HA was accomplished with HA servers and a SAN data store that both servers accessed.

Each server monitored the other and if the 'master' failed to repond properly in a specified time period the secondary took over everything, including telling the 'master' to go offline totally if needed. They shared a network so there was a 'management' IP and a 'service' IP. The running server used 'service' IPs that were shared with the backup but those IPs were down on the backup server and active on the 'master'. The fail-over took those IPs down on the 'master' and brought them up on the backup so the user never knew the difference.

I can easily envision an NAS file server (or 2, running mirrored) with 2 servers using that data store. The 2 servers could easily be configured with identical services and as has been mentioned could use a round-robin style dns service to share the load or if one failed the remaining could do everything.