[wp-trac] [WordPress Trac] #60375: Site Transfer Protocol
WordPress Trac
noreply at wordpress.org
Wed Apr 10 20:29:52 UTC 2024
#60375: Site Transfer Protocol
-------------------------+------------------------------
Reporter: zieladam | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Import | Version:
Severity: normal | Resolution:
Keywords: | Focuses:
-------------------------+------------------------------
Comment (by dmsnell):
**A long post follows; please bear with me.**
In my proposal, I considered using a form of vector clock to track
potentially-unsynchronized state between connected WordPresses. I've tried
to convey an extremely rough sketch in the attachment above. This does not
address the conflated ID problem, but I can hopefully speak to that at the
end.
I propose a best-effort system for ensuring that updated resources are
detected and shared between connected sites, where connected sites are
admin-level connections communicating via a "backdoor" secure connection,
established by exchange of private/public key pairs.
For each connection, both sites will store a new record in their
synchronization state table indicating the identity of the connected
WordPress. This will be important for the UX of the system.
When resources are updated, they have inherent dependencies. These could
be files or related database records. By instrumenting `$wpdb` properly,
we can build associations and dependency chains automatically (or choose
to keep //all// resources in sync between sites and record everything).
Every time a record is updated, we track in a state table a version number
for that resource. This is a simple system: a write increments the version
by one, even if the data is the same as before the update.
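As a rough illustration of the counting rule above, here is a minimal in-memory sketch. The `VersionTable` class, its method name, and the `"post:42"` style identifiers are all hypothetical; a real implementation would presumably live in a database table and hook into writes.

```python
from collections import defaultdict


class VersionTable:
    """Hypothetical in-memory stand-in for the per-resource version table."""

    def __init__(self):
        # Resource id -> version number; unseen resources start at 0.
        self.versions = defaultdict(int)

    def record_write(self, resource_id):
        # Every write bumps the version by one, even when the written
        # data is identical to what was stored before.
        self.versions[resource_id] += 1
        return self.versions[resource_id]


table = VersionTable()
table.record_write("post:42")
table.record_write("post:42")  # same data written again still counts
```

Not comparing old and new data keeps the write path cheap, at the cost of occasionally transferring a resource that did not actually change.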
A site will then have a new table tracking every uploaded file, every
plugin, every database record, and every other resource it has, along
with a single version number for each of those. This table will be much smaller
than the tables containing those resources. Deleting a resource can be
represented through `NULL` or `0` or some other //tombstone//.
When sites connect, a primary site can transfer all its records (the
//Transfer//) to the secondary site. It will record in the sync-state
table which //version// of each resource it sent during the transfer
(and it can wait for acknowledgement from the receiving site). From this
point on it will have a sound guess at what content the secondary site
has.
When sites continue to communicate, the primary site can compare the
version of each resource it has updated against the version it last sent
to the secondary site. Any new, deleted, or updated resources are expected
to be stale on the secondary site and thus need to be transferred over.
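The comparison described above can be sketched as a simple diff between two mappings. This is only an illustration under assumptions: the `(version, deleted)` tuple is one possible tombstone encoding (the text leaves the choice open), and the function and identifier names are hypothetical.

```python
def stale_resources(local, last_sent):
    """Return ids whose local state differs from what was last sent.

    local:     resource id -> (version, deleted) on the primary site
    last_sent: resource id -> (version, deleted) last sent to, and
               acknowledged by, the secondary site
    """
    # A resource missing from last_sent is new; a differing tuple means
    # it was updated or deleted since the last transfer.
    return sorted(
        rid for rid, state in local.items() if last_sent.get(rid) != state
    )


local = {
    "post:1": (3, False),      # updated since the transfer
    "post:2": (1, False),      # unchanged
    "file:a.png": (2, True),   # deleted, kept as a tombstone
    "post:9": (1, False),      # created after the transfer
}
last_sent = {
    "post:1": (2, False),
    "post:2": (1, False),
    "file:a.png": (1, False),
}

print(stale_resources(local, last_sent))
# ['file:a.png', 'post:1', 'post:9']
```

Here a deletion also bumps the version, so a resource that is deleted and later re-created never ends up with the same state that was last sent.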
**User flows**
It's at this point we can see some high-level designs in this approach.
For minimal additional work and storage we can track what content needs to
be transferred. This can be presented to a user in a dashboard, and we can
even create "recognizers" to further classify the resources. For example,
a plugin can give a name and description to an otherwise unknown database
row. The primary site can perform a quick computation to estimate the
total number of resources needing a transfer, as well as their approximate
byte size.
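A sketch of that estimate, assuming the list of stale resource ids is already known and that an approximate byte size is available per resource (both the `byte_sizes` mapping and the function name are hypothetical):

```python
def estimate_transfer(stale_ids, byte_sizes):
    """Return (resource count, approximate total bytes) for a pending transfer."""
    # Resources with an unknown size contribute 0, so the byte total is
    # only a lower-bound approximation.
    total_bytes = sum(byte_sizes.get(rid, 0) for rid in stale_ids)
    return len(stale_ids), total_bytes


count, size = estimate_transfer(
    ["post:1", "file:a.png"],
    {"post:1": 2048, "file:a.png": 500_000, "post:2": 100},
)
print(count, size)  # 2 502048
```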
This method also depends on establishing two-way communication via the
"backdoor" channel. This can be achieved on standard WordPress hosts using
a combination of long-polling and `stream_select()` and some other
communication on the server, but does not require long-running PHP
processes or threads or forking processes. See the next attached image for
a preview of the dashboard.
This is a direct synchronization protocol, whereby two connected sites
trust each other, and the receiving site will import received content into
its database. One thing it currently lacks is a sense of provenance. It would
be favorable to store the source and timestamp of all imported resources
in order to be able to show what has been sync'd vs. what was created
locally.
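One possible shape for such a provenance record is sketched below. The field names and the convention that a missing source means "created locally" are assumptions for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record stored alongside each resource."""

    resource_id: str
    source_site: Optional[str]        # None: created on this site
    imported_at: Optional[datetime]   # None: never imported


def was_synced(record):
    # A resource counts as sync'd if it arrived from a connected site.
    return record.source_site is not None


imported = Provenance(
    "post:42",
    "https://primary.example",
    datetime(2024, 4, 10, 20, 29, tzinfo=timezone.utc),
)
local_only = Provenance("post:43", None, None)
```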
Because of the sync-state table, all transfers are interruptible and
trackable. They can fail and be retried. Also, through the use of the HTML
API and dependency inference, it's possible to prioritize resource
transfer, such that dependent resources exist on the receiving end before
the resource itself. This leads to zero-downtime transfers where an
imported post is immediately complete upon import, since any linked
content exists first and the post can be rewritten upon arrival with the
HTML API to update those links.
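The dependency-first ordering described above amounts to a topological sort. The sketch below assumes the dependency graph has already been inferred (how the HTML API would produce it is not shown) and uses Python's standard `graphlib` module (Python 3.9+); the resource ids are hypothetical.

```python
from graphlib import TopologicalSorter


def transfer_order(dependencies):
    """Order resources so that each one is sent after everything it depends on.

    dependencies: resource id -> set of resource ids it depends on
    """
    # static_order() yields dependencies before their dependents and
    # raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(dependencies).static_order())


deps = {
    "post:42": {"file:hero.png", "post:7"},  # post links to an image and another post
    "post:7": {"file:logo.svg"},
}
order = transfer_order(deps)
```

In the resulting order, `file:hero.png` and `post:7` come before `post:42`, so the post's links can be rewritten as soon as it arrives. Cyclic references (two posts linking to each other) would need separate handling, which this sketch does not attempt.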
**Discussion**
I apologize for how lengthy and simultaneously rough and prescriptive this
is. I'm trying to dump some ideas "onto paper" since @zieladam and I have
spoken about this many times. It's a big-picture idea for a technical
design that powers a specific user flow, which is all about visibility
into a reliable and interruptible synchronization process.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/60375#comment:24>
WordPress Trac <https://core.trac.wordpress.org/>