- Scope
A complete open publishing scenario consists of the following parts:
- Writer creates a GIT repository and does the necessary setup to make it an open publishing repo.
- Writer authors a document (in markdown format) and adds it to his local GIT repo.
- Writer pushes his local GIT repo to the remote.
- Open publishing picks up the GIT change and transforms the markdown file into output (in HTML format).
- Open publishing delivers the output HTML file to the MSDN web site.
- Open publishing sends a notification to the writer to let him know his document has been published.
- Writer is able to view his document on the MSDN web site.
Open publishing build is mainly involved in #1 (Provisioning), #4 (Build), and #6 (Notification).
- Architecture Design
Here is a diagram of the overall architecture of the build service.
There are three main services in open publishing:
- Build, responsible for provisioning, transforming the user's content into output, pushing it to the delivery service, and notifying the user.
- Delivery, responsible for storage of content.
- Rendering, responsible for rendering output into the final HTML page.
Below is the flow for the provisioning and publishing scenarios:
Provisioning:
- Writer creates a GIT repo.
- Writer calls the build web service API to create an open publishing repository.
- Build service writes the repo information into the configuration DB.
- Build service calls the delivery service to do the necessary setup on the delivery side (e.g., create a delivery repo).
- Build service returns success to the user.
Publishing:
- Writer commits a change to the local GIT repo and pushes to remote.
- GIT server sends a webhook to the build web service.
- Build web service calls the dispatcher to dispatch a build request.
- Dispatcher reads the configuration DB and dispatches the build request to a build worker.
- Build worker pulls changes from the GIT repo and reads the previous build output from storage to do an incremental build.
- Build worker saves the build output and build report to storage.
- Build worker pushes the build output to the delivery service through the delivery web service.
- Build worker sends a notification to the writer about completion of the publish.
- (Alternative) Writer can also call the build web service to query publish status.
- Writer sees his content on the MSDN web site (through the rendering service).
- Provisioning
The first step of open publishing is to create a GIT repository and connect it to open publishing so that any changes in this repo can be automatically published.
This step is called provisioning, which mainly contains two parts:
- The GIT repository must be created following the schema defined by open publishing.
- The user must specify the GIT repository and necessary configuration through the management portal or API.
The design principle here is to make the GIT repository self-contained, which means all information, including content, metadata, and configuration, is stored in one central place, so that all publish operations can be done by manipulating the GIT repo. But there will always be some configuration that is considered "expensive" and cannot be done by manipulating the GIT repo; this configuration needs to be specified in #2.
Under a GIT repository, there will be multiple docsets. A docset is a group of documents that share the same configuration, like template, base URL, etc.
Each docset is a folder, under which there is a docset.json that defines the basic properties of the docset.
Under the root folder of a repository, there is a siteCatalog.json that defines all docsets in the repo. A docset must be added to siteCatalog.json to be detected by open publishing.
Here is a diagram that illustrates the structure of an open publishing GIT repo.
/
|- siteCatalog.json
|- docset1/
| |- docset.json
| |- a.md
| \- b.md
\- docset2/
|- docset.json
|- c.md
\- d.md
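As an illustration, the two configuration files might carry information along these lines (the field names below are assumptions for illustration only, not the actual open publishing schema):

// Hypothetical shape of siteCatalog.json: lists the docset folders in the repo.
interface SiteCatalog {
  docsets: string[];      // relative paths of docset folders, e.g. ["docset1", "docset2"]
}

// Hypothetical shape of docset.json: basic properties shared by all documents in the docset.
interface DocsetConfig {
  name: string;           // display name of the docset
  base_url: string;       // base URL the docset is published under
  template?: string;      // template used by the rendering service
}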
After a GIT repository is created, the user must manually configure it to be monitored by open publishing. This can be done through the open publishing management portal or API.
A docset must be added to the monitor list manually; open publishing won't monitor docset creation/deletion and automatically provision them. Docset creation is supposed to be a "heavy" operation, so it needs to be configured manually.
After this step, for all changes in a monitored docset, open publishing will automatically build it and publish the content to the MSDN web site.
- Build
A docset contains the following files:
- docset.json, which defines the basic properties of the docset.
- Document source files. Currently we only support markdown files, but we're open to supporting more formats (e.g., HTML, ReST) in the future.
- toc.md, which describes the TOC of the docset.
- Support files, which are used by the document source files, e.g., images, videos, etc.
Here is a diagram that illustrates the structure of a docset:
/docset
|- docset.json
|- toc.md
|- docs/
| |- get-started.md
| \- overview.md
\- images/
|- overview.png
\- favicon.png
Open publishing supports GitHub Flavored Markdown, which is a superset of standard Markdown. Open publishing will also support additional extensions, like content include (more extensions to be added later).
A markdown file can also contain metadata. We employ YAML frontmatter to store metadata. This format is already widely used on GitHub and is rendered as a table in GitHub's markdown preview.
Here is one example:
---
title: Get Started
toc_title: Get Started
---
body of the content...
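As a rough sketch of how the build could separate the metadata from the body (using js-yaml; the helper below is illustrative and not the actual build code):

import * as yaml from "js-yaml";

// Split YAML frontmatter from the markdown body.
function splitFrontmatter(source: string): { metadata: Record<string, unknown>; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(source);
  if (!match) {
    return { metadata: {}, body: source };
  }
  const metadata = (yaml.load(match[1]) ?? {}) as Record<string, unknown>;
  return { metadata, body: source.slice(match[0].length) };
}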
toc.md is used to define the TOC (table of contents) structure of a docset.
We use headers to specify the level of a TOC entry. For example:
# Tutorial
## Step 1
## Step 2
## Step 3
The above example illustrates a parent topic "Tutorial" with three children Step 1-3.
We use standard Markdown link syntax to specify the target topic of the toc node. For example:
# [Tutorial](tutorial.md)
## [Step 1](step-1.md)
## [Step 2](step-2.md)
## [Step 3](step-3.md)
If a TOC node is not a link, it becomes a container node (it can contain children but cannot be clicked).
A TOC node can also be an external link (for example, www.bing.com).
Since markdown only supports 6 levels of headers, we'll only support 6 levels of TOC for now. This can be extended in the future.
toc.md cannot contain arbitrary markdown content. For example, you cannot have images in it. Any content that is not a header will lead to a build error.
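A minimal sketch of how such a toc.md could be parsed into a tree (the node shape and error handling are assumptions, not the final implementation):

interface TocNode {
  title: string;
  href?: string;                    // absent for container nodes
  children: TocNode[];
}

// Parse toc.md: every non-empty line must be a header, optionally containing a markdown link.
function parseToc(tocMarkdown: string): TocNode[] {
  const roots: TocNode[] = [];
  const stack: { level: number; node: TocNode }[] = [];
  for (const line of tocMarkdown.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const header = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!header) throw new Error(`toc.md only allows headers, found: ${line}`);
    const level = header[1].length;
    const link = /^\[(.+)\]\((.+)\)$/.exec(header[2].trim());
    const node: TocNode = link
      ? { title: link[1], href: link[2], children: [] }
      : { title: header[2].trim(), children: [] };
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    if (stack.length === 0) roots.push(node);
    else stack[stack.length - 1].node.children.push(node);
    stack.push({ level, node });
  }
  return roots;
}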
In open publishing, we use GIT branches to maintain different copies of the same content. The same topic in different branches will be published to different URLs on the rendering site. This can be used in several scenarios:
- Stage/promote. The user can use one branch as a staging branch and merge it into the live branch as a promote operation.
- Private working branch. The user can create his own branch to hold files that are still being written, and merge it into a common branch when they're ready. Files in a private branch can still be validated/previewed by open publishing.
Among all branches, there should be only one branch that is exposed to external users, as the other staging/working branches are only used by writers for internal testing. This branch is called the "live" branch. The live branch can be identified by a special branch name. Content in the live branch will be published to the MSDN live site. Content in other branches will still be published, but only to the internal MSDN site.
The live branch can also be used to store branch-related configuration. For example, if we want a configuration that controls which branches will be published (other branches will only be validated, as publishing may be an "expensive" operation), this configuration can be stored in the live branch.
It'll be a common case that a user wants to try new features (like a new markdown extension) in his working branch while keeping the live branch on stable features. We're going to support this by allowing the user to specify the build toolset they want to use (in the GIT repo). We'll release multiple versions of our build tool and let users choose which one they want to use. After they test the feature in a working branch, they can merge the content to the live branch, and the live branch will also upgrade to use the new build tool.
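For example, the per-branch toolset choice could be expressed in a configuration file checked into each branch, along these lines (the file location and field name are assumptions):

// Hypothetical fragment of a repo-level configuration file, per branch.
const workingBranchConfig = {
  build_toolset_version: "2.0-preview",   // working branch tries the new markdown extension
};
const liveBranchConfig = {
  build_toolset_version: "1.5",           // live branch stays on the stable toolset
};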
Open publishing will monitor changes in the GIT repository and automatically build and publish them. We'll leverage webhooks provided by the GIT server to get notified when an event happens on the GIT repo.
Our goal is to support any GIT repository, but GIT server implementations vary from one to another. Our principle is to use standard GIT operations and minimize server-side GIT dependencies as much as possible, but there will always be server-specific features like webhooks. Our first priority is to support GitHub and Visual Studio Online.
Both GitHub and Visual Studio Online provide webhooks (GitHub, VSO) to get notifications on repo changes.
In case webhooks are not reliable (VSO webhooks have known reliability issues), polling is always an alternative solution.
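A minimal sketch of the webhook entry point on the build web service (Express-style; the route, payload fields, and dispatcher call are assumptions for illustration):

import express from "express";

const app = express();
app.use(express.json());

// Hypothetical push-event endpoint; GitHub and VSO payloads differ, so a real
// implementation would normalize them before dispatching.
app.post("/webhooks/git-push", (req, res) => {
  const { repositoryUrl, branch, commitId } = req.body;   // assumed normalized fields
  enqueueBuildRequest({ repositoryUrl, branch, commitId });
  res.status(202).end();                                   // accept quickly; the build is asynchronous
});

// Placeholder for the real dispatcher call.
function enqueueBuildRequest(request: { repositoryUrl: string; branch: string; commitId: string }): void {
  console.log("build request queued", request);
}

app.listen(3000);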
For one repo, all changes will be built sequentially. There are two ways to build GIT changes:
- For each change in the GIT repo, start a build.
- Rolling build: aggregate all changes since the last build and build them together.
#2 may be more efficient but #1 will be more accurate. For now we will use #1 to build changes.
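A sketch of how per-repo sequential builds could be enforced under option #1 (one build per change), assuming a simple in-process promise chain per repository:

// Serialize builds per repository: each new change waits for the previous build of the
// same repo to finish, while different repos can build in parallel.
const repoQueues = new Map<string, Promise<void>>();

function scheduleBuild(
  repoUrl: string,
  commitId: string,
  build: (commit: string) => Promise<void>
): Promise<void> {
  const previous = repoQueues.get(repoUrl) ?? Promise.resolve();
  const next = previous
    .then(() => build(commitId))
    .catch((err) => console.error(`build of ${repoUrl}@${commitId} failed`, err));
  repoQueues.set(repoUrl, next);
  return next;
}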
We'll use an open source markdown library (marked.js is a good candidate) to transform markdown files into HTML files. There will be additional work to develop transformations for markdown extensions.
The transform also includes other tasks, including:
- Validation: validate source files and generate a validation report
- Generate TOC
- Process metadata
- Generate dependency information for incremental build
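The core markdown-to-HTML step could be as simple as the following (assuming marked's parse API; extension hooks, validation, and TOC/metadata processing are left out):

import { marked } from "marked";

// Transform a single markdown source (frontmatter already stripped) into HTML.
function transformMarkdown(markdownBody: string): string {
  return marked.parse(markdownBody) as string;
}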
The output of the transform will be in a folder structure. There are a few benefits of using files as output:
- It can run both locally and on the build server, without taking additional dependencies like a database.
- Build results can easily be xcopied for troubleshooting.
- It's easy to build version control on files for transaction control.
// Not clear for now
To increase publish system throughput and reduce publish response time, it's important to have an incremental build system.
There are several common scenarios that can benefit from incremental build:
- User only modifies a few files in a GIT repo; obviously there is no need to build the whole repo.
- User merges from branch A to branch B; there is no need to build branch B, as branch A should already be built, so we just need to copy the build output from branch A to B.
- User creates a new branch B from branch A; similar to #2, we only need to copy the build result from branch A.
The top-down approach is like a traditional source code build: the build scans all files in the repo to figure out which files need a rebuild.
For each file, the build system will maintain the commit hash of the last build result, which can be used to determine whether the file has changed since the last build.
For a source file, we'll check the following to determine whether the file needs a rebuild:
- The commit hash of the file itself.
- If #1 differs, rebuild. Otherwise, dependency information is maintained in the last build output; get the dependency information and check the commit hashes of all dependencies.
Build output can be organized in a folder structure so that we can quickly get historical build output by composing a path. For example:
//output/<commit_hash>/a.html
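A sketch of the top-down check described above (the commit-hash lookups and dependency store are assumed abstractions over GIT and the previous build output):

interface BuildState {
  lastBuiltCommit(path: string): string | undefined;   // commit hash recorded at the last build
  currentCommit(path: string): string;                  // latest commit hash that touched the file
  dependenciesOf(path: string): string[];               // dependency info saved in the last build output
}

// Rebuild if the file itself or any of its dependencies changed since the last build.
function needsRebuild(path: string, state: BuildState): boolean {
  const lastBuilt = state.lastBuiltCommit(path);
  if (lastBuilt === undefined || lastBuilt !== state.currentCommit(path)) {
    return true;
  }
  return state.dependenciesOf(path).some(
    (dep) => state.lastBuiltCommit(dep) !== state.currentCommit(dep)
  );
}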
The bottom-up approach is to first detect which files have changed, then find all files affected by the changed files and rebuild them. This will be more efficient for small changes in large repos, as scanning a repo may still take a long time.
To achieve this, we first need to get the files changed since the last build. This can easily be obtained by diffing two commits in the GIT repo.
Then we need to figure out the files affected by the changed files. To get this, reverse dependency information is needed; we can store this information when building the source files.
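A sketch of the bottom-up expansion using a reverse dependency map (the map would be produced during the previous build; names are illustrative):

// Expand the set of changed files into the full set of files that need a rebuild,
// following reverse dependencies (file -> files that include/use it) transitively.
function affectedFiles(changed: string[], reverseDeps: Map<string, string[]>): Set<string> {
  const toRebuild = new Set<string>(changed);
  const queue = [...changed];
  while (queue.length > 0) {
    const file = queue.shift()!;
    for (const dependent of reverseDeps.get(file) ?? []) {
      if (!toRebuild.has(dependent)) {
        toRebuild.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return toRebuild;
}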
The real implementation could be a combination of both approaches: the bottom-up approach is more efficient but its implementation may be error-prone, while the top-down approach is less efficient but more accurate.
The following GIT operations should be interpreted correctly to have an efficient incremental build (if not, the build output is still correct but performance may be affected):
- File rename
- GIT new branch
- GIT merge branch
Since we don't have another ID other than the file path, we cannot get 100% accurate file rename information (GIT is not accurate on file renames). But as rename on live is a corner scenario, we'll just not support it. A rename will simply be a delete + add on the delivery service (for build, detecting file renames would only be a performance benefit).
// Still pending on how TOC will be implemented at delivery service
The delivery service will provide a REST API to push build output. When talking with the delivery service, we will use the path to uniquely identify a file. As described in the previous section, if the path of a file changes, the delivery service will treat it as a new file.
The delivery service has a concept of repository, which maps to a docset on the build side.
The delivery service also has a concept of branch for a repo, which maps to the GIT branch on the build side.
As a result of incremental build, the build will only push changed documents to the delivery service. If the delivery service supports the following operations, the build service can leverage them to improve publish performance; if not, the build will translate them into CRUD operations:
- File rename
- Branch merge
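As an illustration, pushing one changed document could reduce to a single REST call keyed by path (the endpoint shape below is an assumption; the actual API is owned by the delivery service):

// Hypothetical delivery REST call; the path uniquely identifies a document within a
// delivery repository + branch.
async function pushChangedDocument(
  deliveryBaseUrl: string,
  repo: string,
  branch: string,
  path: string,
  html: string
): Promise<void> {
  const url = `${deliveryBaseUrl}/repos/${repo}/branches/${branch}/documents/${encodeURIComponent(path)}`;
  const response = await fetch(url, {
    method: "PUT",                                   // create or update by path
    headers: { "Content-Type": "text/html" },
    body: html,
  });
  if (!response.ok) {
    throw new Error(`delivery push failed for ${path}: ${response.status}`);
  }
}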
// To be written
- Notification
Publish is an asynchronous operation. It's triggered by a GIT push, but it won't block the push operation, as publish may take a long time. So there has to be a way to notify the user when the publish is completed (or has failed).
The most basic notification functionality we will provide is a publish status API. The user can use the last commit hash to call the publish status API to get the publish status (succeeded, failed, processing) and the publish report.
We can also consider supporting publish status queries using the following criteria:
- Branch
- Tag
- GIT Revision range (01234567..89abcdef)
- File path to query the status of a single file
The same query functionality will also be available on management portal.
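For example, a status query by commit hash against the hypothetical API could look like this (the route and response shape are assumptions for illustration):

// Hypothetical publish status response.
interface PublishStatus {
  status: "succeeded" | "failed" | "processing";
  reportUrl?: string;                 // link to the full publish report
}

async function getPublishStatus(apiBaseUrl: string, repo: string, commitHash: string): Promise<PublishStatus> {
  const response = await fetch(`${apiBaseUrl}/repos/${repo}/publishes/${commitHash}`);
  if (!response.ok) {
    throw new Error(`status query failed: ${response.status}`);
  }
  return (await response.json()) as PublishStatus;
}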
Besides the query API, we will also provide push functionality so that users are notified instead of polling:
- Email
- Webhook
- API and Management Portal
As described in previous sections, the user will mainly interact with open publishing through the GIT repo. For example, to publish, the user just needs to do a GIT push. To configure a docset, the user just updates the configuration file in the GIT repo.
But there are still some operations that need to be done outside the GIT repo, like provisioning a repo and, as described in the previous section, querying a build.
All these operations can be done through the open publishing API.
The open publishing API will mainly support the following operations:
- Create a repo. Given a GIT repo URL, creates a corresponding "repo" in open publishing.
- Configure the docset watchlist. Given an open publishing repo, add/remove docsets to be watched by open publishing.
- Trigger a publish. Though a publish can be triggered by a GIT push, there will be some cases where the user wants to publish manually.
This API provides this functionality. The input parameters could be:
- Branch
- Tag
- Commit hash
- Query publish status and get the publish report, as already described in the previous section.
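A rough sketch of what this API surface could look like (method names, parameters, and shapes below are assumptions, not a finalized contract):

// Hypothetical open publishing API surface, expressed as TypeScript signatures.
interface OpenPublishingApi {
  // #1: given a GIT repo URL, create the corresponding open publishing repo.
  createRepo(gitRepoUrl: string): Promise<{ repoId: string }>;
  // #2: manage the docset watchlist of an existing open publishing repo.
  addDocsetToWatchlist(repoId: string, docsetPath: string): Promise<void>;
  removeDocsetFromWatchlist(repoId: string, docsetPath: string): Promise<void>;
  // #3: manually trigger a publish by branch, tag, or commit hash.
  triggerPublish(repoId: string, target: { branch?: string; tag?: string; commitHash?: string }): Promise<void>;
  // #4: query publish status and get the publish report (see the Notification section).
  getPublishStatus(repoId: string, commitHash: string): Promise<{ status: string; reportUrl?: string }>;
}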
The management portal is just a graphical interface that provides the same functionality as the API.