ESGF_Transition
The transition from the ESG gateway-centric system, based on gateway V1.3.4, to the ESGF peer-to-peer (P2P) system will proceed in phases. The milestones associated with each phase must be met before moving to the next phase.
For the purpose of transitioning, we distinguish between:
- _data_ nodes, which provide only data-centric services, such as THREDDS, GridFTP, and possibly LAS
- full-service nodes (aka _index_ nodes), which provide a full set of ESGF services, including index (search), identity provider, attribute services, and possibly the ESGF front-end application in addition to data services.
It is expected that current gateway sites will transition to full-service nodes, while current non-gateway nodes will upgrade to P2P data nodes. Upgrade to full-service nodes can proceed in parallel at the existing sites, with several caveats:
- As P2P index nodes join the federation, they will crawl THREDDS catalogs only for non-index sites. The index nodes must recognize which sites are in the federation and adjust their periodic crawling to reflect the state of the federated system.
- The cutover period of Phase 4 depends on all nodes having completed Phases 1-3 of the transition.
Phase 1: The P2P system is available to ESGF developers (initially) and friendly users for testing.
Milestones:
- The system is on a stable platform. Version updates are fully tested on a separate test platform before installation on the beta test system (pcmdi9). One possible configuration is to incorporate the current federation-wide indexing from pcmdi11 into a VM running on pcmdi9. The test platform is on a system other than pcmdi9 and pcmdi11.
- Index nodes periodically crawl THREDDS catalogs on current ESG production data nodes.
- Registration on the beta test system is limited to a select set of users. Test users can use their pcmdi3 OpenIDs.
- Notification of the transition is made on the pcmdi3 gateway.
- The P2P web interface displays a banner indicating that it is a test system only and that not all data nodes are accessible through the test system.
Note: It is NOT assumed that data is downloadable from all existing data nodes at this point. That functionality becomes available in Phase 3.
Phase 2: Publication and PKI functionality is enabled on a single P2P data node. A wider set of users can access the P2P front-end and authenticate with their existing OpenID. This phase consists of two parts:
- 2a: The minimal set of steps required to generate a node transition guide and allow other nodes to begin transitioning.
- 2b: Technical requirements that can proceed in parallel with Phase 3.
2a) Milestones (prerequisite to Phase 3):
- Publication is fully tested and functional on a P2P data node (pcmdi9). Publication permission can be granted to specific non-root users.
- Non-replicated CMIP5 datasets are available to end users on both existing (pcmdi7) and P2P (pcmdi9) data nodes.
- Production publication to the existing gateway(s) continues on current data nodes. In particular, publication of replicated datasets continues on pcmdi7.
- Index nodes periodically crawl THREDDS catalogs on current ESG production data nodes, as before. The test index node on pcmdi9 shows availability of PCMDI CMIP5 datasets on both pcmdi7 and pcmdi9.
- A node installation guide is available, detailing steps for nodes to reconfigure for P2P publication and transfer existing node databases and catalogs. The transition can be done in either of two ways: install a new P2P node on a separate platform, or upgrade an existing platform to P2P compatibility. The guide describes, among other things, how to structure certificates and truststores on the data node to accommodate publication to either system.
- The user database is periodically transferred from the existing IDP on pcmdi3 to the new IDP on pcmdi9 (see the sketch after this milestone list). An IDP transition guide is available, which details steps to transfer existing user accounts.
- A Bugzilla bug-reporting system is enabled on the PCMDI P2P system.
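A minimal sketch of how such a periodic account transfer might be automated, assuming the gateway user accounts are held in a PostgreSQL database. The database, schema, and role names below are illustrative placeholders, not the actual gateway schema; the IDP transition guide governs the real procedure:

% pg_dump -U dbsuper --clean -t security.user -t security.group gateway_db | ssh pcmdi9 "psql -U dbsuper esgf_security"

Run from cron (for example nightly), something of this shape would keep the pcmdi9 IDP's copy of the accounts reasonably current until the cutover scan in Phase 4.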
2b) Milestones (can proceed in parallel with Phase 3):
- Publication supports P2P search capabilities.
- The CIM viewer is integrated and tested.
- User registration supports:
  - password change
  - email change
  - ability to enroll and view group membership
- The administrator interface supports administering groups and users:
  - control of group membership for non-automatic groups
Phase 3: Publication and PKI functionality is enabled on all P2P data nodes. P2P IDP nodes have mechanisms in place to quickly transfer user information from existing IDPs. Data nodes are configured such that they can publish to an existing gateway or a P2P index node, with the switch requiring only minimal changes to configuration files.
Milestones:
- All data nodes are transitioned to be P2P compatible. The installation guide is leveraged to assist node administrators with building or upgrading to a P2P data node.
  - A status page tracks the progress of each data node.
- All data nodes complete a compatibility checklist, certifying readiness to move to the P2P system.
- Production publication to existing gateways continues. P2P data nodes may publish to a P2P index node, in which case their THREDDS catalogs are not scanned.
- A fully populated search database, accessible by the search API, is available for end-users on all index nodes.
Phase 4: Transition to the P2P system is complete.
Milestones:
There is a cutover period, as short as possible, during which:
- New P2P installations are frozen.
- Publication to existing gateways is frozen.
- A cutover scan (crawl) of existing THREDDS catalogs is executed. This preserves the state of the search index at the point of cutover. After this point no more catalog crawling will take place; it will be up to data node publishers to maintain the index of catalogs for their institutions.
  - Note: With one exception, the final catalog scan will be run on pcmdi11, as currently happens.
  - The exception is the University of Tokyo node, whose scan will be run on pcmdi9. This will allow modification of existing entries.
  - For catalogs scanned on pcmdi11, data node publishers will be responsible for republishing data on either pcmdi9 or another index node. When this republication is complete, the existing index on pcmdi11 will be removed to avoid duplication of catalog indices.
- Creation of user accounts on existing gateways is frozen. Users are redirected to one or more P2P IDPs. A cutover scan of user accounts is made at each IDP.
- All data nodes switch to a tested P2P configuration. This means that the publication address as defined by hessian_service_url in esg.ini points to a P2P index node such as pcmdi9.
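For illustration, the switch amounts to pointing the publication service URL in esg.ini at the chosen index node. The service path shown below is an assumption based on a typical P2P index node layout and should be taken from the node installation guide rather than copied verbatim:

hessian_service_url = https://pcmdi9.llnl.gov/esg-search/remote/secure/client-cert/hessian/publishingService

Keeping the previous gateway value commented out in esg.ini makes it easy to fall back if the cutover has to be rolled back.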
At the end of the cutover period:
- Existing gateways are deprecated. pcmdi3 will be maintained as long as needed to support authentication of pcmdi3 OpenIDs. pcmdi7 will be maintained as long as needed to verify membership in the CMIP5 Research group. However, pcmdi3 and other non-P2P gateways will eventually be shut down when no longer needed.
- Users can create new accounts on P2P systems, but should not create accounts on non-P2P systems. Accounts on pcmdi3 systems will no longer be mirrored on pcmdi9.
- Publication of new data is to P2P index nodes only; the same applies to modification or deletion of existing datasets. The P2P distributed index will diverge from the non-P2P search database.
For the most part it is not strictly necessary to republish existing catalogs to a P2P index. The existing catalogs can be crawled (scanned) on the index node, which accomplishes essentially the same operation as republishing. The notable exception is when a new P2P data node has been installed on a different host. In that case it is necessary to republish data to a P2P index server:
When data is republished it is important to retain the existing dataset version numbers. This is done by specifying a version list when esgpublish is run. The list is a text file, each line of which has the form 'dataset_id|version'. Note that the dataset_id does not include the version number. A straightforward way to generate a version list is to use the esgquery_index script to query the latest version numbers from an index node. For example, to obtain a version list for CCCMA:
% esgquery_index -q institute=CCCMA,replica=false,latest=true,data_node=dap.cccma.uvic.ca --fields master_id,version -d\| --format wide | cut -d\| -f 3,4 | sort >! cccma_versions.txt
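The resulting file contains one 'dataset_id|version' pair per line. The entries below are purely illustrative placeholders; actual identifiers and version numbers come from the query above:

cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1|20120410
cmip5.output1.CCCma.CanESM2.rcp45.mon.atmos.Amon.r1i1p1|20120410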
If a mapfile is available for existing datasets, it can be reused. Otherwise, to generate one:
% esgscan_directory --read-files --project cmip5 -o mapfile.txt directory [directory ...]
where data files are located in the listed directories.
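Each mapfile line associates a dataset identifier with one of its files, the file size, and optional metadata fields. The line below is illustrative only; the exact set of trailing fields depends on the scan options:

cmip5.output1.CCCma.CanESM2.historical.mon.atmos.Amon.r1i1p1 | /data/cmip5/output1/CCCma/CanESM2/historical/mon/atmos/Amon/r1i1p1/tas/tas_Amon_CanESM2_historical_r1i1p1_185001-200512.nc | 251871876 | mod_time=1333238400.0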
Once the mapfile and version list are created, publication is straightforward:
% esgpublish --map mapfile.txt --project cmip5 --version-list cccma_versions.txt
% esgpublish --map mapfile.txt --project cmip5 --version-list cccma_versions.txt --noscan --thredds
% esgpublish --map mapfile.txt --project cmip5 --version-list cccma_versions.txt --noscan --publish
Note that for performance reasons it is often useful to split a very large mapfile into smaller chunks, making sure that each dataset is contained in at most one mapfile. In general it's a good idea to limit the size of a mapfile to 50,000 files or less.
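One possible way to split a mapfile while keeping each dataset intact is to group lines by dataset identifier, writing one chunk per dataset. This is a minimal sketch: it assumes the first '|'-delimited field is the dataset_id and names each chunk after that id; the close() call avoids exhausting open-file limits when there are very many datasets, at some cost in speed:

% awk -F'|' '{id=$1; gsub(/ /, "", id); f=id ".map"; print >> f; close(f)}' mapfile.txt

The resulting per-dataset chunks can then be concatenated into larger mapfiles of up to roughly 50,000 files each, as long as no dataset is split across two chunks.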