Prepared by
M. Aderholz (MPI), K. Amako (KEK), E. Arderiu Ribera (CERN), E. Auge (L.A.L/Orsay), G. Bagliesi (Pisa/INFN), L. Barone (Roma1/INFN), G. Battistoni (Milano/INFN), J. Bunn (Caltech/CERN), J. Butler (FNAL), M. Campanella (Milano/INFN), P. Capiluppi (Bologna/INFN), M. Dameri (Genova/INFN), D. Diacono (Bari/INFN), A. di Mattia (Roma1/INFN), U. Gasparini (Padova/INFN), F. Gagliardi (CERN), I. Gaines (FNAL), P. Galvez (Caltech), C. Grandi (Bologna/INFN), F. Harris (Oxford/CERN), K. Holtman (CERN), V. Karimäki (Helsinki), J. Klem (Helsinki), M. Leltchouk (Columbia), D. Linglin (IN2P3/Lyon Computing Centre), P. Lubrano (Perugia/INFN), L. Luminari (Roma1/INFN), M. Michelotto (Padova/INFN), I. McArthur (Oxford), H. Newman (Caltech), S.W. O'Neale (Birmingham), B. Osculati (Genova/INFN), M. Pepe (Perugia/INFN), L. Perini (Milano/INFN), J. Pinfold (Alberta), R. Pordes (FNAL), S. Rolli (Tufts), T. Sasaki (KEK), L. Servoli (Perugia/INFN), R.D. Schaffer (Orsay), M. Sgaravatto (Padova/INFN), T. Schalk (BaBar), J. Shiers (CERN), L. Silvestris (Bari/INFN), G.P. Siroli (Bologna/INFN), K. Sliwa (Tufts), C. Stanescu (Roma3/INFN), T. Smith (CERN), C. von Praun (CERN), E. Valente (INFN), I. Willers (CERN), R. Wilkinson (Caltech), D.O. Williams (CERN)
Executive Summary
Laura Perini will contribute one page.
To be consistent with the PEP, this is not labelled as a chapter (you may think of it as Chapter 0).
Chapter 1: Introduction
This comes from Harvey Newman and should describe the structure of the document.
Chapter 2: (Deprecated) Progress Reports of the Working Groups
The Working Groups present Progress Reports. However, each Working Group now has a chapter of its own; this deprecated chapter is retained to show the PEP-style markup of the division one level below chapters.
2.1 Architecture
Blah
2.2 Testbed
Blah
2.3 Analysis and Simulation
Chapter 2: Progress Reports of the Architecture Working Group
2.1 Introduction
The basic task of the Architecture Working Group is to develop distributed computing system architectures for the LHC which can be modelled to verify their performance and viability. To carry out this task, the group considers the LHC analysis problem "in the large". We start with the general parameters of an LHC experiment, such as
The general picture that has emerged from these discussions is:
The primary motivation for a hierarchical collection of computing resources, called Regional Centres, is to maximize the intellectual contribution of physicists all over the world, without requiring their physical presence at CERN. An architecture based on RCs allows an organization of computing tasks that can take advantage of physicists no matter where they are located.
Next, a computing architecture based on RCs acknowledges the facts of life about network bandwidths and costs: short-distance networks will always be cheaper and offer higher bandwidth than long-distance (especially intercontinental) networks. A hierarchy of centres with associated data storage ensures that network realities will not interfere with physics analysis.
Finally, RCs provide a way to utilize the expertise and resources residing in computing centres throughout the world. For a variety of reasons it is difficult to concentrate resources (not only hardware but, more importantly, personnel and support) in a single location. An RC architecture will provide greater total computing resources for the experiments by allowing flexibility in how these resources are configured and located.
A corollary of these motivations is that the RC model allows one to optimize the efficiency of data delivery and access by making appropriate decisions on where to process the data. One important motivation for having such 'large' Tier1 RCs is to have centres with a critical mass of support people, while not proliferating centres, which would create an enormous coordination problem for CERN and the collaborations.
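The bandwidth argument above can be made concrete with a back-of-envelope comparison. The sketch below uses invented numbers (dataset size, reader count) purely for illustration; it is not a MONARC estimate, only a demonstration of why placing a replica at a Regional Centre reduces wide-area traffic.

```python
# Illustrative sketch (invented numbers): if N physicists in a region each
# read a dataset of size S, shipping one replica over the WAN to a Regional
# Centre and serving the N reads locally moves far less intercontinental
# data than having every read go directly to CERN.

def wan_traffic_tb(dataset_tb, n_readers, replicate_at_rc):
    """Total data crossing the wide-area network, in TB."""
    if replicate_at_rc:
        return dataset_tb            # one WAN transfer; all reads are local
    return dataset_tb * n_readers    # every read crosses the WAN

print(wan_traffic_tb(10, 50, replicate_at_rc=False))  # direct reads from CERN
print(wan_traffic_tb(10, 50, replicate_at_rc=True))   # reads via a Tier1 replica
```

The ratio is simply the number of local readers, which is why the gain grows with the size of the regional community.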
There are many open issues with this approach. Perhaps the most important involves the coordination of the various Tiers. While the group has a rough understanding of the scale and role of the CERN centre and the Tier1 RCs, it is much less clear whether we need Tier2 and special-purpose centres, and what their roles should be. There are also a variety of approaches to actually implementing a Tier1 centre: Regional Centres may serve one collaboration or several, and each arrangement has its advantages and disadvantages.
To keep its discussions well grounded in reality, the group has undertaken the following tasks, which are described in the MONARC Project Execution Plan (PEP):
2.2 Results from the Last Year
This year, the Architecture Working Group has produced three documents that have been submitted to the full collaboration and is beginning work on a fourth:
2.2.1 Report on Computing Architectures of Existing Experiments
This survey included:
Some of the most important lessons learned were:
2.2.2 Rough Sizing Estimates for a Computing Facility for a Large LHC Experiment
This document was prepared by Les Robertson of CERN IT. It attempts to summarize the rough capacities needed for the analysis of an LHC experiment and to derive from them the size of the CERN central facility and of a Tier1 Regional Centre. The information was obtained from estimates by CMS and cross-checked with ATLAS and with the MONARC Analysis Working Group. Some adjustments have been made to the numbers obtained from the experiments to account for overheads that are now measured but were not when the original estimates were made. While the result has not yet been reviewed by CERN management, it currently serves as our best indication of thinking on this topic at CERN, so we are using it as the basis for proceeding.
It is believed that CERN will be able to satisfy about half of the aggregate computing need of the LHC experiments; the remainder must come from elsewhere. The view expressed by the author is that it must come from a 'small' number of Tier1 Regional Centres, so that the problems of maintaining coherence and coordinating all the activities are not overwhelming. This sets the size of a Tier1 RC at 10-20% of the CERN centre in capacity.
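The arithmetic implied by these fractions can be spelled out. The sketch below uses only the two ratios quoted in the text (CERN covers about half of the aggregate need; each Tier1 RC provides 10-20% of the CERN centre's capacity); the total need is normalized to 1 and is a placeholder, not a figure from the report.

```python
# Back-of-envelope Tier1 count implied by the fractions in the text.
# total_need is normalized; only the ratios matter.

def tier1_sizing(total_need=1.0, cern_fraction=0.5,
                 tier1_fraction_of_cern=(0.10, 0.20)):
    """Return CERN's share, the remainder to be found elsewhere, and the
    implied range for the number of Tier1 RCs covering that remainder."""
    cern = total_need * cern_fraction
    remainder = total_need - cern
    lo, hi = tier1_fraction_of_cern
    n_max = remainder / (cern * lo)   # smaller centres -> more of them
    n_min = remainder / (cern * hi)   # larger centres  -> fewer of them
    return cern, remainder, n_min, n_max

cern, rest, n_min, n_max = tier1_sizing()
print(f"CERN share: {cern:.2f}; elsewhere: {rest:.2f}; "
      f"implied Tier1 count: {n_min:.0f}-{n_max:.0f}")
```

With these ratios the remainder is covered by roughly 5 to 10 Tier1 centres, which is consistent with the 'small number' the author argues for.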
2.2.3 Regional Centers for LHC Computing
Based on Les Robertson's estimates and the issues raised about the problems with distributed computing in the past by the survey Computing Architectures of Existing Experiments, we developed a framework for discussing Regional Centres and produced a document which gives a profile of a Tier1 Regional Centre.
This profile is based on facilities (and the corresponding capacities) and services (capabilities) which need to be provided to users. There is a clear emphasis on data access by users since this is seen as one of the largest challenges for LHC computing.
It is important to recognize that MONARC cannot, and does not wish to, dictate the details of the Regional Centre architecture. That is best left to the collaborations, the candidate sites, and CERN to work out case by case. MONARC wants to provide a forum for discussing how these centres will get started and develop, and can act as a facilitator in locating candidate centres and bringing them into the discussion.
The report describes the services that we believe CERN will supply to LHC data analysis (based on the work of Les Robertson and his team). These include:
CERN will have the original or master copy of the following data:
The regional centres will provide:
2.2.4 Report on Computing Architectures of Future Experiments
Work on this report is just beginning. It will include a study of BaBar at SLAC, CDF and D0 Run II at Fermilab, COMPASS at CERN, and the STAR experiment at RHIC. The approach will be to survey the available public literature on these experiments and to abstract information that is particularly relevant to LHC computing, supplemented where required by discussions with leaders of the computing and analysis efforts. There will not be an attempt to create complete, self-contained expositions of how each experiment does all its tasks. We will have a 'contact person' for each experiment who will be responsible for gathering the material and summarizing it for the report. Most of these contact persons are now in place. There will be an overall editor for the final report.
2.2.5 First meeting of Regional Centre Representatives:
On April 13, there was a meeting of representatives of potential Regional Centre sites. It was felt at this point that we had made good progress in understanding the issues of how Regional Centres could contribute to LHC computing and it was now time to share this with possible candidates, to hear their plans for the future, and to get their feedback on our discussions. The three documents discussed above, which had been made available in advance of the meeting, were summarized briefly. We then heard presentations from IN2P3/France, INFN/Italy, LBNL/US(ATLAS), FNAL/US(CMS), UK, Germany, KEK/Japan(ATLAS), Russia/Moscow. Transparencies of these presentations and a summary may be found at
While nothing is yet certain, it appeared that several Regional Centre candidates have a good chance of securing support to proceed, at a scale roughly equivalent to MONARC's profile of a Tier1 RC. It was also clear that there will be several styles of implementation of the RC concept. One variation is that several centres see themselves serving all four major LHC experiments, while others, especially in the US and Japan, will serve only a single experiment. Another is that some Tier1 RCs will be located at a single site, while others may themselves be somewhat distributed, although presumably quite highly integrated.
2.2.6 Technology Tracking
The main initiative in technology tracking was to take advantage of CERN IT efforts in this area. We heard a report on the evolution of CPU costs by Sverre Jarp of CERN, who serves on a group called PASTA that is tracking processor and storage technologies. We look forward to further such presentations in the future.
2.3 Goals and Milestones for the July-December period
2.3.1 Complete the Report on Computing Architectures of Future Experiments by mid-July
2.3.2 Produce the final document on the Regional Centres by the end of the year
2.3.3 Begin to develop realistic models
Begin the task of developing models of computing that can be simulated. Focus on simulations which emphasize the large-scale production, data management, and analysis issues. Address real-world issues such as priority assignments and scheduling.
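A model of the kind envisaged might start as small as the following sketch: priority-ordered jobs dispatched across tiered centres with finite capacity. All names, capacities, and job costs here are invented assumptions for illustration; a real MONARC model would add network transfer, data placement, and time evolution.

```python
import heapq

# Minimal sketch of priority-based job placement across tiered centres.
# Capacities and job costs are arbitrary illustrative units.

class Centre:
    def __init__(self, name, cpu_capacity):
        self.name = name
        self.cpu_capacity = cpu_capacity
        self.load = 0.0

    def can_run(self, job_cost):
        return self.load + job_cost <= self.cpu_capacity

    def run(self, job_cost):
        self.load += job_cost

def schedule(jobs, centres):
    """Dispatch (priority, cpu_cost) jobs in priority order (lower number =
    higher priority), placing each on the least-loaded centre with room."""
    queue = [(priority, i, cost) for i, (priority, cost) in enumerate(jobs)]
    heapq.heapify(queue)
    placement = {}
    while queue:
        priority, i, cost = heapq.heappop(queue)
        candidates = [c for c in centres if c.can_run(cost)]
        if not candidates:
            placement[i] = None            # no room: job must wait
            continue
        target = min(candidates, key=lambda c: c.load)
        target.run(cost)
        placement[i] = target.name
    return placement

centres = [Centre("CERN", 100.0), Centre("Tier1-A", 20.0), Centre("Tier1-B", 20.0)]
jobs = [(0, 15.0), (1, 30.0), (2, 10.0)]   # (priority, cpu cost)
print(schedule(jobs, centres))
```

Even this toy version exposes the issues named above: which tier a job should land on, and what happens when a high-priority job finds every centre full.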
Chapter 3: Progress Reports of the Testbed Working Group
The Working Groups present Progress Reports.
Chapter 4: Progress Reports of the Analysis and Simulation Working Groups
The Working Groups present Progress Reports.
Chapter 5: Workplan and Schedule
Laura
Chapter 6: Ideas for Phase Three
Harvey
Conclusions
(Is this another chapter or a sub section of Ideas?) Harvey
Chapter 7: References
The references (in the PEP) were more like a list of further reading and were not necessarily cited in the text. This makes editing a lot easier. The hypertext links worked and were presented in plain text for the use of a reader with a printed copy.
If we cite MONARC internal documents then they should be in a publicly accessible area. Do we have a mechanism to store and distribute printed copies?