09/14/2006, 9:30 AM - 11:00 AM
Speakers:
Anirban Chakrabarti, Infosys Technologies
Peter Brezany, University of Vienna, Austria
A Min Tjoa, Vienna University of Technology
Ivan Janciak, University of Vienna, Austria
Towards High Productivity Analytics on the Grid
The mission of the DARPA research and development program "High Productivity Computing Systems (HPCS)" is to create new generations of high-end programming environments, software tools, architectures, and hardware components in order to implement a new vision of high-end computing. In our presentation, we report on the realization of the HPCS vision in our analytics infrastructure, GridMiner. In this context, we consider two productivity aspects, which are discussed below.
1. Productivity of the development of analytics services.
There are several programming language systems, e.g., Chapel, Unified Parallel C, and Titanium, which focus on productivity, in particular by combining the goal of highest possible object code performance with that of programmability offered by a high-level user interface. GridMiner's analytics tasks (classification, clustering, etc.) are being developed in Titanium [1] and deployed as Globus 4 services. Titanium is an explicitly parallel dialect of Java developed at UC Berkeley to support high-performance scientific computing on large-scale multiprocessors.
As an example, we will describe the development of a Grid service for scalable decision tree construction. To the best of our knowledge, no effort has so far been devoted to building classifiers based on a federation of Grid high-performance computing resources. Our service can be used either autonomously (it can run in parallel on a supercomputer) or as a building block for composing distributed and parallel classifiers. It leverages concepts introduced by SPRINT [2] but uses a different approach to dataset partitioning and to workload and task assignment. In our approach, attributes or attribute groups are located at different worker nodes. Each worker node works on its local data, and a master node builds the global classifier. The scalability and performance of the prototype implementation were examined, and the results will be presented.
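The attribute-partitioned scheme described above can be illustrated with a minimal Python sketch. It is not the Titanium service itself: the function names, the Gini criterion, the toy data, and the assumption that class labels are replicated at every worker are all illustrative choices. Each "worker" evaluates candidate splits only on the attributes it holds, and the "master" keeps the best proposal.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_local_split(columns, labels):
    """Worker step: scan only the attributes stored on this worker.

    columns maps attribute name -> list of numeric values, one per record.
    Returns (weighted_gini, attribute, threshold), or None if no attribute
    can separate the records.
    """
    n = len(labels)
    best = None
    for name, values in columns.items():
        order = sorted(range(n), key=lambda i: values[i])
        for k in range(1, n):
            lo, hi = values[order[k - 1]], values[order[k]]
            if lo == hi:
                continue  # no threshold fits between equal values
            left = [labels[i] for i in order[:k]]
            right = [labels[i] for i in order[k:]]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            candidate = (score, name, (lo + hi) / 2)
            if best is None or candidate < best:
                best = candidate
    return best

def choose_global_split(worker_columns, labels):
    """Master step: gather one proposal per worker and keep the best one."""
    proposals = [best_local_split(cols, labels) for cols in worker_columns]
    proposals = [p for p in proposals if p is not None]
    return min(proposals) if proposals else None

# Toy data: two workers, each holding one attribute of the same six records.
labels = ["no", "no", "yes", "yes", "yes", "no"]
worker_a = {"age": [23, 25, 40, 45, 52, 30]}
worker_b = {"income": [30, 80, 35, 90, 40, 85]}
split = choose_global_split([worker_a, worker_b], labels)
```

In this sketch only one small tuple per worker travels to the master for each tree node. A real service would then partition the records on the chosen split and repeat the step for each child node.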
2. Productivity of the data exploration process.
Analytics processes are not implemented as monolithic codes. Instead, standalone processing phases are combined to process data and extract knowledge patterns in various ways. Analytics applications can therefore be viewed as complex workflows, which are highly interactive and may involve the following processes: data cleaning, data integration, data selection, modelling (applying a data mining algorithm), and post-processing of the mining results (e.g., visualization). There are many possible choices concerning the functionality and parameters of each process and the composition of processes into a workflow, and only some combinations are valid. Novices and data mining specialists alike need automatic or semi-automatic support in constructing valid and efficient workflows. Additional challenges arise with the ongoing application of Grid technologies to data mining. The aim of our work is to automate this workflow construction process as much as possible using Semantic Web technologies. Our ontology-based workflow construction framework, called OntoGridMiner, will be presented, and the scientific rationale for its design will be discussed. The framework is implemented on top of the GridMiner interactive workflow management system.
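The idea that only some process combinations are valid, and that a valid workflow can be constructed automatically, can be sketched as follows. This is not OntoGridMiner: a plain Python dictionary stands in for the ontology, and every service name and data type in it is hypothetical.

```python
# Hypothetical service descriptions: name -> (required input, produced output).
# In an ontology-based system these facts would come from the ontology.
SERVICES = {
    "clean": ("raw_table", "clean_table"),
    "integrate": ("clean_table", "integrated_table"),
    "select": ("integrated_table", "training_set"),
    "decision_tree": ("training_set", "model"),
    "visualize": ("model", "report"),
}

def is_valid(workflow, start="raw_table"):
    """A workflow is valid if each step accepts what the previous one produced."""
    current = start
    for step in workflow:
        needs, gives = SERVICES[step]
        if needs != current:
            return False
        current = gives
    return True

def construct(start, goal, used=()):
    """Depth-first search for a chain of services that turns start into goal."""
    if start == goal:
        return []
    for name, (needs, gives) in SERVICES.items():
        if needs == start and name not in used:
            rest = construct(gives, goal, used + (name,))
            if rest is not None:
                return [name] + rest
    return None  # no chain of services reaches the goal

plan = construct("raw_table", "model")
```

Here `plan` is a chain from raw data to a model, while a hand-written workflow such as `["clean", "decision_tree"]` is rejected by `is_valid` because the classifier does not accept the output of the cleaning step.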
[1] K. A. Yelick, et al. Titanium: A High-Performance Java Dialect. Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
[2] J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In T. M. Vijayaraman et al. (eds.) Proc. 22nd Int. Conf. Very Large Databases, pages 544-555. Morgan Kaufmann, 3-6 1996.
Virtualized Credential and Policy Management in Inter-domain Grid
Enterprises see Grid computing as a technology of enormous potential. However, several issues require immediate attention before the Grid can become an important component of the IT infrastructure. One of them is Grid application migration, in which legacy applications are moved to the Grid environment. In practice, most work on Grid application migration is ad hoc. Migration typically involves significant investment in domain knowledge and in specialists' judgment on application analysis and re-engineering, and the performance improvement is not predictable. It is not surprising, therefore, that most enterprises are currently looking at "low-hanging fruit", that is, applications that can be grid-enabled without significant investment in analysis. To take the Grid to enterprises, however, effort is needed to analyze the performance of legacy code running on a Grid-based infrastructure. The approach described in this paper attempts to fill this void through the Grid Application Migration Framework (GAMF). The framework consists of three parts: the Grid Code Analyzer (GCA), the Grid Task Generator (GTG), and the Grid Simulator (GS). The GCA takes a legacy program written in C, C++, or Java as input and generates a Directed Acyclic Graph (DAG) that depicts both task and data dependencies among the components of the program. In the GTG, the DAG is analyzed further, and Grid tasks of proper granularity are generated using a set of DAG reduction and clustering algorithms. The tasks generated in this step are the parallel tasks that can be placed on the Grid infrastructure. The GS simulates the actual Grid execution by taking the reduced task graph as input and scheduling the tasks on different processors in the Grid. The performance data is then analyzed to study the benefits of porting the application to the Grid.
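The simulation step described above, scheduling a reduced task graph on a set of processors and comparing the result with sequential execution, can be illustrated with a minimal Python sketch. It is not the GAMF Grid Simulator: the greedy list-scheduling policy, the task names, and the costs are assumptions made for illustration, and communication costs are ignored.

```python
def simulate(costs, deps, processors):
    """Greedy list scheduling of a task DAG; returns the makespan.

    costs: task name -> execution time
    deps:  task name -> set of tasks that must finish first
    """
    finish = {}                   # task -> finish time
    free_at = [0] * processors    # time at which each processor becomes idle
    pending = set(costs)

    def data_ready(task):
        # Earliest time at which all predecessors of the task have finished.
        return max((finish[d] for d in deps.get(task, ())), default=0)

    while pending:
        ready = [t for t in pending
                 if all(d in finish for d in deps.get(t, ()))]
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        # Pick the ready task whose inputs are available first (ties by name).
        task = min(ready, key=lambda t: (data_ready(t), t))
        # Run it on the processor that becomes idle first.
        proc = min(range(processors), key=lambda p: free_at[p])
        start = max(free_at[proc], data_ready(task))
        finish[task] = start + costs[task]
        free_at[proc] = finish[task]
        pending.remove(task)
    return max(finish.values(), default=0)

# Hypothetical reduced task graph: two independent stages between parse and merge.
costs = {"parse": 2, "stage_a": 4, "stage_b": 3, "merge": 1}
deps = {"stage_a": {"parse"}, "stage_b": {"parse"}, "merge": {"stage_a", "stage_b"}}

sequential = simulate(costs, deps, processors=1)
parallel = simulate(costs, deps, processors=2)
```

Comparing `sequential` with `parallel` gives the kind of estimate a specialist could use to judge whether porting an application is worthwhile. In this toy graph the gain is limited by the critical path through `parse`, `stage_a`, and `merge`.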
The framework helps specialists make informed decisions during the migration process. Each of the supported languages (C, C++, and Java) presents its own subtleties; in this presentation, we will mainly discuss the migration of C programs. The GCA and GS components are independently deployable: the GCA can be used for general-purpose application analysis, while the GS can be used for performance analysis and Grid infrastructure tuning.
The presentation will describe the components of GAMF in detail. It will be supplemented with a case study in which the framework is used for analysis and subsequent re-engineering.

