MOTH: Moving threads realization study, 2009-2011

MOTH is a subproject of Martti Forsell's ISPA (Implementation Study for PRAM-like on-chip architecture) project.

Funding

The Academy of Finland has funded the project as part of VTT led ISPA project.

Group

PhD (leader) Ville Leppänen Home page email	Jari-Matti Mäkelä Home page	Jani Paakkulainen
Martti Forsell	Martti Penttonen

Overview

MOTH project studies the realization of a new kind of approach for mapping the computing of an application to MP-SOC architectures. Instead of moving data read and write requests, we move extremely lightweight threads between the processor cores. Each processor core is coupled with memory module and parts of each memory module together form a virtual shared memory abstraction. Applications are written using a high-level language based on shared memory. As a consequence of moving threads instead of data we avoid all kinds of cache coherence problems.

In our architecture, the challenge of having efficient implementation of an application reduces to mapping the used data so that the need to move threads is balanced with respect to the bandwidth of the communication lines. This method also eliminates the need for separate reply network and introduces a natural way to exploit locality without sacrificing the synchronicity of the PRAM model.

In the moving threads approach, a multicore system consists of P processor cores that are connected to each other with some sparse network, e.g. with a butterly, a sparse mesh, a mesh of tree, etc. In traditional approaches, the messages correspond to read or write requests and replies, whereas in the moving threads approach, a message moves a thread consisting of a program counter, an id number, and a small set of registers. The messages in the moving threads approach are a little bit longer, but respectively there is no need for a network deliver the replies of read requests.

A cache-based access to the memory system is provided via each processor core. However, each core sees only a unique fraction of the overall memory space, and thus there are no cache coherence problems and when a thread makes a reference out of the scope of the core's memory area, the referencing thread must be moved to the core that can access that part of the main memory. Besides a cache to access the data memory, each core also has another cache for program instructions.

Each of the cores has Θ(s) threads to execute, and the threads are independent of each other -- i.e. the core can take any of them and advance its execution. By taking an instruction cyclically from each thread, the core can wait for memory access taking a long time (and even tolerate the delays caused by moving the threads). The key to hide the memory (as well as network and other) delayes is that the average number of threads s per core must be higher than the expected delay of executing a single instruction from any thread.

The less there is need to move the threads, the smaller can the slackness factor be. Thus, although the moving threads approach does not require it, it might be wise to allow careful design of the allocation of actual data used in the program, and thus allow the programmer to balance the workloads and to minimize the movement of data. We can e.g. assume that the program's address space is statically distributed into the memories accessible via cache modules attached to each core. The advantage of this is that the programmer can have influence on the physical allocation of data -- and consequently on the physical allocation of the work of each thread on the processor- storage modules.

For the creation and termination of threads in the programming language level, we take the approach of supporting only implicit termination as well as creation of threads. We do not consider Java-like explicit declaration of threads as first-class objects as a feasible solution. In practice, we have a parallel loop-like construction which creates threads with logical id-numbers in the interval [low,high] and each threads is running the same program block. The code in the program block can of course depend on the logical processor id-number. The id-numbers are program controlled, but the runtime system expects them to be unique at anytime during the program execution. We also consider supporting nested thread creations. Each thread faces an implict termination at the end of the program block (which was defined in the thread creation statement).

Outcomes

Simulator for the MOTH architecture

A cycle-accurate simulator's executable and sources for the architecture are also available. One can run the programs created with the forementioned compiler using the simulator on any architecture supporting the Java(tm) platform.

Compiler for the MOTH architecture

A compiler + runtime library exists for the moving threads architecture.