Heterogeneous Computing. Mohamed ZahranЧитать онлайн книгу.

and data consistency.

Table 1.1 shows a comparison between the current (volatile) memory technologies used for caches and main memory, namely, DRAM and SRAM, and the new nonvolatile memory (NVM) technologies. The numbers in the table are approximate and collected from different sources but for the most part are from Boukhobza et al. [2017]. Many of the nonvolatile memory technologies have much higher density than DRAM and SRAM; look at the cell size. They also have comparable read latency and even lower read power in most cases. There are several challenges in using NVM that need to be solved and are shown in the table. For instance, write endurance is much lower than DRAM and SRAM, which causes a reliability problem. The power needed for write is relatively high in NVM. Consistency is also a big issue. When there is a power outage, we know that the data in DRAM and SRAM are gone. But for NVM when do not know whether the data stored are stale or updated. The power may have gone off while in the middle of a data update. A lot of research is needed to address these challenges. NVM can be used in the memory hierarchy at a level by itself, for example, as a last-level-cache (LLC) or in main memory, which is a vertical integration. NVM can also be used in tandem with traditional DRAM or SRAM, which is a horizontal integration. The integration of NVM in the memory hierarchy can be managed by the hardware, managed by the operating system, or left to the programmer to decide where to place the data. The first two cases are beyond the programmer’s control. In the near future, memory hierarchy is expected to include volatile and nonvolatile memories, adding to the heterogeneity of the memory system.

Table 1.1:Comparison of Several Memory Technologies

Figure 1.2 shows a summary of the factors that we have just discussed.

Figure 1.2Factors Introducing Heterogeneity in Memory

1.3Heterogeneity Within Our Control

In the previous section we explored what happens under the hood that makes the system heterogeneous in nature. In this section we explore factors that are under our control and make us use the heterogeneity of the system. There is a big debate on how much control to give the programmer. The more control the better the performance and power efficiency we may get, depending of course on the expertise of the programmer, and the less the productivity. We discuss this issue later in the book. For this section we explore, from a programmer perspective, what we can control.

1.3.1The Algorithm and The Language

When you want to solve a program, you can find several algorithms for that. For instance, look at how many sorting algorithms we have. You decide which algorithm to pick. We have to be very careful here. In the good old days of sequential programming, our main issues were the big-O notation. This means we need to optimize for the amount of computations done. In parallel computing, computation is no longer the most expensive operation. Communication among computing nodes (or cores) and memory access are more expensive than computation. Therefore, it is sometimes wiser to pick a worse algorithm in terms of computation if it has a better communication pattern (i.e., less communication) and a better memory access pattern (i.e., locality). You can even find some algorithms with the same big-O, but one of them is an order of magnitude slower than the other.

Once you pick your algorithm, or set of algorithms in the case of more sophisticated applications, you need to translate it to a program using one of the many parallel programming languages available (and counting!). Here also you are in control: which language to pick. There are several issues to take into account when picking a programming language for your project. The first is how suitable this language is for the algorithm at hand. Any language can implement anything. This applies to sequential and parallel languages. But some languages are much easier than others for some tasks. For example, if you want to count the number of times a specific pattern of characters appears in a text file, you can write a C program to do it. But a small Perl or Python script will do the job in much fewer lines. If you want less control but higher productivity, you can pick some languages with a higher level of abstraction (like Java, Scala, Python, etc.) or application-specific languages. On the other hand, the brave souls who are using PThreads, OpenMP, OpenCL, CUDA, etc., have more control yet the programs are more sophisticated.

Algorithm 1AX= Y: Matrix-Vector Multiplication

for i = 0 to m- 1 do

y[i] = 0;

for j = 0 to n - 1 do

y[i] += A[i][j] * X[j];

end for

1.3.2The Computing Nodes

When you pick an algorithm and a programming language, you already have in mind the type of computing nodes you will be using. A program, or part of a program, can have data parallelism (single thread–multiple data), so it is a good fit for graphics processing units (GPUs). Algorithm 1.1 shows a matrix ( m × n) vector multiplication, which is a textbook definition of data parallelism. As a programmer, you may decide to execute it on a GPU or a traditional multicore. Your decision depends on the amount of parallelism available; in our case, it is the matrix dimension. If the amount of parallelism is not very big, it will not overcome the overhead of moving the data from the main memory to the GPU memory or the overhead of the GPU accessing the main memory (if your GPU and runtime supports that). You are in control.

If you have an application that needs to handle a vast amount of streaming data, like real-time network packet analysis, you may decide to use a field-programmable gate array (FPGA).

With a heterogeneous computing system, you have control of which computing node to choose for each part of your parallel application. You may decide not to use this control and use a high-abstraction language or workflow that does this assignment on your behalf for the sake of productivity—your productivity. However, in many cases an automated tool does not produce better results than a human expert, at least so far.

1.3.3The Cores in Multicore

Let’s assume that you decided to run your application on a multicore processor. You have another level of control: to decide which thread (or process) to assign to which core. In many parallel programming languages, programmers are not even aware that they have this control. For example, in OpenMP there is something called thread affinity that allows the programmer to decide how threads are assigned to cores (and sockets in the case of a multisocket system). This is done by setting some environment variables. If you use PThreads, there are APIs that help you assign thread to cores, such as pthread_setaffinity_np().

Not all the languages allow you this control though. If you are writing in CUDA, for example, you cannot guarantee on which streaming multiprocessor(SM), which is a group of execution units in NVIDIA parlance, your block of threads will execute on. But remember, you have the choice to pick the programming language you want. So, if you want this control, you can pick a language that allows you to have it. You have to keep in mind though that sometimes your thread assignments may be override by the OS or the hardware for different reasons such as thread migration due to temperature, high overhead on the machine from other programs running concurrently with yours, etc.

1.4Seems Like Part of a Solution to Exascale Computing

If we look at the list of the top 500 supercomputers in the world,¹ we realize that we are in the petascale era. That is, the peak performance that such a machine can reach is on the order of 1015 floating point operations per second (FLOPS). This list is updated twice a year. Figure 1.3

Скачать книгу