

Source: Yang C T, Huang C L, Lin C F, et al. Hybrid Parallel Programming on GPU Clusters[C]// International Symposium on Parallel and Distributed Processing with Applications. IEEE Computer Society, 2010: 142-147.

Hybrid Parallel Programming on GPU Clusters

Abstract—Nowadays, NVIDIA's CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions - a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to an MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.

Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming

INTRODUCTION

Nowadays, NVIDIA's CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions - a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes.
NVIDIA constructs its CUDA chips from hundreds of cores, and here we will try to use the computing devices NVIDIA provides for parallel computing. This paper proposes a solution to not only simplify the use of hardware acceleration in conventional general-purpose applications, but also to keep the application code portable. We propose a parallel programming approach using hybrid CUDA, OpenMP and MPI programming, which partitions loop iterations according to the performance weighting of the multi-core nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to a computational node at each scheduling step depends on the number of processor cores in that node.
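The partitioning scheme just described can be made concrete with a short sketch. The following is our illustration, not the authors' code: two MPI ranks split a loop by hypothetical per-node performance weights, and OpenMP threads process each rank's share on that node's cores.

/* Minimal sketch of weighted loop partitioning with MPI + OpenMP.
 * The weights are illustrative placeholders; the paper estimates them
 * with performance functions. Run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* sketch assumes 2 nodes */

    /* Hypothetical performance weights for the two nodes. */
    double weight[2] = {0.6, 0.4};
    double prefix = (rank == 0) ? 0.0 : weight[0];
    int begin = (int)(prefix * N);
    int end   = (int)((prefix + weight[rank]) * N);

    /* Iterations assigned to this MPI process are shared among the
     * node's processor cores by OpenMP threads. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = begin; i < end; i++)
        local += (double)i * i;

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of squares = %e\n", total);
    MPI_Finalize();
    return 0;
}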
In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes, only slave nodes do computation work. Empirical results show that in heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.

The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application program are presented. Concluding remarks and future work are given in Section 5.
BACKGROUND REVIEW

A. History of GPU and CUDA

In the past, we had to use multiple computers with multiple CPUs for parallel computing. As the history of display chips shows, the earliest displays did not need much computation; then, gradually, games and graphics created demand, and with the need for 3D the 3D accelerator card appeared. Display processing moved step by step onto dedicated chips - first separate display chips, and eventually chips comparable to the CPU itself: the GPU.

We know that GPU computing can be used to get the answers we want, but why do we choose the GPU? A comparison of current CPUs and GPUs makes the case. First, a CPU today has at most eight cores, while a GPU has grown to 260 cores; from the core count alone we can see that a GPU can run a great many threads in parallel, and although each GPU core runs at a lower clock rate, this large amount of parallel computing power can outweigh a single fast core. Second, GPU memory offers far higher bandwidth: GPU memory access is roughly ten times faster than the CPU's, an overall difference of about 90 GB/s, which is a remarkable gap. This also means that a good GPU can reduce the time required when a computation accesses large amounts of data.

A CPU uses advanced flow control, such as branch prediction or delayed branches, and a large cache to reduce memory access latency, whereas a GPU has a relatively small cache and much simpler flow control. The GPU's method is instead to use a large number of threads to cover up the problem of memory latency: suppose a GPU memory access takes 5 seconds; if 100 threads access memory simultaneously, the total time is still 5 seconds. Suppose instead that a CPU memory access takes 0.1 seconds; if 100 threads access memory one after another, the time becomes 10 seconds. GPU parallel processing can therefore hide memory latency and even beat the CPU's effective memory access speed. The GPU is designed so that more transistors are devoted to data processing rather than to data caching and flow control, as shown in Figure 1.

Therefore, to exploit the GPU's advantage in arithmetic logic, we try to use NVIDIA's many available cores to help us with heavy computation: we write programs for those cores and use the parallel programming API that NVIDIA Corporation provides to carry out the large number of operations.

Must we use the form of GPU computing provided by NVIDIA Corporation? Not really. We can use NVIDIA's CUDA, ATI's CTM, or OpenCL (Open Computing Language) proposed by Apple. CUDA was developed earliest and is the language most people use at this stage, but NVIDIA CUDA only supports NVIDIA's own graphics cards, and at this stage nearly all GPU cards used for computation are NVIDIA's. ATI also developed its own language, CTM, and Apple proposed OpenCL, which is now supported by both NVIDIA and ATI; ATI, for its part, has abandoned CTM in favor of OpenCL. GPUs commonly supported only single-precision floating point, but since precision is a very important metric in scientific computing, the compute cards introduced this year must support double-precision floating-point arithmetic.
B. CUDA Programming

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT and CUBLAS. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces, including OpenGL and Direct Compute. CUDA's parallel programming model is designed to overcome the challenge of scaling to many cores while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions - a hierarchy of thread blocks, shared memory, and barrier synchronization - that are simply exposed to the programmer as a minimal set of language extensions.
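These three abstractions can be seen in a few lines of device code. The kernel below is our minimal sketch, not code from the paper: each thread block stages a tile of the input in shared memory, synchronizes at barriers, and reduces its tile to one partial sum per block.

/* Sketch of CUDA's key abstractions: a grid of thread blocks,
 * __shared__ memory, and __syncthreads() barriers.
 * Assumes blockDim.x == 256 (a power of two). */
#include <cuda_runtime.h>

__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[256];                 /* per-block shared memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f; /* stage one tile */
    __syncthreads();                           /* barrier: tile loaded */

    /* Tree reduction within the block. */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                       /* barrier between steps */
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];              /* one partial sum per block */
}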
C. CUDA Processing flow

The CUDA processing flow is described in Figure 3. The first step: copy data from main memory to GPU memory; second: the CPU instructs the GPU to process; third: the GPU executes in parallel on each core; finally: copy the result from GPU memory back to main memory.
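A host-side sketch of these four steps, driving the block_sum kernel from the previous example (ours, not the paper's; the two snippets together form one complete .cu program):

/* The four-step CUDA flow: (1) copy input from main memory to GPU
 * memory, (2) the CPU launches the kernel, (3) GPU cores execute in
 * parallel, (4) copy the result back to main memory. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *h_in  = (float *)malloc(n * sizeof(float));
    float *h_out = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));

    cudaMemcpy(d_in, h_in, n * sizeof(float),
               cudaMemcpyHostToDevice);                   /* step 1 */
    block_sum<<<blocks, threads>>>(d_in, d_out, n);       /* steps 2-3 */
    cudaMemcpy(h_out, d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);                   /* step 4 */

    float total = 0.0f;
    for (int i = 0; i < blocks; i++) total += h_out[i];
    printf("total = %.0f (expected %d)\n", total, n);

    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}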
SYSTEM HARDWARE

A. Tesla C1060 GPU Computing Processor

The NVIDIA® Tesla™ C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk side that is much faster and more energy-efficient than a shared cluster in the data center. The NVIDIA Tesla C1060 computing processor board, which consists of 240 cores, is a PCI Express 2.0 form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 delivers 933 GFLOPs of processing performance and comes standard with 4 GB of GDDR3 memory at 102 GB/s of bandwidth.

A computer system with an available PCI Express x16 slot is required for the Tesla C1060. For the best system bandwidth between the host processor and the Tesla C1060, it is recommended (but not required) that the Tesla C1060 be installed in a PCI Express x16 Gen2 slot. The Tesla C1060 is based on the massively parallel, many-core Tesla processor, which is coupled with the standard CUDA C programming [14] environment to simplify many-core programming.

B. Tesla S1070 GPU Computing System

The NVIDIA® Tesla™ S1070 computing system speeds the transition to energy-efficient parallel computing. With 960 processor cores and a standard C compiler that simplifies application development, the Tesla S1070 scales to solve the world's most important computing challenges more quickly and accurately. The NVIDIA Tesla S1070 computing system is a 1U rack-mount system with four Tesla T10 computing processors. This system connects to one or two host systems via one or two PCI Express cables. A Host Interface Card (HIC) is used to connect each PCI Express cable to a host; the HIC is compatible with both PCI Express 1x and PCI Express 2x systems.

The Tesla S1070 GPU computing system is based on the T10 GPU from NVIDIA. It can be connected to a single host system via two PCI Express connections to that host, or to two separate host systems via one PCI Express connection to each host. Each NVIDIA switch and its corresponding PCI Express cable connects to two of the four GPUs in the Tesla S1070. If only one PCI Express cable is connected to the Tesla S1070, only two of the GPUs will be used; to connect all four GPUs to a single host system, the host must have two available PCI Express slots and be configured with two cables.

CONCLUSIONS
In conclusion, we propose a parallel programming approach using hybrid CUDA and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. During the experiments, loop iterations assigned to an MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node. The experiments reveal that hybrid multi-core GPU parallel programming, processed with OpenMP and MPI, is a powerful approach for composing high-performance clusters.