NVBLAS paper notes
NVBLAS
Fatman vs LittleBoy
How is it structured internally?
No direct equivalent found; if it is just a wrapper, I need to look into how it actually works with Spark.
You can use OpenBLAS and NVBLAS together: NVBLAS offloads the BLAS3 calls it supports to the GPU and routes everything else to the CPU BLAS named in its configuration.
Performance is poor on sparse matrices.
You might need to pick an appropriate value of NVBLAS_TILE_DIM in nvblas.conf, because for older GPUs the default value is too big and some operations might return zero.
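Pulling these notes together, a minimal nvblas.conf sketch; the keys are the documented NVBLAS options, while the library path and tile size are placeholders to adapt:

```
# nvblas.conf -- NVBLAS offloads supported BLAS3 calls to the GPU
# and forwards everything else to the CPU BLAS named here.
NVBLAS_LOGFILE             nvblas.log
NVBLAS_CPU_BLAS_LIB        /usr/lib/libopenblas.so  # CPU fallback (path is an assumption)
NVBLAS_GPU_LIST            ALL                      # use every visible GPU
NVBLAS_TILE_DIM            2048                     # lower this on older GPUs if results come back zero
NVBLAS_AUTOPIN_MEM_ENABLED                          # pin host memory for faster transfers
```

NVBLAS is then enabled by preloading libnvblas.so in front of the target process (e.g. via LD_PRELOAD, with NVBLAS_CONFIG_FILE pointing at this file).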
CublasXT is a set of routines which accelerate Level 3 BLAS (Basic Linear Algebra Subroutine) calls by spreading work across more than one GPU. By using a streaming design, cublasXT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored on the host’s system memory. This provides out-of-core operation – the size of operand data is only limited by system memory size, not by GPU on-board memory size.
netlib-java is a wrapper for low-level BLAS, LAPACK and ARPACK that performs as fast as the C / Fortran interfaces with a pure JVM fallback. netlib-java is included with recent versions of Apache Spark.
Apache Spark makes use of a component called netlib-java, which provides a Java API for linear algebra routines such as BLAS, LAPACK, etc. The netlib-java package doesn't implement these directly, but rather delegates incoming calls to one of three implementations, in the following order:

1. a machine-optimised system library (e.g. OpenBLAS or Intel MKL), loaded through JNI;
2. the self-contained native reference build that ships with netlib-java;
3. a pure-JVM fallback (F2J).
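A quick way to check which backend was actually picked up, using netlib-java's own API (runs as-is in spark-shell):

```scala
// Prints e.g. ...NativeSystemBLAS, ...NativeRefBLAS, or ...F2jBLAS
import com.github.fommil.netlib.BLAS
println(BLAS.getInstance().getClass.getName)
```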
On the other hand, we observe that, for sparse matrices with 5% of non-zero terms, hardware acceleration may not increase the performance much. The Scala implementation of matrix multiplication in Spark is faster than or comparable to OpenBLAS, and much faster than NVBLAS, as reported in [43]. This is mainly because NVBLAS pads zero terms into the matrix and sends dense matrix blocks to the GPU, which causes extra data transfer between GPU and CPU, as well as extra computation on the GPU, and thus affects performance.
How is a sparse matrix represented? (see the sketch below)
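As a partial answer: MLlib represents local sparse matrices in CSC (compressed sparse column) form. A minimal sketch using the Matrices.sparse factory (the values are illustrative):

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// A 3x2 matrix holding two non-zeros, in CSC layout:
// colPtrs says where each column's entries begin in (rowIndices, values).
val sparse: Matrix = Matrices.sparse(
  3, 2,            // numRows, numCols
  Array(0, 1, 2),  // colPtrs: column 0 owns entry [0,1), column 1 owns [1,2)
  Array(0, 2),     // row index of each non-zero
  Array(9.0, 6.0)  // the non-zero values: (0,0)=9.0 and (2,1)=6.0
)
```

For context, the "two warnings" referred to next are Spark MLlib's standard messages when the native netlib-java backends fail to load:

```
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
```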
The two warnings above mean that the first two implementations were not usable, and MLlib is using the pure-Java implementation (F2J) under the covers.
If you have a modern Nvidia GPU in your system, you might be considering using Nvidia's cuBLAS implementation. This is possible through Nvidia's NVBLAS library, and how to configure it is described in the netlib-java documentation. However, I don't recommend going this route for Apache Spark. If you take a look at section 3 of the NVBLAS documentation, you'll see that only a handful of BLAS3 routines get sent to the GPU: gemm, syrk, herk, syr2k, her2k, trsm, trmm, symm, and hemm. Of these, only gemm is available in Spark's MLlib, and only if explicitly used by the end user; MLlib doesn't make use of any of these internally. This means that all other BLAS calls will get directed to the backup implementation that you have defined in your nvblas configuration file, and won't run on the GPU. If you're on a RedHat system like me, this means you'll likely be building and using the reference CBLAS library, and not using a better-performing library such as OpenBLAS. So unless you have an application that sends a large amount of data through explicit calls to Matrix.multiply(y: DenseMatrix), you're not gaining anything for your efforts to get cuBLAS working in Spark.
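The one MLlib path that does reach gemm is the explicit local multiply the quote mentions. A minimal sketch (sizes and seeds are arbitrary):

```scala
import java.util.Random
import org.apache.spark.mllib.linalg.DenseMatrix

// Only this explicit call dispatches to BLAS3 gemm, the single
// MLlib-reachable routine NVBLAS would send to the GPU.
val a = DenseMatrix.rand(2048, 2048, new Random(0))
val b = DenseMatrix.rand(2048, 2048, new Random(1))
val c = a.multiply(b) // netlib-java dgemm under the covers
```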
Discussion: In Spark, the job of multiplication on a distributed matrix is divided into a number of tasks, where each task calculates the multiplication of small block matrices. The default size of each block is 1024 × 1024. The number of tasks generated by Spark increases as the matrix size increases. As the number of tasks increases, data-shuffle overhead increases, diminishing the advantage of leveraging hardware accelerators for performing high-throughput computations. However, even with this shuffle and framework overhead, hardware acceleration can still speed up overall performance by more than 2×, as shown in Section IV-A. On the flip side, we find that for matrices with small sizes, OpenBLAS performs similarly to or faster than NVBLAS, but for the 66331×66331 matrix, NVBLAS actually spends less time on computation than OpenBLAS. Thus, we expect the benefits of utilizing high-throughput accelerators such as GPUs to grow with matrix size.
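To make the block structure concrete, a sketch of distributed block multiplication in MLlib; the 1024×1024 block size matches the default the paper refers to, and the entries are placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

def blockProduct(sc: SparkContext) = {
  // Placeholder entries; a real job would load these from data.
  val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))

  // Split each matrix into (at most) 1024 x 1024 local blocks.
  val left  = new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()
  val right = new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()

  // Each task multiplies a pair of local blocks; more blocks means
  // more tasks and more shuffle traffic between the stages.
  left.multiply(right)
}
```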
Since Spark is not aware of any GPU devices, it assumes that each CPU core performs the multiplications assigned to it independently of the others. In actuality, however, all CPU cores delegate their matrix-multiplication tasks to a single GPU device. This can create a queue of tasks at the GPU, and ultimately means that Spark makes its workload-delegation decisions based on false assumptions about the capabilities of its resources.
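One possible mitigation (my note, not the paper's): reduce how many tasks run concurrently per executor so fewer CPU threads contend for the single GPU. spark.task.cpus and spark.executor.cores are standard Spark settings; whether this actually helps is an assumption to benchmark:

```
spark-submit \
  --conf spark.task.cpus=4 \      # each task reserves 4 cores -> fewer concurrent gemm calls per node
  --conf spark.executor.cores=8 \ # so at most 2 tasks share the GPU at a time
  ...
```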
Is OpenBLAS even installed? It's not in the Ubuntu image.