Matrix multiplication on GPU using CUDA with CUBLAS, CURAND and Thrust
Posted on May 31, 2012 by Sol
The code for this tutorial is on GitHub: https://github.com/sol-prog/cuda_cublas_curand_thrust.
Matrix multiplication is an essential building block for numerous numerical algorithms, which is why most numerical libraries implement it. One of the oldest and most widely used matrix multiplication routines, GEMM, is found in the BLAS library. While the reference BLAS implementation is not particularly fast, there are a number of optimized third-party BLAS implementations like MKL from Intel, ACML from AMD or CUBLAS from NVIDIA.
In this post I’m going to show you how you can multiply two matrices on a CUDA device with CUBLAS. A typical approach is to create three arrays on the CPU (the host in CUDA terminology), initialize them, copy the arrays to the GPU (the device in CUDA terminology), do the actual matrix multiplication on the GPU and finally copy the result back to the CPU. Our first example follows this algorithm; in a second example we are going to significantly simplify the low-level memory manipulation required by CUDA using Thrust, which aims to be a replacement for the C++ STL on the GPU.
Let’s start by allocating space for our three arrays on CPU and GPU:
1 #include <cstdlib>
2
3 int main() {
4 // Allocate 3 arrays on CPU
5 int nr_rows_A, nr_cols_A, nr_rows_B, nr_cols_B, nr_rows_C, nr_cols_C;
6
7 // for simplicity we are going to use square arrays
8 nr_rows_A = nr_cols_A = nr_rows_B = nr_cols_B = nr_rows_C = nr_cols_C = 3;
9
10 float *h_A = (float *)malloc(nr_rows_A * nr_cols_A * sizeof(float));
11 float *h_B = (float *)malloc(nr_rows_B * nr_cols_B * sizeof(float));
12 float *h_C = (float *)malloc(nr_rows_C * nr_cols_C * sizeof(float));
13
14 // Allocate 3 arrays on GPU
15 float *d_A, *d_B, *d_C;
16 cudaMalloc(&d_A,nr_rows_A * nr_cols_A * sizeof(float));
17 cudaMalloc(&d_B,nr_rows_B * nr_cols_B * sizeof(float));
18 cudaMalloc(&d_C,nr_rows_C * nr_cols_C * sizeof(float));
19
20 // ....
21
22 //Free GPU memory
23 cudaFree(d_A);
24 cudaFree(d_B);
25 cudaFree(d_C);
26
27 // Free CPU memory
28 free(h_A);
29 free(h_B);
30 free(h_C);
31
32 return 0;
33 }
Please note the way in which we allocate memory for the data: on the CPU using C’s malloc (line 10) and on the GPU using CUDA’s cudaMalloc (line 16). At the end of the main function we free the device memory with cudaFree and the host memory with free. The above code uses the d_ prefix for the arrays located on the GPU and h_ for the host (CPU) arrays. We use 3x3 arrays in this example for simplicity; in a real application you should use much larger arrays in order to use the device efficiently.
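For brevity, the snippets in this post do not check return codes, but both malloc and the CUDA API calls can fail. A minimal sketch of a helper one could use to guard the CUDA calls (check_cuda is a hypothetical name I made up, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a readable message if a CUDA call fails
void check_cuda(cudaError_t status, const char *msg) {
    if (status != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(status));
        std::exit(EXIT_FAILURE);
    }
}

// Example usage:
// check_cuda(cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float)), "cudaMalloc d_A");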
The next step is to initialize the arrays A and B with some values; we are going to fill them with random numbers on the GPU using the CURAND library. In order to do this we can write a function that receives a device array as input; after the function is executed the array will be filled with random numbers:
...
#include <ctime>
#include <curand.h>
...

// Fill the array A(nr_rows_A, nr_cols_A) with random numbers on GPU
void GPU_fill_rand(float *A, int nr_rows_A, int nr_cols_A) {
    // Create a pseudo-random number generator
    curandGenerator_t prng;
    curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);

    // Set the seed for the random number generator using the system clock
    curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) clock());

    // Fill the array with uniform random numbers in (0, 1] on the device
    curandGenerateUniform(prng, A, nr_rows_A * nr_cols_A);

    // Destroy the generator now that the array is filled
    curandDestroyGenerator(prng);
}

...
If the arrays A and B are already filled with meaningful data on the CPU, or you simply want to fill them with random data on the CPU and avoid the use of CURAND, you will need to transfer the host arrays to the GPU:
cudaMemcpy(d_A, h_A, nr_rows_A * nr_cols_A * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nr_rows_B * nr_cols_B * sizeof(float), cudaMemcpyHostToDevice);
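For example, a minimal sketch of filling a host array with random values on the CPU before the copy (CPU_fill_rand is a hypothetical helper, named to mirror GPU_fill_rand):

#include <cstdlib>

// Fill a host array with uniform random numbers in [0, 1]
void CPU_fill_rand(float *A, int nr_rows_A, int nr_cols_A) {
    for (int i = 0; i < nr_rows_A * nr_cols_A; ++i) {
        A[i] = (float)rand() / (float)RAND_MAX;
    }
}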
As a side note, BLAS internally uses column-major storage (Fortran order) and not the row-major storage typical for multidimensional arrays in C or C++.
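Concretely, for an m x n column-major matrix the element at row i and column j is stored at index i + j * m; this is the indexing used by the print_matrix function below. For example:

// The 2x2 matrix
// | 1 3 |
// | 2 4 |
// stored in column-major order: the columns are contiguous in memory
float A[4] = {1, 2, 3, 4};
// The same matrix in C's row-major order would be {1, 3, 2, 4}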
Now that the arrays A and B are initialized and transferred to the GPU, we can write a function that will do the actual multiplication:
1 ...
2 #include <cublas_v2.h>
3 ...
4
5 // Multiply the arrays A and B on GPU and save the result in C
6 // C(m,n) = A(m,k) * B(k,n)
7 void gpu_blas_mmul(const float *A, const float *B, float *C, const int m, const int k, const int n) {
8 int lda=m,ldb=k,ldc=m;
9 const float alf = 1;
10 const float bet = 0;
11 const float *alpha = &alf;
12 const float *beta = &bet;
13
14 // Create a handle for CUBLAS
15 cublasHandle_t handle;
16 cublasCreate(&handle);
17
18 // Do the actual multiplication
19 cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
20
21 // Destroy the handle
22 cublasDestroy(handle);
23 }
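For reference, cublasSgemm computes C = alpha * op(A) * op(B) + beta * C, where op optionally transposes its argument. With CUBLAS_OP_N (no transpose), alpha = 1 and beta = 0, as above, this reduces to the plain product C = A * B. The parameters lda, ldb and ldc are the leading dimensions, i.e. the number of rows of each column-major, non-transposed matrix.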
Observation - If you need to do more than one matrix multiplication in your code, it is advisable to move the create/destroy handle code (lines 15 - 16 and 22) from the above function into the main function and to use the same handle for all multiplications. In this case the gpu_blas_mmul function becomes:
void gpu_blas_mmul(cublasHandle_t &handle, const float *A, const float *B, float *C, const int m, const int k, const int n) {
    int lda = m, ldb = k, ldc = m;
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta = &bet;

    // Do the actual multiplication
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
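A minimal sketch of how this version could be used from main, creating the handle once and reusing it for every multiplication:

// Create the handle once in main
cublasHandle_t handle;
cublasCreate(&handle);

// Reuse the same handle for as many multiplications as needed
gpu_blas_mmul(handle, d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);
// ... more multiplications with the same handle ...

// Destroy the handle when all the multiplications are done
cublasDestroy(handle);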
We can also implement a utility function for printing the result of the multiplication on the standard output:
...
#include <iostream>
...

// Print the matrix A(nr_rows_A, nr_cols_A), stored in column-major format
void print_matrix(const float *A, int nr_rows_A, int nr_cols_A) {
    for (int i = 0; i < nr_rows_A; ++i) {
        for (int j = 0; j < nr_cols_A; ++j) {
            std::cout << A[j * nr_rows_A + i] << " ";
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;
}
Using the above pieces, we present here for reference the complete main function:
int main() {
    // Allocate 3 arrays on CPU
    int nr_rows_A, nr_cols_A, nr_rows_B, nr_cols_B, nr_rows_C, nr_cols_C;

    // For simplicity we are going to use square arrays
    nr_rows_A = nr_cols_A = nr_rows_B = nr_cols_B = nr_rows_C = nr_cols_C = 3;

    float *h_A = (float *)malloc(nr_rows_A * nr_cols_A * sizeof(float));
    float *h_B = (float *)malloc(nr_rows_B * nr_cols_B * sizeof(float));
    float *h_C = (float *)malloc(nr_rows_C * nr_cols_C * sizeof(float));

    // Allocate 3 arrays on GPU
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float));
    cudaMalloc(&d_B, nr_rows_B * nr_cols_B * sizeof(float));
    cudaMalloc(&d_C, nr_rows_C * nr_cols_C * sizeof(float));

    // Fill the arrays A and B on GPU with random numbers
    GPU_fill_rand(d_A, nr_rows_A, nr_cols_A);
    GPU_fill_rand(d_B, nr_rows_B, nr_cols_B);

    // Optionally we can copy the data back on CPU and print the arrays
    cudaMemcpy(h_A, d_A, nr_rows_A * nr_cols_A * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_B, d_B, nr_rows_B * nr_cols_B * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "A =" << std::endl;
    print_matrix(h_A, nr_rows_A, nr_cols_A);
    std::cout << "B =" << std::endl;
    print_matrix(h_B, nr_rows_B, nr_cols_B);

    // Multiply A and B on GPU
    gpu_blas_mmul(d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);

    // Copy (and print) the result on host memory
    cudaMemcpy(h_C, d_C, nr_rows_C * nr_cols_C * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "C =" << std::endl;
    print_matrix(h_C, nr_rows_C, nr_cols_C);

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free CPU memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
To compile the above example, open a Terminal (Mac and Linux) or a Visual Studio Command Prompt window (on Windows) and write:
nvcc mmul_1.cu -lcublas -lcurand -o mmul_1
This is the result of running the code on my Mac:
sols-MacBook-Pro:Desktop sol$ nvcc mmul_1.cu -lcublas -lcurand -o mmul_1
sols-MacBook-Pro:Desktop sol$ ./mmul_1
A =
0.365931 0.887901 0.48295
0.610939 0.453606 0.978245
0.626738 0.275708 0.804269

B =
0.975249 0.504392 0.520455
0.423145 0.328515 0.128649
0.767074 0.0733948 0.217057

C =
1.10304 0.511707 0.409506
1.53815 0.528967 0.588657
1.34482 0.465725 0.53623

sols-MacBook-Pro:Desktop sol$
As mentioned in the introduction, we can use Thrust to further simplify the above code. For example, we can completely avoid the need to manually manage memory on the host and the device by using a Thrust vector for storing our data. Reimplementing the above example with Thrust halves the number of lines of code in the main function:
...
#include <thrust/device_vector.h>
...

int main() {
    // Allocate 3 arrays on CPU
    int nr_rows_A, nr_cols_A, nr_rows_B, nr_cols_B, nr_rows_C, nr_cols_C;

    // For simplicity we are going to use square arrays
    nr_rows_A = nr_cols_A = nr_rows_B = nr_cols_B = nr_rows_C = nr_cols_C = 3;

    thrust::device_vector<float> d_A(nr_rows_A * nr_cols_A), d_B(nr_rows_B * nr_cols_B), d_C(nr_rows_C * nr_cols_C);

    // Fill the arrays A and B on GPU with random numbers
    GPU_fill_rand(thrust::raw_pointer_cast(&d_A[0]), nr_rows_A, nr_cols_A);
    GPU_fill_rand(thrust::raw_pointer_cast(&d_B[0]), nr_rows_B, nr_cols_B);

    // Optionally we can print the data
    std::cout << "A =" << std::endl;
    print_matrix(d_A, nr_rows_A, nr_cols_A);
    std::cout << "B =" << std::endl;
    print_matrix(d_B, nr_rows_B, nr_cols_B);

    // Multiply A and B on GPU
    gpu_blas_mmul(thrust::raw_pointer_cast(&d_A[0]), thrust::raw_pointer_cast(&d_B[0]), thrust::raw_pointer_cast(&d_C[0]), nr_rows_A, nr_cols_A, nr_cols_B);

    // Print the result
    std::cout << "C =" << std::endl;
    print_matrix(d_C, nr_rows_C, nr_cols_C);

    return 0;
}
Please note the way in which we use thrust::raw_pointer_cast in the above code to convert a Thrust device_vector to a raw CUDA device pointer.
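One detail worth mentioning: the print_matrix calls above pass a thrust::device_vector, while the version presented earlier takes a const float *. A possible way to make one function serve both cases is a templated version (a sketch under my own assumptions, not necessarily what the GitHub repository does); element access through operator[] works for a raw host pointer as well as for a device_vector:

#include <iostream>

// Works for a raw host array as well as for a thrust::device_vector;
// for a device_vector, operator[] reads a single element from the device,
// which is correct but slow for large matrices
template <typename T>
void print_matrix(const T &A, int nr_rows_A, int nr_cols_A) {
    for (int i = 0; i < nr_rows_A; ++i) {
        for (int j = 0; j < nr_cols_A; ++j) {
            std::cout << A[j * nr_rows_A + i] << " ";
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;
}

Alternatively, as one commenter notes below, you can copy the whole vector to the host first (thrust::host_vector<float> h_A = d_A;) and print from the host copy.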
If you are interested in learning CUDA, I would recommend reading CUDA Application Design and Development by Rob Farber, or CUDA by Example: An Introduction to General-Purpose GPU Programming by J. Sanders and E. Kandrot.
-
Very nice tutorial! I have one question: can one use Thrust to perform lots of matrix-vector multiplications in parallel?
-
If I understood correctly how Thrust works, the multiplications should be non-blocking (asynchronous); if your CUDA device has enough juice (threads available), they will run in parallel. You can also ask this on their forums (or mailing lists) to be sure.
-
Considering an implementation using the thrust::complex type, how would you call cublasCgemm, which is the cuComplex version of cublasSgemm? I mean, how would you cast between thrust::complex and cuComplex?
-
Very nice example! However, your print_matrix code does not work with the Thrust additions. Doesn't the data need to be moved back to the host by copying it to a host_vector before it can be printed out?
-
When you work with Thrust vectors you can access the vector elements directly (no matter whether the vector is on the CPU or on the GPU).
-
I got bus errors until I added code that looked as follows:
thrust::host_vector<float> h_A = d_A;
print_matrix(thrust::raw_pointer_cast(&h_A[0]), nr_rows_A, nr_cols_A);
-
Thanks for the follow-up; maybe your answer will help other people with the same problem. On my machine I was able to print elements directly from d_A.
-
This is very useful. Thanks!
-
One minor problem: shouldn't the seed setup move out of the GPU_fill_rand() function?
-
You can move it outside if you want; as it is implemented now, it will reinitialize the seed for every call, twice in this case.
-
I was looking all over the web for an easy to understand matrix multiplication example using CUBLAS. It was almost hopeless, but you got it. Thanks!
-
Thanks a lot for these tutorials, they are very helpful !
-
Did you mean to say "CPU" for the first comment?
-
Yes, definitely CPU :). Thanks.