What is the difference between CUDA cores and Tensor Cores?

Cuda, Gpu, Nvidia

Cuda Problem Overview


I am completely new to HPC terminology, but I just saw that AWS released a new EC2 instance type powered by the new Nvidia Tesla V100, which has two kinds of "cores": CUDA cores (5,120) and Tensor Cores (640). What is the difference between the two?

Cuda Solutions


Solution 1 - Cuda

At the time of writing, only the Tesla V100 and Titan V have tensor cores. Both GPUs have 5120 CUDA cores, where each core can perform up to one single-precision multiply-accumulate operation (e.g. in fp32: x += y * z) per GPU clock (e.g. the Tesla V100 PCIe frequency is 1.38 GHz).
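As a rough sanity check on those figures, here is a back-of-the-envelope sketch of the peak fp32 rate they imply. It assumes every CUDA core retires one multiply-accumulate on every clock and counts that as two floating-point operations; the function and constant names are made up for illustration, and real workloads will not reach this ceiling.

```python
# Peak fp32 throughput implied by the figures quoted above (illustrative only).
CUDA_CORES = 5120      # CUDA cores on a Tesla V100
CLOCK_HZ = 1.38e9      # Tesla V100 PCIe clock quoted above
FLOPS_PER_FMA = 2      # one multiply plus one add


def peak_fp32_flops(cores: int, clock_hz: float) -> float:
    """Peak rate if every core completes one multiply-accumulate per clock."""
    return cores * FLOPS_PER_FMA * clock_hz


if __name__ == "__main__":
    tflops = peak_fp32_flops(CUDA_CORES, CLOCK_HZ) / 1e12
    print(f"{tflops:.2f} TFLOPS")  # prints "14.13 TFLOPS"
```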

Each tensor core performs operations on small 4x4 matrices. Each tensor core can perform one matrix multiply-accumulate operation per GPU clock. It multiplies two 4x4 fp16 matrices and adds the fp32 product matrix (4x4) to an accumulator (which is also a 4x4 fp32 matrix).

It is called mixed precision because the input matrices are fp16 but the multiplication result and the accumulator are fp32 matrices.
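The mixed-precision multiply-accumulate described above can be emulated in software to make the data flow concrete. This is only a sketch: the helper names (`to_fp16`, `to_fp32`, `mma_4x4`) are hypothetical, the rounding is done with Python's `struct` module, and rounding the accumulator after every addition is an assumption of this emulation, not a claim about how the hardware rounds internally.

```python
import struct

N = 4  # tile size described above


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D = A*B + C with fp16 inputs (A, B) and an fp32 accumulator (C, D)."""
    d = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = to_fp32(c[i][j])  # accumulator starts as an fp32 value
            for k in range(N):
                # The product of two fp16 values fits exactly in fp32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)  # assumed: round to fp32 after each add
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    b = [[float(4 * i + j) for j in range(N)] for i in range(N)]
    ones = [[1.0] * N for _ in range(N)]
    for row in mma_4x4(identity, b, ones):  # identity * b + ones
        print(row)
```

Multiplying by the identity and adding a matrix of ones returns each entry of `b` plus one, which is an easy way to check the emulation by eye.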

A more fitting name would probably be just "4x4 matrix cores", but the NVIDIA marketing team decided to use "tensor cores".

Solution 2 - Cuda

GPUs have always been good for machine learning. GPU cores were originally designed for physics and graphics computation, which involves matrix operations. General computing tasks do not require lots of matrix operations, so CPUs are much slower at these. Physics and graphics are also far easier to parallelise than general computing tasks, which is what led to the high core count.

Due to the matrix-heavy nature of machine learning (neural nets), GPUs were a great fit. Tensor cores are just more heavily specialised for the types of computation involved in machine learning software (such as TensorFlow).

Nvidia have published a detailed blog post that goes into far more detail on how Tensor cores work and the performance improvements over CUDA cores.

Solution 3 - Cuda

CUDA cores:

Performs a single-value multiplication per GPU clock

1 x 1 per GPU clock

TENSOR cores:

Performs a whole matrix multiplication per GPU clock

[1 1 1       [1 1 1
 1 1 1   x    1 1 1    per GPU clock
 1 1 1]       1 1 1]

To be more precise, a TENSOR core does the computation of many CUDA cores at the same time.
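To put a number on "many", the sketch below counts the scalar multiply-adds hidden inside one small matrix multiply-accumulate. The picture above uses 3x3 matrices for illustration, while Solution 1 describes 4x4 tiles, so the function takes the size as a parameter. The function name is hypothetical, and the comparison takes the "one operation per clock" figures from Solution 1 at face value.

```python
def count_multiply_adds(n: int) -> int:
    """Count the scalar multiply-adds in one n x n product D = A*B + C."""
    count = 0
    for i in range(n):          # rows of A
        for j in range(n):      # columns of B
            for k in range(n):  # one multiply-add per term of the dot product
                count += 1
    return count


if __name__ == "__main__":
    print(count_multiply_adds(4))  # prints 64
```

Under that simplified model, a single 4x4 tensor core operation covers the same arithmetic as 64 CUDA-core multiply-accumulates.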

Solution 4 - Cuda

Tensor cores use a lot less computation power than CUDA cores, at the expense of precision, but that loss of precision doesn't have much effect on the final output.

This is why, for machine learning models, Tensor Cores are more effective at reducing cost without changing the output much.
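A small experiment can give a feel for how much precision is lost when inputs are rounded to fp16. The sketch below rounds two made-up vectors (called `weights` and `activations` here purely for illustration) to half precision and compares their dot product against a double-precision reference. It only models rounding of the inputs and says nothing about the accuracy of any particular model.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot(xs, ys) -> float:
    """Plain dot product, accumulated in Python's double precision."""
    return sum(x * y for x, y in zip(xs, ys))


def fp16_input_relative_error(xs, ys) -> float:
    """Relative error in the dot product caused by rounding the inputs to fp16."""
    reference = dot(xs, ys)
    rounded = dot([to_fp16(x) for x in xs], [to_fp16(y) for y in ys])
    return abs(rounded - reference) / abs(reference)


if __name__ == "__main__":
    # Hypothetical data: positive values well inside the fp16 range.
    weights = [(i + 1) / 7 for i in range(256)]
    activations = [1 / (i + 3) for i in range(256)]
    err = fp16_input_relative_error(weights, activations)
    print(f"relative error from fp16 inputs: {err:.2e}")
```

For positive inputs like these, each fp16 rounding changes a value by at most about 0.05%, so the relative error of the sum stays around a tenth of a percent or less.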

Google itself uses Tensor Processing Units (TPUs), its own accelerator hardware rather than NVIDIA Tensor Cores, for Google Translate.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author                  Original Content on Stackoverflow
Question            Simon Ernesto Cardenas Zarate    View Question on Stackoverflow
Solution 1 - Cuda   Artur                            View Answer on Stackoverflow
Solution 2 - Cuda   MikeS159                         View Answer on Stackoverflow
Solution 3 - Cuda   Sundar Santhanam                 View Answer on Stackoverflow
Solution 4 - Cuda   pranshu vinayak                  View Answer on Stackoverflow