NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks

Performance benchmarks are an informative way to compare new products on the market. With so many GPUs available, it can be hard to assess which is suited to your needs. Many benchmarks provide information for comparing performance on specific algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision, and with and without PCIe transfer time taken into account. This means that each benchmark can report up to four results per test. The benchmarks are organized into three levels and can be run individually or all together.
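To make the "up to four results" concrete, here is a minimal Python sketch that groups result names into those variants. It assumes the `result for <name>: <value> <unit>` line format and the `_sp`/`_dp`/`_pcie` suffix convention visible in the FFT results of the Tesla T4 run reproduced at the end of this article. The helper name `group_variants` is hypothetical, and result names without an explicit `_sp` or `_dp` suffix are simply skipped.

```python
import re
from collections import defaultdict

# Sample lines taken from the Tesla T4 FFT results in this article.
SAMPLE = """\
result for fft_sp: 660.0520 GFLOPS
result for fft_sp_pcie: 62.5926 GFLOPS
result for fft_dp: 132.5970 GFLOPS
result for fft_dp_pcie: 27.4628 GFLOPS
"""

# Assumed line format: "result for <name>: <value> <unit>".
LINE = re.compile(r"result for (\w+):\s+([\d.]+)\s+(\S+)")
# Assumed naming: <benchmark>_<sp|dp> with an optional _pcie suffix.
NAME = re.compile(r"(?P<bench>\w+?)_(?P<prec>sp|dp)(?P<pcie>_pcie)?$")


def group_variants(text):
    """Return {benchmark: {(precision, includes_pcie): value}}."""
    grouped = defaultdict(dict)
    for name, value, _unit in LINE.findall(text):
        m = NAME.match(name)
        if m is None:
            continue  # names outside the assumed convention are skipped
        key = (m.group("prec"), m.group("pcie") is not None)
        grouped[m.group("bench")][key] = float(value)
    return dict(grouped)


variants = group_variants(SAMPLE)
for (prec, pcie), value in sorted(variants["fft"].items()):
    label = "with PCIe transfer" if pcie else "kernel only"
    print(f"fft {prec} ({label}): {value:.2f} GFLOPS")
```

In this sample the PCIe-inclusive FFT figures are far lower than the kernel-only ones, which is why the two are reported separately.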

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300 W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the larger, more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70 W) at a lower price (~$2.5k). You'll want to use the right tool for the job, which will depend on your workload(s). A summary of each Tesla GPU is shown below.

Tesla V100: World's Most Advanced Datacenter GPU, for AI & HPC

Integrated in Microway NumberSmasher and OpenPOWER GPU Servers & GPU Clusters

Specifications (Tesla V100 SXM 2.0 GPU)

  • Up to 7.8 TFLOPS double- and 15.7 TFLOPS single-precision floating-point performance
  • Up to 125 TensorTFLOPS of Deep Learning Performance
  • NVIDIA Volta™ GPU architecture
  • 5120 CUDA cores, 640 Tensor Cores
  • 16 GB or 32 GB of on-die HBM2 GPU memory
  • Memory bandwidth up to 900 GB/s
  • NVIDIA NVLink™ or PCI-E x16 Gen3 interface to system
  • Available with enhanced NVLink interface, with 300 GB/sec bi-directional bandwidth to the GPU
  • Passive heatsink only, suitable for specially-designed GPU servers

Tesla T4: Price/Performance for AI and Single Precision

Integrated in Microway NumberSmasher and Navion GPU Servers & GPU Clusters

Specifications

  • Up to 8.1 TFLOPS single-precision floating-point performance
  • Up to 65 TensorTFLOPS of Deep Learning Training Performance; 260 INT4 TOPS of Inference Performance
  • NVIDIA “Turing” TU104 graphics processing unit (GPU)
  • 2560 CUDA cores, 320 Tensor Cores
  • 16 GB of GDDR6 GPU memory
  • Memory bandwidth up to 320 GB/s
  • PCI-E x16 Gen3 interface to system
  • Passive heatsink only, suitable for specially-designed GPU servers

Tesla P100: Strong Performance and Connectivity for HPC or AI

Integrated in Microway NumberSmasher and OpenPOWER GPU Servers & GPU Clusters

Specifications (Tesla P100 Socketed GPU)

  • Up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance
  • NVIDIA “Pascal” GP100 graphics processing unit (GPU)
  • 3584 CUDA cores
  • 12 GB or 16 GB of on-die HBM2 CoWoS GPU memory
  • Memory bandwidth up to 732 GB/s
  • NVLink or PCI-E x16 Gen3 interface to system
  • Passive heatsink only, suitable for specially-designed GPU servers

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning-specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA's “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly used HPC benchmarks are compared side by side for the three GPUs.

Double Precision Results

Benchmark (GFLOPS)       | Tesla T4 | Tesla V100 | Tesla P100
Max Flops                |   253.38 |    7072.86 |    4736.76
Fast Fourier Transform   |   132.60 |    1148.75 |     756.29
Matrix Multiplication    |   249.57 |    5920.01 |    4256.08
Molecular Dynamics       |   105.26 |     908.62 |     402.96
S3D                      |    59.97 |     227.85 |     161.54

Single Precision Results

Benchmark (GFLOPS)       | Tesla T4 | Tesla V100 | Tesla P100
Max Flops                |  8073.26 |   14016.50 |    9322.46
Fast Fourier Transform   |   660.05 |    2301.32 |    1510.49
Matrix Multiplication    |  3290.94 |   13480.40 |    8793.33
Molecular Dynamics       |   572.91 |     997.61 |     480.02
S3D                      |    99.42 |     434.78 |     295.20

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single-precision performance is impressive and bodes well for applications that are optimized for lower or mixed precision.
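The size of the double-precision gap can be read off the Max Flops results. The following Python sketch (the function and variable names are illustrative only) computes each GPU's double-to-single precision throughput ratio from the measured Max Flops values reported in this article.

```python
# Measured Max Flops (GFLOPS) from this article: (single, double).
max_flops = {
    "Tesla T4": (8073.26, 253.38),
    "Tesla V100": (14016.50, 7072.86),
    "Tesla P100": (9322.46, 4736.76),
}


def dp_to_sp_ratio(single, double):
    """Fraction of single-precision throughput retained in double precision."""
    return double / single


ratios = {gpu: dp_to_sp_ratio(sp, dp) for gpu, (sp, dp) in max_flops.items()}
for gpu, ratio in ratios.items():
    print(f"{gpu}: DP/SP = {ratio:.3f} (about 1/{round(1 / ratio)})")
```

On these measurements the T4 retains roughly 1/32 of its single-precision rate in double precision, while the V100 and P100 each retain about half.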

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.
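The ratios quoted in these bullets can be checked directly against the single-precision results. This short Python sketch (the names are illustrative) divides each T4 result by the corresponding V100 and P100 result.

```python
# Single-precision results (GFLOPS) reported in this article.
single_precision = {
    "Max Flops": {"T4": 8073.26, "V100": 14016.50, "P100": 9322.46},
    "Fast Fourier Transform": {"T4": 660.05, "V100": 2301.32, "P100": 1510.49},
    "Matrix Multiplication": {"T4": 3290.94, "V100": 13480.40, "P100": 8793.33},
    "Molecular Dynamics": {"T4": 572.91, "V100": 997.61, "P100": 480.02},
    "S3D": {"T4": 99.42, "V100": 434.78, "P100": 295.20},
}


def t4_relative_to(baseline, results):
    """Return {benchmark: T4 result divided by the baseline GPU's result}."""
    return {bench: vals["T4"] / vals[baseline] for bench, vals in results.items()}


vs_v100 = t4_relative_to("V100", single_precision)
vs_p100 = t4_relative_to("P100", single_precision)

for bench in single_precision:
    print(f"{bench:24s} T4/V100 = {vs_v100[bench]:6.1%}   T4/P100 = {vs_p100[bench]:6.1%}")
```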

Looking at these results, it is important to keep the context in mind. Tesla T4 consumes only ~25% of the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option at a reasonable price when compared to the larger NVIDIA Tesla GPUs.
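As a rough illustration of that context, the Python sketch below normalizes the measured single-precision matrix multiplication results by power and price. The wattage and price figures are the approximate values quoted in this article (70 W and ~$2.5k for the T4; ~$10k for the V100, with 250 W assumed from the lower end of the 250~300 W range), so the output is indicative only. The `Gpu` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    sgemm_gflops: float  # measured single-precision matrix multiplication
    watts: float         # approximate board power quoted in the article
    price_usd: float     # approximate price quoted in the article

    @property
    def gflops_per_watt(self):
        return self.sgemm_gflops / self.watts

    @property
    def gflops_per_dollar(self):
        return self.sgemm_gflops / self.price_usd


# Power and price are approximations, not measured values.
t4 = Gpu("Tesla T4", 3290.94, watts=70, price_usd=2500)
v100 = Gpu("Tesla V100", 13480.40, watts=250, price_usd=10000)

for gpu in (t4, v100):
    print(f"{gpu.name}: {gpu.gflops_per_watt:.1f} GFLOPS/W, "
          f"{gpu.gflops_per_dollar:.2f} GFLOPS/$")
```

Under these assumptions the two GPUs land within roughly 15% of each other on both metrics, which is what "on par" means here.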

Cost-Effective Machine Learning

The T4 offers substantial single- and mixed-precision performance aimed at machine learning, at a price significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. The single-precision performance available caters well to machine learning algorithms, many of which can also be run in mixed precision. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithm tuning, the T4 may be a strong contender for scientific applications that also want to use machine learning capabilities to analyze results, or to run a variety of algorithms from both machine learning and scientific computing on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, compared to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on roughly one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

If it appears the new Tesla T4 will accelerate your workload, but you'd like to benchmark it first, please sign up to Test Drive it for yourself. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on the best options, and see the project through to successful system/cluster deployment.

--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node9
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0: 4
Device 0: 'Tesla T4'
Device 1: 'Tesla T4'
Device 2: 'Tesla T4'
Device 3: 'Tesla T4'
Device selection not specified: defaulting to device #0.
Using size class: 4

--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 12.3585 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 13.2077 GB/sec
Running benchmark MaxFlops
result for maxspflops: 8073.2600 GFLOPS
result for maxdpflops: 253.3760 GFLOPS
Running benchmark DeviceMemory
result for gmem_readbw: 215.2640 GB/s
result for gmem_readbw_strided: 109.2370 GB/s
result for gmem_writebw: 201.0440 GB/s
result for gmem_writebw_strided: 29.2783 GB/s
result for lmem_readbw: 3435.8600 GB/s
result for lmem_writebw: 3704.9400 GB/s
result for tex_readbw: 884.0470 GB/sec
Skipping non-cuda benchmark KernelCompile
Skipping non-cuda benchmark QueueDelay
Running benchmark BFS
result for bfs: 6.3894 GB/s
result for bfs_pcie: 3.8521 GB/s
result for bfs_teps: 344078000.0000 Edges/s
Running benchmark FFT
result for fft_sp: 660.0520 GFLOPS
result for fft_sp_pcie: 62.5926 GFLOPS
result for ifft_sp: 657.7220 GFLOPS
result for ifft_sp_pcie: 62.6273 GFLOPS
result for fft_dp: 132.5970 GFLOPS
result for fft_dp_pcie: 27.4628 GFLOPS
result for ifft_dp: 125.4250 GFLOPS
result for ifft_dp_pcie: 27.1584 GFLOPS
Running benchmark GEMM
result for sgemm_n: 3290.9400 GFlops
result for sgemm_t: 3287.4400 GFlops
result for sgemm_n_pcie: 2377.5600 GFlops
result for sgemm_t_pcie: 2375.7400 GFlops
result for dgemm_n: 249.5690 GFlops
result for dgemm_t: 249.6800 GFlops
result for dgemm_n_pcie: 227.2710 GFlops
result for dgemm_t_pcie: 227.3630 GFlops
Running benchmark MD
result for md_sp_flops: 572.9100 GFLOPS
result for md_sp_bw: 439.0600 GB/s
result for md_sp_flops_pcie: 53.9088 GFLOPS
result for md_sp_bw_pcie: 41.3140 GB/s
result for md_dp_flops: 105.2590 GFLOPS
result for md_dp_bw: 141.2860 GB/s
result for md_dp_flops_pcie: 37.2010 GFLOPS
result for md_dp_bw_pcie: 49.9335 GB/s
Running benchmark MD5Hash
result for md5hash: 14.8551 GHash/s
Running benchmark NeuralNet
result for nn_learning: BenchmarkError
result for nn_learning_pcie: BenchmarkError
Running benchmark Reduction
result for reduction: 225.9420 GB/s
result for reduction_pcie: 11.6754 GB/s
result for reduction_dp: 257.2570 GB/s
result for reduction_dp_pcie: 11.7360 GB/s
Running benchmark Scan
result for scan: 81.0464 GB/s
result for scan_pcie: 5.8949 GB/s
result for scan_dp: 62.2882 GB/s
result for scan_dp_pcie: 5.7605 GB/s
Running benchmark Sort
result for sort: 6.3951 GB/s
result for sort_pcie: 3.1917 GB/s
Running benchmark Spmv
result for spmv_csr_scalar_sp: 19.3042 Gflop/s
result for spmv_csr_scalar_sp_pcie: 2.5486 Gflop/s
result for spmv_csr_scalar_dp: 11.9228 Gflop/s
result for spmv_csr_scalar_dp_pcie: 1.7080 Gflop/s
result for spmv_csr_scalar_pad_sp: 24.5346 Gflop/s
result for spmv_csr_scalar_pad_sp_pcie: 2.6437 Gflop/s
result for spmv_csr_scalar_pad_dp: 14.4112 Gflop/s
result for spmv_csr_scalar_pad_dp_pcie: 1.7501 Gflop/s
result for spmv_csr_vector_sp: 51.6801 Gflop/s
result for spmv_csr_vector_sp_pcie: 2.7829 Gflop/s
result for spmv_csr_vector_dp: 35.7128 Gflop/s
result for spmv_csr_vector_dp_pcie: 1.8895 Gflop/s
result for spmv_csr_vector_pad_sp: 55.1641 Gflop/s
result for spmv_csr_vector_pad_sp_pcie: 2.8127 Gflop/s
result for spmv_csr_vector_pad_dp: 37.4158 Gflop/s
result for spmv_csr_vector_pad_dp_pcie: 1.8914 Gflop/s
result for spmv_ellpackr_sp: 37.6080 Gflop/s
result for spmv_ellpackr_dp: 27.4393 Gflop/s
Running benchmark Stencil2D
result for stencil: 218.0090 GFLOPS
result for stencil_dp: 100.4440 GFLOPS
Running benchmark Triad
result for triad_bw: 16.2555 GB/s
Running benchmark S3D
result for s3d: 99.4160 GFLOPS
result for s3d_pcie: 86.6513 GFLOPS
result for s3d_dp: 56.9674 GFLOPS
result for s3d_dp_pcie: 48.7782 GFLOPS

--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node6
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0: 4
Device 0: 'Tesla V100-PCIE-32GB'
Device 1: 'Tesla V100-PCIE-32GB'
Device 2: 'Tesla V100-PCIE-32GB'
Device 3: 'Tesla V100-PCIE-32GB'
Specified 1 device IDs: 0
Using size class: 4

--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 12.3182 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 13.2066 GB/sec
Running benchmark MaxFlops
result for maxspflops: 14016.5000 GFLOPS
result for maxdpflops: 7072.8600 GFLOPS
Running benchmark DeviceMemory
result for gmem_readbw: 795.4980 GB/s
result for gmem_readbw_strided: 430.5780 GB/s
result for gmem_writebw: 710.4180 GB/s
result for gmem_writebw_strided: 54.3789 GB/s
result for lmem_readbw: 8535.5600 GB/s
result for lmem_writebw: 9191.3800 GB/s
result for tex_readbw: 1368.0900 GB/sec
Skipping non-cuda benchmark KernelCompile
Skipping non-cuda benchmark QueueDelay
Running benchmark BFS
result for bfs: 10.2526 GB/s
result for bfs_pcie: 4.9526 GB/s
result for bfs_teps: 489112000.0000 Edges/s
Running benchmark FFT
result for fft_sp: 2301.3200 GFLOPS
result for fft_sp_pcie: 66.9615 GFLOPS
result for ifft_sp: 2283.8400 GFLOPS
result for ifft_sp_pcie: 67.0689 GFLOPS
result for fft_dp: 1148.7500 GFLOPS
result for fft_dp_pcie: 33.4412 GFLOPS
result for ifft_dp: 1138.6500 GFLOPS
result for ifft_dp_pcie: 33.4938 GFLOPS
Running benchmark GEMM
result for sgemm_n: 13480.4000 GFlops
result for sgemm_t: 13685.9000 GFlops
result for sgemm_n_pcie: 5231.6300 GFlops
result for sgemm_t_pcie: 5262.3000 GFlops
result for dgemm_n: 5920.0100 GFlops
result for dgemm_t: 5606.4400 GFlops
result for dgemm_n_pcie: 1774.8200 GFlops
result for dgemm_t_pcie: 1745.5500 GFlops
Running benchmark MD
result for md_sp_flops: 997.6080 GFLOPS
result for md_sp_bw: 764.5360 GB/s
result for md_sp_flops_pcie: 55.5554 GFLOPS
result for md_sp_bw_pcie: 42.5760 GB/s
result for md_dp_flops: 908.6200 GFLOPS
result for md_dp_bw: 1219.6100 GB/s
result for md_dp_flops_pcie: 53.7409 GFLOPS
result for md_dp_bw_pcie: 72.1343 GB/s
Running benchmark MD5Hash
result for md5hash: 31.3448 GHash/s
Running benchmark NeuralNet
result for nn_learning: BenchmarkError
result for nn_learning_pcie: BenchmarkError
Running benchmark Reduction
result for reduction: 293.9380 GB/s
result for reduction_pcie: 11.7540 GB/s
result for reduction_dp: 506.6470 GB/s
result for reduction_dp_pcie: 11.9523 GB/s
Running benchmark Scan
result for scan: 182.4320 GB/s
result for scan_pcie: 6.1221 GB/s
result for scan_dp: 185.5270 GB/s
result for scan_dp_pcie: 6.1331 GB/s
Running benchmark Sort
result for sort: 19.9312 GB/s
result for sort_pcie: 4.8228 GB/s
Running benchmark Spmv
result for spmv_csr_scalar_sp: 65.9282 Gflop/s
result for spmv_csr_scalar_sp_pcie: 2.7467 Gflop/s
result for spmv_csr_scalar_dp: 46.7535 Gflop/s
result for spmv_csr_scalar_dp_pcie: 1.9000 Gflop/s
result for spmv_csr_scalar_pad_sp: 72.0344 Gflop/s
result for spmv_csr_scalar_pad_sp_pcie: 2.8377 Gflop/s
result for spmv_csr_scalar_pad_dp: 54.4875 Gflop/s
result for spmv_csr_scalar_pad_dp_pcie: 1.9227 Gflop/s
result for spmv_csr_vector_sp: 153.1620 Gflop/s
result for spmv_csr_vector_sp_pcie: 2.8131 Gflop/s
result for spmv_csr_vector_dp: 109.5760 Gflop/s
result for spmv_csr_vector_dp_pcie: 1.9441 Gflop/s
result for spmv_csr_vector_pad_sp: 156.8750 Gflop/s
result for spmv_csr_vector_pad_sp_pcie: 2.8987 Gflop/s
result for spmv_csr_vector_pad_dp: 115.0560 Gflop/s
result for spmv_csr_vector_pad_dp_pcie: 1.9587 Gflop/s
result for spmv_ellpackr_sp: 76.6566 Gflop/s
result for spmv_ellpackr_dp: 65.7927 Gflop/s
Running benchmark Stencil2D
result for stencil: 595.8100 GFLOPS
result for stencil_dp: 339.2710 GFLOPS
Running benchmark Triad
result for triad_bw: 16.4229 GB/s
Running benchmark S3D
result for s3d: 434.7830 GFLOPS
result for s3d_pcie: 263.8650 GFLOPS
result for s3d_dp: 227.8530 GFLOPS
result for s3d_dp_pcie: 136.3140 GFLOPS

--- Welcome To The SHOC Benchmark Suite version 1.1.5 ---
Hostname: node7
Platform selection not specified, default to platform #0
Number of available platforms: 1
Number of available devices on platform 0: 4
Device 0: 'Tesla P100-PCIE-16GB'
Device 1: 'Tesla P100-PCIE-16GB'
Device 2: 'Tesla P100-PCIE-16GB'
Device 3: 'Tesla P100-PCIE-16GB'
Specified 1 device IDs: 0
Using size class: 4

--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 12.3502 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 13.2060 GB/sec
Running benchmark MaxFlops
result for maxspflops: 9322.4600 GFLOPS
result for maxdpflops: 4736.7600 GFLOPS
Running benchmark DeviceMemory
result for gmem_readbw: 574.4540 GB/s
result for gmem_readbw_strided: 98.2470 GB/s
result for gmem_writebw: 432.2270 GB/s
result for gmem_writebw_strided: 25.2659 GB/s
result for lmem_readbw: 4203.2000 GB/s
result for lmem_writebw: 5259.1000 GB/s
result for tex_readbw: 587.9750 GB/sec
Skipping non-cuda benchmark KernelCompile
Skipping non-cuda benchmark QueueDelay
Running benchmark BFS
result for bfs: 3.6904 GB/s
result for bfs_pcie: 2.6656 GB/s
result for bfs_teps: 208754000.0000 Edges/s
Running benchmark FFT
result for fft_sp: 1510.4900 GFLOPS
result for fft_sp_pcie: 66.1778 GFLOPS
result for ifft_sp: 1502.4700 GFLOPS
result for ifft_sp_pcie: 66.2629 GFLOPS
result for fft_dp: 756.2940 GFLOPS
result for fft_dp_pcie: 33.0865 GFLOPS
result for ifft_dp: 752.3340 GFLOPS
result for ifft_dp_pcie: 33.1221 GFLOPS
Running benchmark GEMM
result for sgemm_n: 8793.3300 GFlops
result for sgemm_t: 8882.6100 GFlops
result for sgemm_n_pcie: 4343.6700 GFlops
result for sgemm_t_pcie: 4365.3400 GFlops
result for dgemm_n: 4256.0800 GFlops
result for dgemm_t: 4389.7700 GFlops
result for dgemm_n_pcie: 1589.1300 GFlops
result for dgemm_t_pcie: 1607.4100 GFlops
Running benchmark MD
result for md_sp_flops: 480.0150 GFLOPS
result for md_sp_bw: 367.8690 GB/s
result for md_sp_flops_pcie: 52.8129 GFLOPS
result for md_sp_bw_pcie: 40.4741 GB/s
result for md_dp_flops: 402.9640 GFLOPS
result for md_dp_bw: 540.8830 GB/s
result for md_dp_flops_pcie: 50.0934 GFLOPS
result for md_dp_bw_pcie: 67.2385 GB/s
Running benchmark MD5Hash
result for md5hash: 14.6630 GHash/s
Running benchmark NeuralNet
result for nn_learning: BenchmarkError
result for nn_learning_pcie: BenchmarkError
Running benchmark Reduction
result for reduction: 257.3830 GB/s
result for reduction_pcie: 11.7287 GB/s
result for reduction_dp: 424.4240 GB/s
result for reduction_dp_pcie: 11.9433 GB/s
Running benchmark Scan
result for scan: 110.2530 GB/s
result for scan_pcie: 6.0040 GB/s
result for scan_dp: 131.8250 GB/s
result for scan_dp_pcie: 6.0633 GB/s
Running benchmark Sort
result for sort: 10.4056 GB/s
result for sort_pcie: 3.9523 GB/s
Running benchmark Spmv
result for spmv_csr_scalar_sp: 17.0055 Gflop/s
result for spmv_csr_scalar_sp_pcie: 2.4774 Gflop/s
result for spmv_csr_scalar_dp: 13.7115 Gflop/s
result for spmv_csr_scalar_dp_pcie: 1.7301 Gflop/s
result for spmv_csr_scalar_pad_sp: 21.3641 Gflop/s
result for spmv_csr_scalar_pad_sp_pcie: 2.6089 Gflop/s
result for spmv_csr_scalar_pad_dp: 16.0769 Gflop/s
result for spmv_csr_scalar_pad_dp_pcie: 1.7779 Gflop/s
result for spmv_csr_vector_sp: 58.5214 Gflop/s
result for spmv_csr_vector_sp_pcie: 2.7625 Gflop/s
result for spmv_csr_vector_dp: 45.8722 Gflop/s
result for spmv_csr_vector_dp_pcie: 1.8983 Gflop/s
result for spmv_csr_vector_pad_sp: 63.1210 Gflop/s
result for spmv_csr_vector_pad_sp_pcie: 2.8367 Gflop/s
result for spmv_csr_vector_pad_dp: 49.2344 Gflop/s
result for spmv_csr_vector_pad_dp_pcie: 1.9114 Gflop/s
result for spmv_ellpackr_sp: 54.0921 Gflop/s
result for spmv_ellpackr_dp: 37.1737 Gflop/s
Running benchmark Stencil2D
result for stencil: 424.1380 GFLOPS
result for stencil_dp: 263.4790 GFLOPS
Running benchmark Triad
result for triad_bw: 16.2500 GB/s
Running benchmark S3D
result for s3d: 295.1980 GFLOPS
result for s3d_pcie: 205.1260 GFLOPS
result for s3d_dp: 161.5440 GFLOPS
result for s3d_dp_pcie: 109.4630 GFLOPS