Device Interface Reference Documentation

Version Queries

pycuda.VERSION

Gives the numeric version of PyCUDA as a variable-length tuple of integers. Enables easy version checks such as VERSION >= (0, 93).

New in version 0.93.
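
For instance, a feature guard might look like this (a minimal sketch):

    import pycuda

    if pycuda.VERSION >= (0, 93):
        # tuple comparison also handles longer versions such as (0, 94, 1)
        print("found PyCUDA %s" % pycuda.VERSION_TEXT)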

pycuda.VERSION_STATUS

A text string such as “rc4” or “beta” qualifying the status of the release.

New in version 0.93.

pycuda.VERSION_TEXT

The full release name (such as “0.93rc4”) in string form.

New in version 0.93.

Error Reporting

exception pycuda.driver.Error
Base class of all PyCUDA errors.
exception pycuda.driver.CompileError

Thrown when SourceModule compilation fails.

msg

New in version 0.94.

stdout

New in version 0.94.

stderr

New in version 0.94.

command_line

New in version 0.94.
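
A sketch of inspecting these attributes when compilation fails; it assumes they carry, respectively, the error message, the compiler's standard output and standard error, and the nvcc command line that was issued:

    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    try:
        SourceModule("__global__ void broken(float *a) { this is not C }")
    except drv.CompileError as err:
        print("command line:", err.command_line)
        print("compiler stderr:", err.stderr)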

exception pycuda.driver.MemoryError
Thrown when mem_alloc() or related functionality fails.
exception pycuda.driver.LogicError

Thrown when PyCUDA is confronted with a situation that most likely indicates a programmer mistake. LogicErrors do not depend on outer circumstances defined by the run-time environment.

Example: CUDA was used before it was initialized.

exception pycuda.driver.LaunchError
Thrown when kernel invocation has failed. (Note that this will often be reported by the next call after the actual kernel invocation.)
exception pycuda.driver.RuntimeError

Thrown when an unforeseen run-time failure is encountered that is not likely due to programmer error.

Example: A file was not found.

Constants

class pycuda.driver.ctx_flags

Flags for Device.make_context(). CUDA 2.0 and above only.

SCHED_AUTO
If there are more contexts than processors, yield; otherwise, spin while waiting for CUDA calls to complete.
SCHED_SPIN
Spin while waiting for CUDA calls to complete.
SCHED_YIELD
Yield to other threads while waiting for CUDA calls to complete.
SCHED_MASK
Mask of valid scheduling flags in this bitfield.
BLOCKING_SYNC
Use blocking synchronization. CUDA 2.2 and newer.
MAP_HOST
Support mapped pinned allocations. CUDA 2.2 and newer.
FLAGS_MASK
Mask of valid flags in this bitfield.
class pycuda.driver.event_flags

Flags for Event. CUDA 2.2 and newer.

DEFAULT
BLOCKING_SYNC
class pycuda.driver.device_attribute
MAX_THREADS_PER_BLOCK
MAX_BLOCK_DIM_X
MAX_BLOCK_DIM_Y
MAX_BLOCK_DIM_Z
MAX_GRID_DIM_X
MAX_GRID_DIM_Y
MAX_GRID_DIM_Z
TOTAL_CONSTANT_MEMORY
WARP_SIZE
MAX_PITCH
CLOCK_RATE
TEXTURE_ALIGNMENT
GPU_OVERLAP
MULTIPROCESSOR_COUNT
CUDA 2.0 and above only.
SHARED_MEMORY_PER_BLOCK
Deprecated as of CUDA 2.0. See below for replacement.
MAX_SHARED_MEMORY_PER_BLOCK
CUDA 2.0 and above only.
REGISTERS_PER_BLOCK
Deprecated as of CUDA 2.0. See below for replacement.
MAX_REGISTERS_PER_BLOCK
CUDA 2.0 and above.
KERNEL_EXEC_TIMEOUT
CUDA 2.2 and above.
INTEGRATED
CUDA 2.2 and above.
CAN_MAP_HOST_MEMORY
CUDA 2.2 and above.
COMPUTE_MODE
CUDA 2.2 and above. See compute_mode.
MAXIMUM_TEXTURE1D_WIDTH
MAXIMUM_TEXTURE2D_WIDTH
MAXIMUM_TEXTURE2D_HEIGHT
MAXIMUM_TEXTURE3D_WIDTH
MAXIMUM_TEXTURE3D_HEIGHT
MAXIMUM_TEXTURE3D_DEPTH
MAXIMUM_TEXTURE2D_ARRAY_WIDTH
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES

CUDA 3.0 and above

New in version 0.94.

SURFACE_ALIGNMENT

CUDA 3.0 (post-beta) and above

New in version 0.94.

CONCURRENT_KERNELS

CUDA 3.0 (post-beta) and above

New in version 0.94.

ECC_ENABLED

CUDA 3.0 (post-beta) and above

New in version 0.94.

class pycuda.driver.function_attribute

Flags for Function.get_attribute(). CUDA 2.2 and newer.

MAX_THREADS_PER_BLOCK
SHARED_SIZE_BYTES
CONST_SIZE_BYTES
LOCAL_SIZE_BYTES
NUM_REGS
PTX_VERSION

CUDA 3.0 (post-beta) and above

New in version 0.94.

BINARY_VERSION

CUDA 3.0 (post-beta) and above

New in version 0.94.

MAX
class pycuda.driver.func_cache

See Function.set_cache_config(). CUDA 3.0 (post-beta) and above

New in version 0.94.

PREFER_NONE
PREFER_SHARED
PREFER_L1
class pycuda.driver.array_format
UNSIGNED_INT8
UNSIGNED_INT16
UNSIGNED_INT32
SIGNED_INT8
SIGNED_INT16
SIGNED_INT32
HALF
FLOAT
class pycuda.driver.array3d_flags
class pycuda.driver.address_mode
WRAP
CLAMP
MIRROR
class pycuda.driver.filter_mode
POINT
LINEAR
class pycuda.driver.memory_type
HOST
DEVICE
ARRAY
class pycuda.driver.compute_mode

CUDA 2.2 and newer.

DEFAULT
EXCLUSIVE
PROHIBITED
class pycuda.driver.jit_option

CUDA 2.1 and newer.

MAX_REGISTERS
THREADS_PER_BLOCK
WALL_TIME
INFO_LOG_BUFFER
INFO_LOG_BUFFER_SIZE_BYTES
ERROR_LOG_BUFFER
ERROR_LOG_BUFFER_SIZE_BYTES
OPTIMIZATION_LEVEL
TARGET_FROM_CUCONTEXT
TARGET
FALLBACK_STRATEGY
class pycuda.driver.jit_target

CUDA 2.1 and newer.

COMPUTE_10
COMPUTE_11
COMPUTE_12
COMPUTE_13
COMPUTE_20

CUDA 3.0 and above

New in version 0.94.

class pycuda.driver.jit_fallback

CUDA 2.1 and newer.

PREFER_PTX
PREFER_BINARY
class pycuda.driver.host_alloc_flags

Flags to be used to allocate Pagelocked Host Memory.

PORTABLE
DEVICEMAP
WRITECOMBINED

Devices and Contexts

pycuda.driver.get_version()
Obtain the version of CUDA against which PyCUDA was compiled. Returns a 3-tuple of integers as (major, minor, revision).
pycuda.driver.get_driver_version()
Obtain the version of the CUDA driver on top of which PyCUDA is running. Returns an integer version number.
pycuda.driver.init(flags=0)

Initialize CUDA.

Warning

This must be called before any other function in this module.

See also pycuda.autoinit.

class pycuda.driver.Device(number)

A handle to the number’th CUDA device. See also pycuda.autoinit.

static count()
Return the number of CUDA devices found.
name()
Return the name of this CUDA device.
compute_capability()
Return a 2-tuple indicating the compute capability version of this device.
total_memory()
Return the total amount of memory on the device in bytes.
get_attribute(attr)

Return the (numeric) value of the attribute attr, which may be one of the device_attribute values.

All device_attribute values may also be directly read as (lower-case) attributes on the Device object itself, e.g. dev.clock_rate.

get_attributes()
Return all device attributes in a dict, with keys from device_attribute.
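
For example, the launch limits of the current device can be queried in several equivalent ways (a sketch, assuming pycuda.autoinit has created a context):

    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv

    dev = pycuda.autoinit.device
    # explicit attribute query
    print(dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK))
    # the same value as a lower-case attribute on the Device object
    print(dev.max_threads_per_block)
    # or all attributes at once, keyed by device_attribute values
    attrs = dev.get_attributes()
    print(attrs[drv.device_attribute.WARP_SIZE])
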
make_context(flags=ctx_flags.SCHED_AUTO)

Create a Context on this device, with flags taken from the ctx_flags values.

Also make the newly-created context the current context.

__hash__()
__eq__()
__ne__()
class pycuda.driver.Context

An equivalent of a UNIX process on the compute device. Create instances of this class using Device.make_context(). See also pycuda.autoinit.

detach()
Decrease the reference count on this context. If the reference count hits zero, the context is deleted.
push()
Make self the active context, pushing it on top of the context stack. CUDA 2.0 and above only.
static pop()
Remove any context from the top of the context stack, deactivating it. CUDA 2.0 and above only.
static get_device()
Return the device that the current context is working on.
static synchronize()
Wait for all activity in the current context to cease, then return.
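
Without pycuda.autoinit, creating and tearing down a context by hand might look like this (a minimal sketch):

    import pycuda.driver as drv

    drv.init()                   # must precede any other call in this module
    dev = drv.Device(0)
    ctx = dev.make_context()     # becomes the current context
    try:
        pass                     # ... allocate memory, launch kernels ...
    finally:
        ctx.pop()                # deactivate the context
        # the context is destroyed once its reference count drops to zero,
        # e.g. via ctx.detach() or when ctx becomes unreachable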

Concurrency and Streams

class pycuda.driver.Stream(flags=0)

A handle for a queue of operations that will be carried out in order.

synchronize()
Wait for all activity on this stream to cease, then return.
is_done()
Return True iff all queued operations have completed.
class pycuda.driver.Event(flags=0)

An event is a temporal ‘marker’ in a Stream that allows taking the time between two events, such as the time required to execute a kernel. An event’s time is recorded when the Stream has finished all tasks enqueued before the record() call.

See event_flags for values for the flags parameter.

record(stream=None)
Insert a recording point for self into the Stream stream. Return self.
synchronize()
Wait until the device execution stream reaches this event. Return self.
query()
Return True if the device execution stream has reached this event.
time_since(event)
Return the time in milliseconds that has passed between self and event. Use this method as end.time_since(start). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.
time_till(event)
Return the time in milliseconds that has passed between event and self. Use this method as start.time_till(end). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.
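
A common use is timing device work between two events (a sketch; the kernel launch is left out):

    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv

    start = drv.Event()
    end = drv.Event()

    start.record()               # marker before the work
    # ... launch kernels here ...
    end.record()                 # marker after the work

    end.synchronize()            # make sure the end event has been reached
    print("elapsed: %.3f ms" % end.time_since(start))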

Memory

Global Device Memory

pycuda.driver.mem_get_info()
Return a tuple (free, total) indicating the free and total memory in the current context, in bytes.
pycuda.driver.mem_alloc(bytes)
Return a DeviceAllocation object representing a linear piece of device memory.
pycuda.driver.to_device(buffer)
Allocate enough device memory for buffer, which adheres to the Python buffer interface. Copy the contents of buffer onto the device. Return a DeviceAllocation object representing the newly-allocated memory.
pycuda.driver.from_device(devptr, shape, dtype, order="C")
Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them using shape, dtype and order.
pycuda.driver.from_device_like(devptr, other_ary)
Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them as having the same shape, dtype and order as other_ary.
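
A round trip through device memory using these helpers might look as follows (a minimal sketch):

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv

    a = numpy.arange(16, dtype=numpy.float32)
    dev_a = drv.to_device(a)                      # allocate and copy
    b = drv.from_device(dev_a, a.shape, a.dtype)  # copy back into a new array
    c = drv.from_device_like(dev_a, a)            # same, shape/dtype taken from a
    assert (a == b).all() and (a == c).all()
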
pycuda.driver.mem_alloc_pitch(width, height, access_size)

Allocates a linear piece of device memory at least width bytes wide and height rows high that can be accessed using a data type of size access_size in a coalesced fashion.

Returns a tuple (dev_alloc, actual_pitch) giving a DeviceAllocation and the actual width of each row in bytes.

class pycuda.driver.DeviceAllocation

An object representing an allocation of linear device memory. Once this object is deleted, its associated device memory is freed.

Objects of this type can be cast to int to obtain a linear index into this Context’s memory.

free()
Release the held device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.

Pagelocked Host Memory

pycuda.driver.pagelocked_empty(shape, dtype, order="C", mem_flags=0)

Allocate a pagelocked numpy.ndarray of shape, dtype and order.

mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.

For the meaning of the other parameters, please refer to the numpy documentation.

pycuda.driver.pagelocked_zeros(shape, dtype, order="C", mem_flags=0)

Allocate a pagelocked numpy.ndarray of shape, dtype and order that is zero-initialized.

mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.

For the meaning of the other parameters, please refer to the numpy documentation.

pycuda.driver.pagelocked_empty_like(array, mem_flags=0)

Allocate a pagelocked numpy.ndarray with the same shape, dtype and order as array.

mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.

pycuda.driver.pagelocked_zeros_like(array, mem_flags=0)

Allocate a pagelocked numpy.ndarray with the same shape, dtype and order as array. Initialize it to 0.

mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.

The numpy.ndarray instances returned by these functions have an attribute base that references an object of the following type:

class pycuda.driver.HostAllocation

An object representing an allocation of pagelocked host memory. Once this object is deleted, its associated pagelocked host memory is freed.

free()
Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.
get_device_pointer()

Return a device pointer that indicates the address at which this memory is mapped into the device’s address space.

Only available on CUDA 2.2 and newer.
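
A sketch of mapped pagelocked memory, assuming CUDA 2.2 or newer and a context created with ctx_flags.MAP_HOST:

    import numpy
    import pycuda.driver as drv

    drv.init()
    dev = drv.Device(0)
    ctx = dev.make_context(drv.ctx_flags.MAP_HOST)

    a = drv.pagelocked_empty((1024,), numpy.float32,
                             mem_flags=drv.host_alloc_flags.DEVICEMAP)
    # the HostAllocation lives in a.base; its device pointer can be passed
    # to kernels as if it were device memory
    dev_ptr = a.base.get_device_pointer()

    ctx.pop()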

Arrays and Textures

class pycuda.driver.ArrayDescriptor
width
height
format
A value of type array_format.
num_channels
class pycuda.driver.ArrayDescriptor3D
width
height
depth
format
A value of type array_format. CUDA 2.0 and above only.
num_channels
class pycuda.driver.Array(descriptor)

A 2D or 3D memory block that can only be accessed via texture references.

descriptor can be of type ArrayDescriptor or ArrayDescriptor3D.

free()
Release the array and its device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.
get_descriptor()
Return an ArrayDescriptor object for this 2D array, like the one that was used to create it.
get_descriptor_3d()
Return an ArrayDescriptor3D object for this 3D array, like the one that was used to create it. CUDA 2.0 and above only.
class pycuda.driver.TextureReference

A handle to a binding of either linear memory or an Array to a texture unit.

set_array(array)

Bind self to the Array array.

As long as array remains bound to this texture reference, it will not be freed; the texture reference keeps a reference to the array.

set_address(devptr, bytes, allow_offset=False)

Bind self to a chunk of linear memory starting at the integer address devptr, encompassing a number of bytes. Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in bytes. If allow_offset is False, a nonzero value of this offset will cause an exception to be raised.

Unlike for Array objects, no life support is provided for linear memory bound to texture references.

set_address_2d(devptr, descr, pitch)
Bind self as a 2-dimensional texture to a chunk of global memory at devptr. The line-to-line offset in bytes is given by pitch. Width, height and format are given in the ArrayDescriptor descr. set_format() need not and should not be called in addition to this method.
set_format(fmt, num_components)
Set the texture to have array_format fmt and to have num_components channels.
set_address_mode(dim, am)
Set the address mode of dimension dim to am, which must be one of the address_mode values.
set_flags(flags)
Set the flags to a combination of the TRSF_XXX values.
get_array()
Get back the Array to which self is bound.
get_address_mode(dim)
get_filter_mode()
get_format()

Return a tuple (fmt, num_components), where fmt is of type array_format, and num_components is the number of channels in this texture.

(CUDA 2.0 and above only.)

get_flags()
pycuda.driver.TRSA_OVERRIDE_FORMAT
pycuda.driver.TRSF_READ_AS_INTEGER
pycuda.driver.TRSF_NORMALIZED_COORDINATES
pycuda.driver.TR_DEFAULT
pycuda.driver.matrix_to_array(matrix, order)
Turn the two-dimensional numpy.ndarray object matrix into an Array. The order argument can be either “C” or “F”. If it is “C”, then tex2D(x,y) is going to fetch matrix[y,x], and vice versa for “F”.
pycuda.driver.make_multichannel_2d_array(matrix, order)

Turn the three-dimensional numpy.ndarray object matrix into a 2D Array with multiple channels.

Depending on order, the matrix’s shape is interpreted as

  • height, width, num_channels for order == “C”,
  • num_channels, width, height for order == “F”.

Note

This function assumes that matrix has been created with the memory order order. If that is not the case, the copied data will likely not be what you expect.
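
Binding a matrix to a texture reference might be sketched as follows (the kernel is only illustrative; any module declaring a matching texture would do):

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    texture<float, 2> tex;
    __global__ void copy_tex(float *dest)
    { dest[threadIdx.x] = tex2D(tex, threadIdx.x, threadIdx.y); }
    """)

    matrix = numpy.random.randn(4, 5).astype(numpy.float32)
    ary = drv.matrix_to_array(matrix, "C")   # tex2D(x, y) fetches matrix[y, x]

    texref = mod.get_texref("tex")
    texref.set_array(ary)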

Initializing Device Memory

pycuda.driver.memset_d8(dest, data, count)
pycuda.driver.memset_d16(dest, data, count)
pycuda.driver.memset_d32(dest, data, count)

Note

count is the number of elements, not bytes.

pycuda.driver.memset_d2d8(dest, pitch, data, width, height)
pycuda.driver.memset_d2d16(dest, pitch, data, width, height)
pycuda.driver.memset_d2d32(dest, pitch, data, width, height)
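
For example, zeroing a freshly allocated buffer of 1024 floats (a sketch):

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv

    nbytes = 1024 * numpy.float32().nbytes
    buf = drv.mem_alloc(nbytes)
    drv.memset_d32(buf, 0, nbytes // 4)   # count is in 32-bit elements, not bytes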

Unstructured Memory Transfers

pycuda.driver.memcpy_htod(dest, src)
Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation). The size of the copy is determined by the size of the buffer.
pycuda.driver.memcpy_htod_async(dest, src, stream=None)

Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation) asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.

New in version 0.93.

pycuda.driver.memcpy_dtoh(dest, src)

Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest. The size of the copy is determined by the size of the buffer.

For asynchronous operation, use memcpy_dtoh_async() below; in that case, dest must be page-locked.

pycuda.driver.memcpy_dtoh_async(dest, src, stream=None)
Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.
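
Asynchronous copies require page-locked host memory; a sketch using a Stream:

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv

    stream = drv.Stream()

    a = drv.pagelocked_empty((1024,), numpy.float32)
    a[:] = numpy.random.randn(1024)
    result = drv.pagelocked_empty_like(a)

    dev_a = drv.mem_alloc(a.nbytes)
    drv.memcpy_htod_async(dev_a, a, stream)
    # ... enqueue kernel launches on the same stream here ...
    drv.memcpy_dtoh_async(result, dev_a, stream)
    stream.synchronize()          # wait until both copies have finished
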
pycuda.driver.memcpy_dtod(dest, src, size)
pycuda.driver.memcpy_dtod_async(dest, src, size, stream=None)

CUDA 3.0 and above

New in version 0.94.

pycuda.driver.memcpy_dtoa(ary, index, src, len)
pycuda.driver.memcpy_atod(dest, ary, index, len)
pycuda.driver.memcpy_htoa(ary, index, src)
pycuda.driver.memcpy_atoh(dest, ary, index)
pycuda.driver.memcpy_atoa(dest, dest_index, src, src_index, len)

Structured Memory Transfers

class pycuda.driver.Memcpy2D
src_x_in_bytes
X offset of the origin of the copy. (initialized to 0)
src_y
Y offset of the origin of the copy. (initialized to 0)
src_pitch
Size of a row in bytes at the origin of the copy.
set_src_host(buffer)
Set the buffer, which must be a Python object adhering to the buffer interface, to be the origin of the copy.
set_src_array(array)
Set the Array array to be the origin of the copy.
set_src_device(devptr)
Set the device address devptr (an int or a DeviceAllocation) as the origin of the copy.
dst_x_in_bytes
X offset of the destination of the copy. (initialized to 0)
dst_y
Y offset of the destination of the copy. (initialized to 0)
dst_pitch
Size of a row in bytes at the destination of the copy.
set_dst_host(buffer)
Set the buffer, which must be a Python object adhering to the buffer interface, to be the destination of the copy.
set_dst_array(array)
Set the Array array to be the destination of the copy.
set_dst_device(devptr)
Set the device address devptr (an int or a DeviceAllocation) as the destination of the copy.
width_in_bytes
Number of bytes to copy for each row in the transfer.
height
Number of rows to copy.
__call__([aligned=True])
Perform the specified memory copy, waiting for it to finish. If aligned is False, tolerate device-side misalignment for device-to-device copies that may lead to loss of copy bandwidth.
__call__(stream)
Perform the memory copy asynchronously, serialized via the Stream stream. Any host memory involved in the transfer must be page-locked.
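
Copying a 2D numpy array into pitched device memory allocated with mem_alloc_pitch() might be sketched as:

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv

    a = numpy.random.randn(64, 100).astype(numpy.float32)
    dev_buf, dev_pitch = drv.mem_alloc_pitch(
        a.shape[1] * a.itemsize,   # width in bytes
        a.shape[0],                # height in rows
        a.itemsize)                # access size

    copy = drv.Memcpy2D()
    copy.set_src_host(a)
    copy.src_pitch = a.strides[0]
    copy.set_dst_device(dev_buf)
    copy.dst_pitch = dev_pitch
    copy.width_in_bytes = a.shape[1] * a.itemsize
    copy.height = a.shape[0]
    copy(aligned=True)             # synchronous; pass a Stream for async operation
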
class pycuda.driver.Memcpy3D

Memcpy3D has the same members as Memcpy2D, and additionally all of the following:

src_height
Ignored when source is an Array. May be 0 if Depth==1.
src_z
Z offset of the origin of the copy. (initialized to 0)
dst_height
Ignored when destination is an Array. May be 0 if Depth==1.
dst_z
Z offset of the destination of the copy. (initialized to 0)
depth

Memcpy3D is supported on CUDA 2.0 and above only.

Code on the Device: Modules and Functions

class pycuda.driver.Module

Handle to a CUBIN module loaded onto the device. Can be created with module_from_file() and module_from_buffer().

get_function(name)

Return the Function named name in this module.

Warning

While you can obtain different handles to the same function using this method, these handles all share the same state that is set through the set_XXX methods of Function. This means that you can’t obtain two different handles to the same function and Function.prepare() them in two different ways.

get_global(name)

Return a tuple (device_ptr, size_in_bytes) giving the device address and size of the global name.

The main use of this method is to find the address of pre-declared __constant__ arrays so they can be filled from the host before kernel invocation.
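
For example, a __constant__ array might be filled like this (a sketch; the kernel body is only illustrative):

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __constant__ float coeffs[16];
    __global__ void use_coeffs(float *dest)
    { dest[threadIdx.x] = coeffs[threadIdx.x]; }
    """)

    const_ptr, size_in_bytes = mod.get_global("coeffs")
    drv.memcpy_htod(const_ptr, numpy.arange(16, dtype=numpy.float32))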

get_texref(name)
Return the TextureReference named name from this module.
pycuda.driver.module_from_file(filename)
Create a Module by loading the CUBIN file filename.
pycuda.driver.module_from_buffer(buffer, options=[], message_handler=None)

Create a Module by loading a PTX or CUBIN module from buffer, which must support the Python buffer interface. (For example, str and numpy.ndarray do.)

Parameters:
  • options – A list of tuples (jit_option, value).
  • message_handler – A callable that is called with arguments (compile_success_bool, info_str, error_str), allowing the user to process error and warning messages from the PTX compiler.

Loading PTX modules as well as non-default values of options and message_handler are only allowed on CUDA 2.1 and newer.

class pycuda.driver.Function

Handle to a __global__ function in a Module. Create using Module.get_function().

__call__(arg1, ..., argn, block=block_size[, grid=(1, 1)[, stream=None[, shared=0[, texrefs=[][, time_kernel=False]]]]])

Launch self, with a thread block size of block. block must be a 3-tuple of integers.

arg1 through argn are the positional C arguments to the kernel. See param_set() for details. See especially the warnings there.

grid specifies, as a 2-tuple, the number of thread blocks to launch, as a two-dimensional grid. stream, if specified, is a Stream instance serializing the copying of input arguments (if any), execution, and the copying of output arguments (again, if any). shared gives the number of bytes available to the kernel in extern __shared__ arrays. texrefs is a list of TextureReference instances that the function will have access to.

The function returns either None or the number of seconds spent executing the kernel, depending on whether time_kernel is True.

This is a convenience interface that can be used instead of the param_*() and launch_*() methods below. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().
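
Taken together with SourceModule (see Just-in-time Compilation below), a complete launch through this convenience interface might look like this (a minimal sketch):

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_them(float *a)
    { a[threadIdx.x] *= 2; }
    """)
    double_them = mod.get_function("double_them")

    a = numpy.random.randn(256).astype(numpy.float32)
    a_gpu = drv.mem_alloc(a.nbytes)
    drv.memcpy_htod(a_gpu, a)

    double_them(a_gpu, block=(256, 1, 1), grid=(1, 1))

    result = numpy.empty_like(a)
    drv.memcpy_dtoh(result, a_gpu)
    assert numpy.allclose(result, 2 * a)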

param_set(arg1, ... argn)

Set up arg1 through argn as positional C arguments to self. They are allowed to be of the following types:

  • Subclasses of numpy.number. These are sized number types such as numpy.uint32 or numpy.float32.
  • DeviceAllocation instances, which will become a device pointer to the allocated memory.
  • Instances of ArgumentHandler subclasses. These can be used to automatically transfer numpy arrays onto and off of the device.
  • Objects supporting the Python buffer interface. These chunks of bytes will be copied into the parameter space verbatim.
  • GPUArray instances.

Warning

You cannot pass values of Python’s native int or float types to param_set. Since there is no unambiguous way to guess the size of these integers or floats, param_set raises a TypeError.

Note

This method has to guess the types of the arguments passed to it, which can make it somewhat slow. For a kernel that is invoked often, this can be inconvenient. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().

set_block_shape(x, y, z)
Set the thread block shape for this function.
set_shared_size(bytes)
Set shared to be the number of bytes available to the kernel in extern __shared__ arrays.
param_set_size(bytes)
Size the parameter space to bytes.
param_seti(offset, value)
Set the integer at offset in the parameter space to value.
param_setf(offset, value)
Set the float at offset in the parameter space to value.
param_set_texref(texref)
Make the TextureReference texref available to the function.
launch()
Launch a single thread block of self.
launch_grid(width, height)
Launch a width*height grid of thread blocks of self.
launch_grid_async(width, height, stream)
Launch a width*height grid of thread blocks of self, sequenced by the Stream stream.
prepare(arg_types, block, shared=None, texrefs=[])

Prepare the invocation of this function by

  • setting up the argument types as arg_types. arg_types is expected to be an iterable containing type characters understood by the struct module or numpy.dtype objects.

    (In addition, PyCUDA understands ‘F’ and ‘D’ for single- and double-precision floating-point numbers.)

  • setting the thread block shape for this function to block.

  • registering the texture references texrefs for use with this function. The TextureReference objects in texrefs will be retained, and whatever these references are bound to at invocation time will be available through the corresponding texture references within the kernel.

Return self.

prepared_call(grid, *args)
Invoke self using launch_grid(), with args and a grid size of grid. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.
prepared_timed_call(grid, *args)

Invoke self using launch_grid(), with args and a grid size of grid. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Return a 0-ary callable that can be used to query the GPU time consumed by the call, in seconds. Once called, this callable will block until completion of the invocation.

prepared_async_call(grid, stream, *args)
Invoke self using launch_grid_async(), with args and a grid size of grid, serialized into the pycuda.driver.Stream stream. If stream is None, do the same as prepared_call(). Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.
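
The prepared-invocation path for a simple one-pointer kernel might be sketched as:

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_them(float *a) { a[threadIdx.x] *= 2; }
    """)
    func = mod.get_function("double_them")

    a_gpu = drv.to_device(numpy.ones(256, dtype=numpy.float32))

    func.prepare("P", block=(256, 1, 1))   # a single device-pointer argument
    func.prepared_call((1, 1), a_gpu)      # grid size (1, 1)
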
get_attribute(attr)

Return one of the attributes given by the function_attribute value attr.

All function_attribute values may also be directly read as (lower-case) attributes on the Function object itself, e.g. func.num_regs.

CUDA 2.2 and newer.

New in version 0.93.

set_cache_config(fc)

CUDA 3.0 (post-beta) and newer.

New in version 0.94.

local_size_bytes

The number of bytes of local memory used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a SourceModule. It replaces the now-deprecated attribute lmem.

shared_size_bytes

The number of bytes of shared memory used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a SourceModule. It replaces the now-deprecated attribute smem.

num_regs

The number of 32-bit registers used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a SourceModule. It replaces the now-deprecated attribute registers.

class pycuda.driver.ArgumentHandler(array)
class pycuda.driver.In(array)
Inherits from ArgumentHandler. Indicates that buffer array should be copied to the compute device before invoking the kernel.
class pycuda.driver.Out(array)
Inherits from ArgumentHandler. Indicates that buffer array should be copied off the compute device after invoking the kernel.
class pycuda.driver.InOut(array)
Inherits from ArgumentHandler. Indicates that buffer array should be copied both onto the compute device before invoking the kernel, and off it afterwards.
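
Used with the convenience __call__ interface, these handlers remove the explicit copy steps (a sketch):

    import numpy
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_them(float *a) { a[threadIdx.x] *= 2; }
    """)
    func = mod.get_function("double_them")

    a = numpy.ones(256, dtype=numpy.float32)
    # InOut copies a to the device before the launch and back afterwards
    func(drv.InOut(a), block=(256, 1, 1))
    assert (a == 2).all()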

Just-in-time Compilation

class pycuda.compiler.SourceModule(source, nvcc="nvcc", options=[], keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None)

Create a Module from the CUDA source code source. The Nvidia compiler nvcc is assumed to be on the PATH if no path to it is specified, and is invoked with options to compile the code. If keep is True, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.

Unless no_extern_c is True, the given source code is wrapped in extern “C” { ... } to prevent C++ name mangling.

arch and code specify the values to be passed for the -arch and -code options on the nvcc command line. If arch is None, it defaults to the current context’s device’s compute capability. If code is None, it will not be specified.

cache_dir gives the directory used for compiler caching. It has a sensible per-user default. If it is set to False, caching is disabled.

This class exhibits the same public interface as Module, but does not inherit from it.

Change note: SourceModule was moved from pycuda.driver to pycuda.compiler in version 0.93.

pycuda.compiler.compile(source, nvcc="nvcc", options=[], keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])
Perform the same compilation as the corresponding SourceModule constructor, but only return the resulting cubin file as a string. In particular, do not upload the code to the GPU.