The current version of Avro is 1.12.0. The current version of libavro is 24:0:1.
This document was created 2024-07-25.
1. Introduction to Avro
Avro is a data serialization system.
Avro provides:
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
This document will focus on the C implementation of Avro. To learn more about Avro in general, visit the Avro website.
2. Introduction to Avro C
___ ______
/ |_ ___________ / ____/
/ /| | | / / ___/ __ \ / /
/ ___ | |/ / / / /_/ / / /___
/_/ |_|___/_/ \____/ \____/
A C program is like a fast dance on a newly waxed dance floor by people carrying razors.
— Waldi Ravens
The C implementation has been tested on MacOSX and Linux but, over
time, the number of supported OSes should grow. Please let us know if
you’re using Avro C on other systems.
Avro depends on the Jansson JSON parser, version 2.3 or higher. On many
operating systems this library is available through your package manager
(for example, apt-get install libjansson-dev on Ubuntu/Debian, and
brew install jansson on Mac OS). If not, please download and install
it from source.
The C implementation supports:
- binary encoding/decoding of all primitive and complex data types
- storage to an Avro Object Container File
- schema resolution, promotion and projection
- validating and non-validating mode for writing Avro data
The C implementation is lacking:
- RPC
To learn about the API, take a look at the examples and reference files later in this document.
We’re always looking for contributions so, if you’re a C hacker, please feel free to submit patches to the project.
3. Error reporting
Most functions in the Avro C library return a single int status code.
Following the POSIX errno.h convention, a status code of 0 indicates
success. Non-zero codes indicate an error condition. Some functions
return a pointer value instead of an int status code; for these
functions, a NULL pointer indicates an error.
You can retrieve a string description of the most recent error using the
avro_strerror function.
4. Avro values
Starting with version 1.6.0, the Avro C library has a new API for
handling Avro data. To help distinguish between the two APIs, we refer
to the old one as the legacy or datum API, and the new one as the
value API. (These names come from the names of the C types used to
represent Avro data in the corresponding API: avro_datum_t and
avro_value_t.) The legacy API is still present, but it’s deprecated;
you shouldn’t use the avro_datum_t type or the avro_datum_* functions
in new code.
One main benefit of the new value API is that you can treat any existing
C type as an Avro value; you just have to provide a custom
implementation of the value interface. In addition, we provide a
generic value implementation; “generic”, in this sense, meaning that
this single implementation works for instances of any Avro schema type.
Finally, we also provide a wrapper implementation for the deprecated
avro_datum_t type, which lets you gradually transition to the new
value API.
4.1. Avro value interface
You interact with Avro values using the value interface, which defines
methods for setting and retrieving the contents of an Avro value. An
individual value is represented by an instance of the avro_value_t type.
This section provides an overview of the methods that you can call on an
avro_value_t instance. There are quite a few methods in the value
interface, but not all of them make sense for all Avro schema types.
For instance, you won’t be able to call avro_value_set_boolean on an
Avro array value. If you try to call an inappropriate method, we’ll
return an EINVAL/AVRO_INVALID error code.
Note that the functions in this section apply to all Avro values, regardless of which value implementation is used under the covers. This section doesn’t describe how to create value instances, since those constructors will be specific to a particular value implementation.
4.1.1. Common methods
There are a handful of methods that can be used with any value, regardless of which Avro schema it’s an instance of:
The get_type and get_schema methods can be used to get information
about what kind of Avro value a given avro_value_t instance
represents. (For get_schema, you borrow the value’s reference to the
schema; if you need to save it and ensure that it outlives the value,
you need to call avro_schema_incref on it.)
The equal and equal_fast methods compare two values for equality.
The two values do not have to have the same value implementations, but
they do have to be instances of the same schema. (Not equivalent
schemas; the same schema.) The equal method checks that the schemas
match; the equal_fast method assumes that they do.
The copy and copy_fast methods copy the contents of one Avro value
into another. (Where possible, this is done without copying the actual
content of a bytes, string, or fixed value, using the
avro_wrapped_buffer_t functions described in the next section.) Like
equal, the two values must have the same schema; copy checks this,
while copy_fast assumes it.
The hash method returns a hash value for the given Avro value. This
can be used to construct hash tables that use Avro values as keys. The
function works correctly even with maps; it produces a hash that doesn’t
depend on the ordering of the elements of the map. Hash values are only
meaningful for comparing values of exactly the same schema. Hash values
are not guaranteed to be consistent across different platforms, or
different versions of the Avro library. That means that it’s really
only safe to use these hash values internally within the context of a
single execution of a single application.
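To illustrate how a map hash can be independent of element order, here is a minimal sketch that hashes each key/value pair on its own and combines the per-entry hashes with addition, which is commutative. This is an assumption made for illustration, not the library's actual algorithm; entry_t, hash_bytes, and map_hash are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical map entry: a string key and a 64-bit value. */
typedef struct {
    const char *key;
    int64_t value;
} entry_t;

/* FNV-1a over a byte range, continuing from the running hash h. */
static uint32_t hash_bytes(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Sums the per-entry hashes, so the result ignores entry order. */
uint32_t map_hash(const entry_t *entries, size_t count)
{
    assert(entries != NULL || count == 0);

    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t h = 2166136261u;
        h = hash_bytes(h, entries[i].key, strlen(entries[i].key));
        h = hash_bytes(h, &entries[i].value, sizeof entries[i].value);
        total += h;   /* unsigned wrap-around is well defined */
    }
    return total;
}
```

Because the value bytes are hashed in native byte order, this sketch also shows why such hashes should not be compared across platforms.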
The reset method “clears out” an avro_value_t instance, making sure
that it’s ready to accept the contents of a new value. For scalars,
this is usually a no-op, since the new value will just overwrite the old
one. For arrays and maps, this removes any existing elements from the
container, so that we can append the elements of the new value. For
records and unions, this just recursively resets the fields or current
branch.
4.1.2. Scalar values
The simplest case is handling instances of the scalar Avro schema types.
In Avro, the scalars are all of the primitive schema types, as well as
enum and fixed; i.e., anything that can’t contain another Avro
value. Note that we use standard C99 types to represent the primitive
contents of an Avro scalar.
To retrieve the contents of an Avro scalar, you can use one of the avro_value_get_* getter methods (avro_value_get_boolean, avro_value_get_long, avro_value_get_string, and so on).
For the most part, these should be self-explanatory. For bytes,
string, and fixed values, the pointer to the underlying content is
const; you aren’t allowed to modify the contents directly. We
guarantee that the content of a string will be NUL-terminated, so you
can use it as a C string as you’d expect. The size returned for a
string object will include the NUL terminator; it will be one more
than you’d get from calling strlen on the content.
Also, for bytes, string, and fixed, the dest and size parameters are
optional; if you only want to determine the length of a bytes value,
you can pass NULL for dest and supply only size.
To set the contents of an Avro scalar, you can use one of the avro_value_set_* setter methods (avro_value_set_boolean, avro_value_set_long, avro_value_set_string, and so on).
These are also straightforward. For bytes, string, and fixed
values, the set methods will make a copy of the underlying data. For
string values, the content must be NUL-terminated. You can use
set_string_len if you already know the length of the string content;
the length you pass in should include the NUL terminator. If you call
set_string, then we’ll use strlen to calculate the length.
For fixed values, the size must match what’s expected by the value’s
underlying fixed schema; if the sizes don’t match, you’ll get an error
code.
If you don’t want to copy the contents of a bytes, string, or
fixed value, you can use the giver and grabber functions
(avro_value_give_* and avro_value_grab_*).
The give functions give control of an existing buffer to the value.
(You should not try to free the src wrapped buffer after calling
this method.) The grab function fills in a wrapped buffer with a
pointer to the contents of an Avro value. (You should free the dest
wrapped buffer when you’re done with it.)
The avro_wrapped_buffer_t struct encapsulates the location and size of
the existing buffer. It also includes several methods. The free
method will be called when the content of the buffer is no longer
needed. The slice method will be called when the wrapped buffer needs
to be updated to point at a subset of what it pointed at before. (This
doesn’t create a new wrapped buffer; it updates an existing one.) The
copy method will be called if the content needs to be copied. Note
that if you’re wrapping a buffer with nice reference counting features,
you don’t need to perform an actual copy; you just need to ensure that
the free function can be called on both the original and the copy, and
not have things blow up.
The “generic” value implementation takes advantage of this feature; if
you pass in a wrapped buffer with a give method, and then retrieve it
later with a grab method, then we’ll use the wrapped buffer’s copy
method to fill in the dest parameter. If your wrapped buffer
implements a slice method that updates reference counts instead of
actually copying, then you’ve got nice zero-copy access to the contents
of an Avro value.
4.1.3. Compound values
The following sections describe the getter and setter methods for
handling compound Avro values. All of the compound values are
responsible for the storage of their children; this means that there
isn’t a method, for instance, that lets you add an existing
avro_value_t to an array. Instead, there’s a method that creates a
new, empty avro_value_t of the appropriate type, adds it to the array,
and returns it for you to fill in as needed.
You also shouldn’t try to free the child elements that are created this
way; the container value is responsible for their life cycle. The child
element is guaranteed to be valid for as long as the container value
is. You’ll usually define an avro_value_t on the stack, and let it
fall out of scope when you’re done with it.
4.1.4. Arrays
There are three methods that can be used with array values:
The get_size method returns the number of elements currently in the
array. The get_by_index method fills in element to point at the
array element with the given index. (You should use NULL for the
unused parameter; it’s ignored for array values.)
The append method creates a new value, appends it to the array, and
returns it in element. If new_index is given, then it will be
filled in with the index of the new element.
4.1.5. Maps
There are four methods that can be used with map values:
The get_size method returns the number of elements currently in the
map. Map elements can be retrieved either by their key (get_by_name)
or by their numeric index (get_by_index). (Numeric indices in a map
are based on the order that the elements were added to the map.) In
either case, the method takes in an optional output parameter that lets
you retrieve the index associated with a key, and vice versa.
The add method will add a new value to the map, if the given key isn’t
already present. If the key is present, then the existing value will be
returned. The index parameter, if given, will be filled in with the
element’s index. The is_new parameter, if given, can be used to
determine whether the mapped value is new or not.
4.1.6. Records
There are three methods that can be used with record values:
The get_size method returns the number of fields in the record. (You
can also get this by querying the value’s schema, but for some
implementations, this method can be faster.)
The get_by_index and get_by_name functions can be used to retrieve
one of the fields in the record, either by its ordinal position within
the record, or by the name of the underlying field. As with maps, the
methods take in an additional parameter that lets you retrieve the index
associated with a field name, and vice versa.
When possible, it’s recommended that you access record fields by their numeric index, rather than by their field name. For most implementations, this will be more efficient.
4.1.7. Unions
There are three methods that can be used with union values:
The get_discriminant and get_current_branch methods return the
current state of the union value, without modifying which branch is
currently selected. The set_branch method can be used to choose the
active branch, filling in the branch value to point at the branch’s
value instance. (Most implementations will be smart enough to detect
when the desired branch is already selected, so you should always call
this method unless you can guarantee that the right branch is already
current.)
4.2. Creating value instances
Okay, so we’ve described how to interact with a value that you already
have a pointer to, but how do you create one in the first place? Each
implementation of the value interface must provide its own functions for
creating avro_value_t instances for that class. The 10,000-foot view
is to:
- Get an implementation struct for the value implementation that you want to use. (This is represented by an avro_value_iface_t pointer.)
- Use the implementation’s constructor function to allocate instances of that value implementation.
- Do whatever you need to the value (using the avro_value_t methods described in the previous section).
- Free the value instance, if necessary, using the implementation’s destructor function.
- Free the implementation struct when you’re done creating value instances.
These steps use the following functions:
Note that for most value implementations, it’s fine to reuse a single
avro_value_t instance for multiple values, using the
avro_value_reset function before filling in the instance for each
value. (This helps reduce the number of malloc and free calls that
your application will make.)
We provide a “generic” value implementation that will work (efficiently) for any Avro schema.
For most applications, you won’t need to write your own value implementation; the Avro C library provides an efficient “generic” implementation, which supports the full range of Avro schema types. There’s a good chance that you just want to use this implementation, rather than rolling your own. (The primary reason for rolling your own would be if you want to access the elements of a compound value using C syntax — for instance, translating an Avro record into a C struct.) You can use the following functions to create and work with a generic value implementation for a particular schema:
Combining all of this together, you might have the following snippet of code:
5. Reference Counting
Avro C does reference counting for all schema and data objects.
When the number of references drops to zero, the memory is freed.
For example, to create and free a string, you would use:
avro_datum_t string = avro_string("This is my string");
...
avro_datum_decref(string);
Things get a little more complicated when you consider more elaborate schema and data structures.
For example, let’s say that you create a record with a single string field:
avro_datum_t example = avro_record("Example");
avro_datum_t solo_field = avro_string("Example field value");
avro_record_set(example, "solo", solo_field);
...
avro_datum_decref(example);
In this example, the solo_field datum would not be freed since it
has two references: the original reference and a reference inside
the Example record. The avro_datum_decref(example) call drops
the number of references to one. If you are finished with the solo_field
datum, then you need to call avro_datum_decref(solo_field) to
completely dereference the solo_field datum and free it.
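The counting behaviour described above can be sketched without the library. The code below is a simplified model, not Avro C's implementation: obj_t, obj_new, obj_incref, obj_decref, obj_set_child, and live_objects are all hypothetical names. A container takes its own reference to a child, so the child survives until both the container and the original owner have released it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object that can hold one child. */
typedef struct obj {
    int refcount;
    struct obj *child;
} obj_t;

int live_objects = 0;   /* number of objects not yet freed */

obj_t *obj_new(void)
{
    obj_t *o = calloc(1, sizeof *o);
    if (o != NULL) {
        o->refcount = 1;    /* the creator holds the first reference */
        live_objects++;
    }
    return o;
}

void obj_incref(obj_t *o)
{
    assert(o != NULL);
    o->refcount++;
}

void obj_decref(obj_t *o)
{
    assert(o != NULL && o->refcount > 0);
    if (--o->refcount == 0) {
        if (o->child != NULL) {
            obj_decref(o->child);   /* drop the container's reference */
        }
        free(o);
        live_objects--;
    }
}

/* The container takes its own reference to the child. */
void obj_set_child(obj_t *parent, obj_t *child)
{
    assert(parent != NULL && child != NULL);
    obj_incref(child);
    if (parent->child != NULL) {
        obj_decref(parent->child);
    }
    parent->child = child;
}
```

In this model, releasing the container alone leaves the child with one reference, which mirrors the solo_field case.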
6. Wrap It and Give It
You’ll notice that some datatypes can be "wrapped" and "given". This allows C programmers the freedom to decide who is responsible for the memory. Let’s take strings for example.
To create a string datum, you have three different methods:
avro_datum_t avro_string(const char *str);
avro_datum_t avro_wrapstring(const char *str);
avro_datum_t avro_givestring(const char *str);
If you use avro_string, then Avro C will make a copy of your
string and free it when the datum is dereferenced. In some cases,
especially when dealing with large amounts of data, you want
to avoid this memory copy. That’s where avro_wrapstring and
avro_givestring can help.
If you use avro_wrapstring, then Avro C will do no memory
management at all. It will just save a pointer to your data and
it’s your responsibility to free the string.
Warning: When using avro_wrapstring, do not free the string
before you dereference the string datum with avro_datum_decref().
Lastly, if you use avro_givestring, then Avro C will free the
string later when the datum is dereferenced. In a sense, you
are "giving" responsibility for freeing the string to Avro C.
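Warning: Don’t "give" Avro C a string that was not allocated on the
heap, such as a string literal or a stack buffer; Avro C will try to
free it when the datum is dereferenced.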
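The three ownership choices can be modelled in plain C. The sketch below is an illustration under assumed names (str_holder_t, holder_copy, holder_wrap, holder_give, holder_release are all hypothetical) and is not how Avro C stores strings internally; it only shows who ends up responsible for calling free in each case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string holder that records whether it must free its text. */
typedef struct {
    const char *text;
    int owns_text;
} str_holder_t;

/* Copy: the holder duplicates the text and frees its own copy later. */
int holder_copy(str_holder_t *h, const char *s)
{
    assert(h != NULL && s != NULL);
    size_t n = strlen(s) + 1;
    char *dup = malloc(n);
    if (dup == NULL) {
        return -1;
    }
    memcpy(dup, s, n);
    h->text = dup;
    h->owns_text = 1;
    return 0;
}

/* Wrap: the holder only borrows the pointer; the caller frees it. */
void holder_wrap(str_holder_t *h, const char *s)
{
    assert(h != NULL);
    h->text = s;
    h->owns_text = 0;
}

/* Give: the holder takes over a heap pointer and frees it later. */
void holder_give(str_holder_t *h, char *heap_s)
{
    assert(h != NULL);
    h->text = heap_s;
    h->owns_text = 1;
}

/* Release: frees the text only if the holder owns it. */
void holder_release(str_holder_t *h)
{
    assert(h != NULL);
    if (h->owns_text) {
        free((void *)h->text);
    }
    h->text = NULL;
    h->owns_text = 0;
}
```

Passing a string literal to holder_give would make holder_release call free on memory that never came from malloc, which is the mistake the warning describes.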
7. Schema Validation
If you want to write a datum, you would use the following function
If you pass in a writers_schema, then your datum will be validated
before it is sent to the writer. This check ensures that your data has
the correct format. If you are certain your datum is correct, you can
pass a NULL value for writers_schema and Avro C will not validate
before writing.
Note: Data written to an Avro File Object Container is always validated.
8. Examples
I’m not even supposed to be here today!
Imagine you’re a free-lance hacker in Leonardo, New Jersey and you’ve been approached by the owner of the local Quick Stop Convenience store. He wants you to create a contact database in case he needs to call employees to work on their day off.
You might build a simple contact system using Avro C like the following…
When you compile and run this program, you should get the following output
Successfully added Hicks, Dante id=1
Successfully added Graves, Randal id=2
Successfully added Loughran, Veronica id=3
Successfully added Bree, Caitlin id=4
Successfully added Silent, Bob id=5
Successfully added ???, Jay id=6
Avro is compact. Here is the data for all 6 people.
| 02 0A 44 61 6E 74 65 0A | 48 69 63 6B 73 1C 28 35 | ..Dante.Hicks.(5
| 35 35 29 20 31 32 33 2D | 34 35 36 37 40 04 0C 52 | 55) 123-4567@..R
| 61 6E 64 61 6C 0C 47 72 | 61 76 65 73 1C 28 35 35 | andal.Graves.(55
| 35 29 20 31 32 33 2D 35 | 36 37 38 3C 06 10 56 65 | 5) 123-5678<..Ve
| 72 6F 6E 69 63 61 10 4C | 6F 75 67 68 72 61 6E 1C | ronica.Loughran.
| 28 35 35 35 29 20 31 32 | 33 2D 30 39 38 37 38 08 | (555) 123-09878.
| 0E 43 61 69 74 6C 69 6E | 08 42 72 65 65 1C 28 35 | .Caitlin.Bree.(5
| 35 35 29 20 31 32 33 2D | 32 33 32 33 36 0A 06 42 | 55) 123-23236..B
| 6F 62 0C 53 69 6C 65 6E | 74 1C 28 35 35 35 29 20 | ob.Silent.(555)
| 31 32 33 2D 36 34 32 32 | 3A 0C 06 4A 61 79 06 3F | 123-6422:..Jay.?
| 3F 3F 1C 28 35 35 35 29 | 20 31 32 33 2D 39 31 38 | ??.(555) 123-918
| 32 34 .. .. .. .. .. .. | .. .. .. .. .. .. .. .. | 24..............
Now let's read all the records back out
1 | Dante | Hicks | (555) 123-4567 | 32
2 | Randal | Graves | (555) 123-5678 | 30
3 | Veronica | Loughran | (555) 123-0987 | 28
4 | Caitlin | Bree | (555) 123-2323 | 27
5 | Bob | Silent | (555) 123-6422 | 29
6 | Jay | ??? | (555) 123-9182 | 26
Use projection to print only the First name and phone numbers
Dante | (555) 123-4567 |
Randal | (555) 123-5678 |
Veronica | (555) 123-0987 |
Caitlin | (555) 123-2323 |
Bob | (555) 123-6422 |
Jay | (555) 123-9182 |
The Quick Stop owner was so pleased, he asked you to create a movie database for his RST Video store.
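The hex dump in the output above can be read by hand once you know that the Avro specification encodes int and long values as zig-zag variable-length integers, and a string as a long length followed by that many bytes. As a minimal sketch (decode_zigzag_long is a hypothetical helper, not part of avro.h), the function below decodes one such integer. Applied to the start of the dump, 02 decodes to 1, 0A decodes to 5 (the length of "Dante"), and the 40 that follows the first phone number decodes to 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one zig-zag variable-length long from buf.
 * Stores the value in *out and returns the number of bytes consumed,
 * or 0 if the input is truncated or longer than 10 bytes. */
size_t decode_zigzag_long(const uint8_t *buf, size_t len, int64_t *out)
{
    assert(buf != NULL && out != NULL);

    uint64_t raw = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        /* Each byte carries 7 payload bits, least significant group first. */
        raw |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            /* Undo zig-zag: even raw values are non-negative, odd are negative. */
            *out = (int64_t)(raw >> 1) ^ -(int64_t)(raw & 1);
            return i + 1;
        }
    }
    return 0;
}
```

The return value tells the caller how far to advance before reading the next field, which is how a reader steps through a record's fields in order.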
9. Reference files
9.1. avro.h
The avro.h header file contains the complete public API
for Avro C. The documentation is rather sparse right now
but we’ll be adding more information soon.
9.2. test_avro_data.c
Another good way to learn how to encode/decode data in Avro C is
to look at the test_avro_data.c unit test. This simple unit test
checks that all the Avro types can be encoded/decoded correctly.