
Distributed Kafka Avro Schema System via Registry

Maziz
5 min read · Sep 15, 2020

In some cases, you may choose to use Kafka + Avro for internal communication in your microservice architecture. One thing you have to consider is the distribution of the Avro source (the .avsc files) among the microservices. Imagine having a group of microservices producing some messages and another group consuming them. How would that go? Copy by script to each one of them, perhaps? :P

Approach Overview

We are going to look at two approaches to distributing the schema sources among the microservices.

Artifact repository

The idea is to use a separate module (a Maven sub-module or a Gradle multi-project build) in your project that contains all the Avro schemas for the producer. This module is published as an artifact to an artifact repository.

Other microservices then refer to this module as a project dependency. Resolving this dependency gives access to the schema sources, which can then be built into POJOs using the avro-maven-plugin, as sketched below.
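As a rough sketch of what that could look like (the artifact coordinates here are made up for illustration), the consumer depends on the schema artifact and unpacks the .avsc files before code generation:

```xml
<!-- Hypothetical schema artifact; the coordinates are made up. -->
<dependency>
  <groupId>org.maz</groupId>
  <artifactId>producer-schemas</artifactId>
  <version>1.0.0</version>
</dependency>

<!-- Unpack the .avsc files so the avro-maven-plugin can generate POJOs from them. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>unpack-schemas</id>
      <phase>initialize</phase>
      <goals>
        <goal>unpack</goal>
      </goals>
      <configuration>
        <artifactItems>
          <artifactItem>
            <groupId>org.maz</groupId>
            <artifactId>producer-schemas</artifactId>
            <version>1.0.0</version>
            <includes>**/*.avsc</includes>
            <outputDirectory>${project.build.directory}/avro</outputDirectory>
          </artifactItem>
        </artifactItems>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The avro-maven-plugin would then point its source directory at the unpacked folder during generate-sources.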

With this approach there is no schema validation available during message send / receive, since nothing checks the payloads against a central registry at run time.

Confluent Schema Registry

This is the approach that we will focus on in this article. For this approach we will be using the Schema Registry by Confluent.

Figure A: Flow of Build and Run time

The idea is to bind goals from Confluent's kafka-schema-registry-maven-plugin and the avro-maven-plugin, such as register (uploading the Avro sources to the registry), download (self-descriptive) and schema (the avro-maven-plugin goal that generates POJOs), to the Maven build phases.

We bind the goals so that register and download run before the POJOs are built from the .avsc files.

Because of this, the schema registry will be accessed by the build pipeline during build time and by the microservices’ Kafka clients during run time. Keep that in mind, especially when designing your environments and build pipeline ;).

Let’s dive deeper in the next section.

Confluent Schema Registry Maven Plugin

The schema source can be written as a flat schema (in-lined) or split into multiple files through schema references for re-usability. However, schema references are only supported from version 5.5.1 (the latest at the time of writing, August 2020).

Setup In Detail

The producer microservice contains the schema sources. Via Maven, we will be using the avro-maven-plugin and Confluent’s kafka-schema-registry-maven-plugin to register the schema.

We will have to bind the register goal to one of the build phases (for example the initialize phase, since we want to upload to the schema registry before building the POJOs). Then, using the avro-maven-plugin, we build the POJOs from the .avsc files.

You can see part of the Maven config for the producer below:
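The original gist is not reproduced here, so the following is a minimal sketch of that producer config; the registry URL, file paths and subject names are assumptions for illustration (the real config is in the repository linked at the end). Note that the split-schema subjects would additionally need the plugin's schema reference configuration, which is omitted here.

```xml
<!-- Sketch of the producer pom.xml build plugins; URL and paths are assumptions. -->
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>5.5.1</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://localhost:8081</param>
    </schemaRegistryUrls>
    <subjects>
      <!-- With RecordNameStrategy, the subject is the fully qualified record name. -->
      <org.maz.schema.FlatSampleData>src/main/avro/org.maz.schema.FlatSampleData.avsc</org.maz.schema.FlatSampleData>
    </subjects>
  </configuration>
  <executions>
    <execution>
      <!-- Upload the schema before the POJOs are generated. -->
      <phase>initialize</phase>
      <goals>
        <goal>register</goal>
      </goals>
    </execution>
  </executions>
</plugin>
<plugin>
  <!-- Generate the Java POJOs from the .avsc sources. -->
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.9.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```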

The consumer microservice uses Confluent’s kafka-schema-registry-maven-plugin to download the schema sources. Then, using the avro-maven-plugin, it builds the Java POJOs from the downloaded .avsc files. You can see this in steps 3 and 4 in Figure A and in the plugin configuration that follows:
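Again a sketch rather than the original gist, with the registry URL and subject patterns assumed for illustration:

```xml
<!-- Sketch of the consumer pom.xml; URL and patterns are assumptions. -->
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>5.5.1</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://localhost:8081</param>
    </schemaRegistryUrls>
    <!-- Download every subject matching these patterns into the Avro source folder. -->
    <outputDirectory>${project.basedir}/src/main/avro</outputDirectory>
    <subjectPatterns>
      <param>^org\.maz\.schema\.Flat.*</param>
      <param>^org\.maz\.schema\.Split.*</param>
    </subjectPatterns>
  </configuration>
  <executions>
    <execution>
      <!-- Fetch the schemas before the avro-maven-plugin generates POJOs from them. -->
      <phase>initialize</phase>
      <goals>
        <goal>download</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```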

As you can see in the subjects and subjectPatterns properties above, a Flat-prefixed name indicates the flat Avro schema example, while a Split-prefixed name indicates the schema reference (split schema) example.

Flat Schema Source

This is the straightforward implementation. You define complex data types in-line instead of separating them into their own files. No minimum version is required for the dependencies to work.

org.maz.schema.FlatSampleData.avsc
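The gist is not shown here, so the following is a sketch of what a flat schema looks like; the field names and the in-lined enum are illustrative assumptions, not the actual schema from the repository.

```json
{
  "namespace": "org.maz.schema",
  "type": "record",
  "name": "FlatSampleData",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "description", "type": ["null", "string"], "default": null},
    {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}}
  ]
}
```

Everything, including the enum, lives in one file, which is why no schema reference support is needed.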

Split Schema Source with Schema Reference

As an example, we divide the schema into two source files: org.maz.schema.SplitEnumerationSample.avsc and org.maz.schema.SplitSampleData.avsc.

The following is org.maz.schema.SplitEnumerationSample.avsc, our custom data type, which will be used in the main schema, org.maz.schema.SplitSampleData.avsc.

This is the file that you need to reference from your main schema.
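A sketch of what it might contain (the actual enum symbols are assumptions):

```json
{
  "namespace": "org.maz.schema",
  "type": "enum",
  "name": "SplitEnumerationSample",
  "symbols": ["ONE", "TWO", "THREE"]
}
```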

Next is the main schema.
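It references the enum by its fully qualified name instead of defining it in-line; the field names here are again illustrative assumptions:

```json
{
  "namespace": "org.maz.schema",
  "type": "record",
  "name": "SplitSampleData",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "enumeration", "type": "org.maz.schema.SplitEnumerationSample"}
  ]
}
```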

If you notice, the file names are in the form of the fully qualified record name (package plus record name). We are using RecordNameStrategy as the subject name strategy. This determines the subject under which a schema is registered. In addition, this strategy allows us to split the schema into multiple files.

The default configuration, TopicNameStrategy, would mean that the schema subject must be named from the topic name. Having the topic name “sample-data” requires the subject to be “sample-data” with either a key or value suffix (i.e. sample-data-value). This also means you can only have one value schema per topic, which rules out the split schema feature.

The third one is a mixture of the above, TopicRecordNameStrategy. You can get more detail at https://docs.confluent.io/current/schema-registry/serdes-develop/index.html#how-the-naming-strategies-work

The strategy is configured in both the producer and consumer client configuration properties.

Dependency versions

  1. Schema Registry 5.5.1
  2. Kafka Avro Serializer 5.5.1
  3. Avro at least version 1.9.2; the dependencies above require at least this version to work.
  4. Avro Maven Plugin at least version 1.9.2, for the same reason.

Configuration

  1. Set both the value serializer and deserializer to the io.confluent.kafka equivalents (KafkaAvroSerializer and KafkaAvroDeserializer). You can see this in the gist below.
  2. Add “auto.register.schemas” to the Kafka properties config with the value false. We don’t want any new schema to be registered automatically.
  3. Add “use.latest.version” to the Kafka properties with the value true. We want the serializer to always look up the latest schema when sending a message.
  4. Ensure that both the producer and the consumer have the property “value.subject.name.strategy” set to “io.confluent.kafka.serializers.subject.RecordNameStrategy”.

All this can be done easily in the application.yaml if you are using Spring Boot.

You can see below:
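The original gist is not embedded here, so the following is a sketch of the producer side of application.yaml under Spring for Apache Kafka; the broker and registry URLs are assumptions.

```yaml
# Sketch: broker and registry URLs are assumptions.
spring:
  kafka:
    bootstrap-servers: localhost:9092
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
    properties:
      schema.registry.url: http://localhost:8081
      auto.register.schemas: false   # don't register new schemas from the client
      use.latest.version: true       # always serialize with the latest registered schema
      value.subject.name.strategy: io.confluent.kafka.serializers.subject.RecordNameStrategy
```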

The “auto.register.schemas”, “value.subject.name.strategy” and “use.latest.version” properties should be given the same values in the consumer properties as well.
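A matching consumer sketch; specific.avro.reader is an extra assumed property so that the deserializer returns the generated POJOs rather than GenericRecord, and the group id is made up.

```yaml
# Sketch: mirrors the producer settings above.
spring:
  kafka:
    consumer:
      group-id: sample-consumer      # assumed group id
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
    properties:
      schema.registry.url: http://localhost:8081
      specific.avro.reader: true
      auto.register.schemas: false
      use.latest.version: true
      value.subject.name.strategy: io.confluent.kafka.serializers.subject.RecordNameStrategy
```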

Conclusion

Pros

  1. Allows the schema registry to handle the distribution of the schema sources instead of doing it manually (updates / the latest schema version are handled via the registry mechanism).
  2. At the same time, it provides validation during send / receive (unlike the artifact repository approach).

Cons

  1. Vendor lock-in with Confluent (though that would be totally fine if you are already using Confluent Cloud, the managed Kafka service).
  2. Setting up Kafka in any environment besides Confluent Cloud requires you to set up the schema registry as well, with at least a load balancer in front of two registry instances in production for availability’s sake.
  3. (If you are using Flink / anything old) Transitive dependency versions. Since you need 5.5.1, which in turn requires at least Avro 1.9, you need to ensure your transitive dependencies work with those versions across services. God forbid, for example, you want to use an AWS Kinesis Data Analytics Flink app (which supports Flink 1.8, which uses Avro 1.8 and an older Confluent Avro client); then you are left with no choice but to forgo the schema reference / split schema feature.

All in All

A pretty nifty way to ensure your messages are schematized and can be distributed among the services, with the addition of validation during send / receive.

At the same time, it comes with the cost and overhead of an additional schema registry component within the architecture.

You can get the full code and run it locally from https://github.com/MazizEsa/distributed-avro-schema-registry-example
