Skip to main content

Designing Key/Value Repository API with Java Optional

I spent some time last month by defining our repository API. Repository is commonly component used by service layer in your application to persist the data. In the time of polyglot persistence, we use this repository design discussed in this article to persist business domain model - designed according to (our experience with) domain driven design.

Lessons Learned

We have large experience since we used nhibernate as a persistent framework in earlier product version. First, and naive, idea consist in allowing the programmers to write queries to the database on his own. Unfortunately the idea failed soon. This scenario heavily relied on a belief that every programmer knows how persistence/database work and s/he wants to write those queries effectively. It inevitably inflicted error-prone and inefficient queries. Essentially, nobody was responsible for the repositories because everyone contributed to them. Persistence components was just a framework.

The whole experience implies to design very strong and highly review-able API designed by technology-aware engineers. Usually with strong commitment to all dependent layers.

Technical Implications Affects the API

The API must obviously reflect functional requirements. They are what we want the repository to do. According to our experience, such API must also reflect technical and implementation implications. Basically, the design without knowing if the implementation will use SQL database or NoSQL Key/Value store or what are boundaries of domain aggregates will result to not efficient implementation.

To provide more realistic example, lets talk about address, consider it as an aggregate. The repository usually provides CRUD methods for the address. But what if there is a functional requirement to return only address' street? Should the API contain such method, e.g. get street by address id?

It depends on technical implementation:
  1. What is typical maximal size of serialized address, e.g. resulting json? Does it fit to one tcp packet traveling through network or does it fit to one read operation from hard drive on the storage node? I mean: does even make any sense to fetch partial entity contrary to full entity?
  2. How often is the street read and/or write? Read/write ratio.
    1. Is it better to duplicate the data - to store street separately and within the full json - as it's often read?
    2. Is it better to store the whole address together because of often updates outnumbering the reading?
Let say you will ignore these questions and provide all methods required from user points of view. You just allow to fetch street and address in two different methods. Let say there is also functional requirement to fetch zip code from the address. Developers who are not familiar with repository internals will typically use the method to fetch street followed by the fetch of zip code on the next line. That's because it's natural thinking: to compose methods on API. However, this is obviously inefficient because of two remote calls to the storage.

If you answer similar questions you can easily make the decision that the only reasonable implementation is to provide getAddress only - to return the whole address aggregate. All developers now have no other chance that to use this method and use address as a whole.

You just define the repository API in most efficient way, you just tell developers how to use underlying persistence.

Implemenation

Once we know what kind of methods to place on repository API, there are some implementation constraints it worth to mention.

Repository is not a Map

... so do not try to express CRUD methods like some remote (hash)map

Every programmer, and probably man himself, starves for patterns and solves problems according to his or her past/current experience. CRUD using key/value store sounds like an application of a map. This idea almost implies the repository interface can probably reflect map interface for both method arguments and returning types.

However, there are certain circumstances, you need keep in mind.

1. Error States

In-memory map just CRUD's or not. In case of GET, there is a key or not. Repository on the other hand does remote calls using unreliable network to unreliable (set of) node(s). Therefore there is a broad range of potential issues you can meet.

2. Degraded Access

Look at Maps' DELETE. The remove method returns an entity being removed. Well, in case of map, it's just fine. On the other hand, it seems like overhead in case of repository considering slow network access. I'm not saying anything about stuff like consensus or QUORUM evaluation. It's not cheap. I've also doubts whether someone would use this returning value. He just needs to remove an entity via identifier.

Excluding simple in-memory implementations, the repository methods usually perform one or more remote calls. Contrary to local in-memory calls, those remotes use slow network system under the hood. What is the implication? Considering GET method, there are other states than a key does exist/not-exist. Or, returning current value in the case of REMOVE a key can take a time.

Optimistic Locking

Basically, every our entity contains long version used for CAS-like operation - optimistic locking. Contention thus happens on storage system itself. It's up to the system or up to the query how and what to do. Especially in distributed system this is kind of problem you do not want to solve.

Most of NoSQL storages use light approach usually called compare-and-set. Redis itself supports non-blocking transactions via MULTI, EXEC and WATCH primitives. Cassandra uses different approach built in query language support.

Java Optional<T> as Returning Type

We have eventually decided to use use java's Optional<T> so our API does not return any null. However, there is one exception in method with tree-state resulting type. Here is nice discussion on stackoverflow regarding where to use and where do not to use this syntax.

However, the implementation later approved this idea as a good approach. The point here is that everyone who use a method with Optional returning type is much more aware of null state, or Optional.Empty for record. I found out during the refactoring that 40% of code which used previous repository version (in memory) did not handle null as valid returning type.

Generic Repository API Interface

We eventually ended up with following API.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
/**
 * @throws timeout exception
 * @throws generic persistence exception
 */
interface Repository<T> {

    /**
     * @throws if entity already exists
     */
    void add(T entity);
    /**
     * @return {@link java.util.Optional} when found, {@link java.util.Optional#EMPTY} otherwise
     */
    Optional<T> get(Id id);

    /**
     * @return {@link java.util.Optional} when found and persisted version is different given <code>version</code>, 
     * {@link java.util.Optional#EMPTY} otherwise if there is no persisted entity for given <code>version</code>,
     * <code>null</code> when found and persisted <code>version</code> is same as provided one
     */
    Optional<T> getIfNotMatched(Id id, long version);

    boolean exist(Id id);

    /**
     * Persist given <code>entity</code> and increment it's version when succeeded
     * @throws stale entity exception when given entity' version is different than persisted one
     * @throws if entity not exist
     */
    void update(T entity);

    /**
     * Deletes the whole entity hierarchy including all children
     * @throws stale entity exception when given entity' version is different than persisted one
     * @throws if entity not exist
     */
    void delete(Id id, long version);
}

Comments

Popular posts from this blog

NHibernate performance issues #3: slow inserts (stateless session)

The whole series of NHibernate performance issues isn't about simple use-cases. If you develop small app, such as simple website, you don't need to care about performance. But if you design and develop huge application and once you have decided to use NHibernate you'll solve various sort of issue. For today the use-case is obvious: how to insert many entities into the database as fast as possible?

Why I'm taking about previous stuff? The are a lot of articles how the original NHibernate's purpose isn't to support batch operations, like inserts. Once you have decided to NHibernate, you have to solve this issue.

Slow insertion
The basic way how to insert mapped entity into database is:
SessionFactory.GetCurrentSession().Save(object);But what happen when I try to insert many entities? Lets say, I want to persist
1000 librarieseach library has 100 books = 100k of bookseach book has 5 rentals - there are 500k of rentals It's really slow! The insertion took exactly

Java, Docker, Spring boot ... and signals

I spend last couple weeks working on java apps running within docker containers deployed on clustered CoreOS machines. It's pretty simple to run java app within a docker container. You just have to choose a base image for your app and write a docker file.

Note that docker registry contains many java distributions usually based on open jdk. We use our internal image for Oracle's Java 8, build on top of something like this docker file. Once you make a decision whether oracle or openjdk, you can start to write your own docker file.

FROM dockerfile/java:oracle-java8
ADD your.jar /opt/your-app
ADD /dependencies /opt/your-app/dependency
WORKDIR /opt/your-app
CMD ["java -jar /opt/your-app/your.jar"]
However, your app would probably require some parameters. Therefore, last line usually calls your shell script. Such script than validates number and format of those parameters among other things. This is also useful during the development phase because none of us want to build …

Git on Windows: MSysGit

I have started to use Git today. I read a lot of discussions that there is no good tool for Windows platform. After forethought I have decided to used TortoiseGit. I also feared of difficult work related with Git as a lot of articles mentioned many instructions. As I already said, I have decided to use TortoiseGit, because I'm used to work with TortoiseSvn, but for start, MSysGit is enought. So this article is about MSysGit, next will be about TortoiseGit.

How to start with MSysgit on local machine?
Download and install Git for WindowsCreate source code directory for your git appRight click the directory at your favorite file browser. Menu should contain item "Git init here". It initializes chosen directory to be git-abled :-)It was your first usage of Git.

Commit data to local Git repository

Now, you can add any file, your first source code, to created directory. If you are prepared to commit any changes to your local git repository, follow next instructions.
Right-click th…