
Takipi Announces $4.5M Series A Funding Led by Menlo Ventures


Written by Venky Ganesan, our new partner. @venkyganesan

We’re super excited to join the Menlo Ventures portfolio!

Growing up, my biggest fear was the monsters under my bed.  While I could not see them and never heard them, I always feared that they would come out while I was sleeping and hurt me.  For a long time, I slept with my cricket bat so that I could be ready for action if I needed to be.  The adult equivalent of monsters for developers is bugs in their production code.  Today, we are excited to announce Menlo’s lead investment in the Series A of Takipi, which has developed an unconventional way to crush these bugs while code is in production.

Increasingly, applications on the web interact with other applications using APIs and leverage 3rd party web services for much of their functionality (Google Maps, for example).  When production code breaks, developers fly blind.  They do not know if the problem is on their end, in one of the APIs they are using, or even which build caused it.  Production code today is a black box.  Conventional modes of debugging production code are very painful – they involve rolling back the changes and then trying to recreate the bugs in a dev/test environment.  Unfortunately, many times the bugs cannot be recreated.  Installing a debugger in production is not an option, since conventional wisdom holds that it would create significant performance and latency issues and unacceptably slow down the application.

Takipi’s new Chairman of the Board, Avery More

The amazing technical team at Takipi (Tal, Iris, Dor, Chen, and Niv) will suit up to go to battle against only the toughest monsters (aka, the problems Takipi helps debug in production). Takipi’s cloud-based architecture allows you to debug production code without performance impact. You can find out more at Takipi (www.takipi.com). Don’t take my word for it – download and play with the code yourself and prepare to be amazed.

At Menlo, we believe strongly in backing missionaries not mercenaries.  We want people who worship at the altar of product, who will paint the back of the fence even if no one will ever look at it.  The Takipi founders are missionaries, and we feel very honored to be their partners on this journey.

Our main investing thesis at Menlo is the Right Now Economy – the convergence of mega-trends in mobile, social, cloud and big data that allow us to make decisions in real time.  Information can be accessed, analyzed and acted upon instantaneously, and Takipi is one of the companies enabling this revolution.  Developers need to know what’s going on in their production code RIGHT NOW.  Takipi does that.  So monsters, I am not afraid of you anymore. No more cricket bat needed – just Takipi.

Takipi’s founding team: Dor Levi, Chen Harel, Iris Shoor, Niv Steingarten, Tal Weiss


The Logging Olympics – A Race Between Today’s Top 5 Java Logging Frameworks


Developers: Takipi tells you when new code breaks in production –  Learn more


Log4J vs. SLF4J Simple vs. Logback vs. Java Util Logging vs. Log4J2

Logging is an age-old and intrinsic part of virtually every server-side application. It’s the primary method by which applications output live state in a persistent and readable manner. Some applications may only log a few megabytes a day, while others may log gigabytes of data or more in a matter of hours.

As logging usually involves IO to write data to disk (either blocking or async), it comes at a cost. When logging large amounts of data over short periods of time, that cost can ramp up quickly. We decided to take a deeper look at the speed of some of today’s leading logging engines.

Most developers log data for three main reasons –

1. Monitoring – to see how code behaves in terms of throughput, scale, security, etc.

2. Debugging – to get access to the state that caused code to fail (variables, stack traces…). Takipi helps developers debug staging and production servers, and understand why code crashes and threads freeze.

3. Analytics – leverage live data from the app in order to derive conclusions about the way it’s being used.

Behind the facade. Most libraries today have logging built in at key points in the code to provide visibility into their operations. To streamline this process and prevent different libraries from employing multiple logging methods in the same JVM, logging facades, which decouple code from the underlying engine, have come to the forefront. When we analyzed the top 100 software libraries for Java, SLF4J came up as the leading logging facade used by developers today.
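To make the decoupling concrete, here’s a minimal sketch of code written against the SLF4J facade (the class and method names are made up for illustration). The code only depends on the slf4j-api jar; the actual engine – Log4J, Logback, JUL or another – is picked by whichever binding jar is on the classpath at deploy time:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {

    // the facade hands back a logger; the backing engine is resolved at runtime
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId) {
        // parameterized messages avoid string concatenation when the level is disabled
        logger.debug("Placing order {}", orderId);
        try {
            // ... business logic ...
        } catch (Exception e) {
            // the last Throwable argument is logged with its stack trace
            logger.error("Failed to place order {}", orderId, e);
        }
    }
}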

The Competition

We decided to pick five of today’s most prominent logging engines and see how they perform in a number of races. Now, before you take out the torches and pitchforks, I want to clarify that the point is not to say which is better, but to give a sense of the differences in throughput between the engines across a number of common logging tasks.

The Contestants

1. Log4J

2. Log4J2

3. Logback

4. SLF4J Simple Logging (SLF4J SL)

5. Java Util Logging (JUL)

The Race

We wanted to see how the engines compare across a set of standard logging activities. Each logging operation includes a timestamp and a thread ID as its context.

These are the races:

1. Logging a string constant

2. Logging the .toString() value of a POJO

3. Logging a throwable object

4. Logging a string constant without time/tid context

 

The Track

We decided to hold five heats for each race to determine the best score, measuring the number of logging operations completed. In each test we gave the logging engines a task to perform across 10 threads in the space of a minute (the tests ran separately). We then took out the two heats with the biggest deviation and averaged the results of the remaining three.

Between each individual logging operation we gave the CPU some work to do to put some space between the logging operations (checking whether a small random number is prime). The engines are all running behind SLF4J using their default configuration. The benchmarks were run on an Amazon m1.large EC2 instance.
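To make the setup concrete, here’s a rough sketch of what a single heat looks like – the class, the constants and the single race shown here are illustrative, not the actual benchmark code (which is linked at the end of the post):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.concurrent.atomic.AtomicLong;

// One benchmark thread: log through the SLF4J facade for one minute,
// with a small prime check between calls to space the operations out.
public class LoggingHeat implements Runnable {

    private static final Logger logger = LoggerFactory.getLogger(LoggingHeat.class);
    private static final long HEAT_MILLIS = 60_000;
    private static final AtomicLong linesLogged = new AtomicLong();

    public void run() {
        long deadline = System.currentTimeMillis() + HEAT_MILLIS;
        while (System.currentTimeMillis() < deadline) {
            burnSomeCpu();                       // give the CPU some work between log calls
            logger.info("Some constant string"); // race #1: a plain string constant
            linesLogged.incrementAndGet();
        }
    }

    // check whether a small random number is prime - we only want the CPU cycles
    private static void burnSomeCpu() {
        int candidate = 2 + (int) (Math.random() * 1000);
        for (int i = 2; i * i <= candidate; i++) {
            if (candidate % i == 0) {
                return;
            }
        }
    }
}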

Update: during our initial test Log4J2 was configured with a %C qualified class layout which increased its overhead. At the advice of @RemkoPopma we updated the configuration to %c (logger name) to conform with the other configuration, which gave Log4J2 a considerable performance boost, as you’ll see below. It’s definitely something worth paying attention to, and really highlights the cost of logging context data.

 

The Results

Pre config change (%C): [results chart]

Post config change (%c): [results chart]

To see the full dataset – Click here.

Race #1 – String constants

In this race the engines are logging a string constant along with thread and timestamp context. Log4J comes out the clear winner here, writing almost 270% more lines than JUL, 12.5% more than Logback and 52% more than SLF4J SL. It was interesting to note that before we changed Log4J2’s configuration it wrote 4X (!) fewer lines, with the switch boosting it up to #3, with only 30% fewer lines written than Logback.

Race #2 – .toString()

In this race the engines are logging a POJO (via its .toString) along with thread and timestamp context. The results here were much closer, with Log4J2 coming in at #1 with a 25% advantage (post change) over SLF4J SL at #2. Log4J and Logback are neck and neck for the #3 spot, with JUL bringing up the rear at 88% of SLF4J SL’s throughput.

Race #3 – Throwable

In this race the engines are logging an exception object and a description string along with thread and timestamp context. It’s in this race that Log4J2 is on fire, coming in at #1 and logging more than 3X (!) the lines compared to SLF4J SL at #5.

Log4J and Logback are also left in the dust, logging less than half the lines of our esteemed winner. JUL comes in at a solid #2, logging 82% of the lines compared to our winner – not too bad.

Race #4 (running barefoot) – .toString() minus context

When dealing with server logs, each entry’s context (e.g. thread ID, class context, time-stamp, etc…) is almost as important as the content of the entry itself. For the previous races we used two of the most common context elements you’ll find in most server log entries – thread ID and timestamp. We thought it’d be interesting to analyze the overhead of those by running a .toString() race without using any of the engines’ context appenders.

Log4J2 is the winner here (post config change, getting a 180% boost) with a clear 25% lead over both Logback and JUL. SLF4J SL is trailing behind. It was puzzling to see that across the five different heats, SLF4J SL did better with the appenders than without (I’d love to hear your thoughts on it in the comments).

Log4J saw the biggest bump with a 15% increase in throughput. JUL, while not performing as well as Log4J or Log4J2 in this race, delivers almost exactly the same results with and without the context data.

I’d love to hear your comments and inputs. You can also check out the code on GitHub.


 


 

 The 7 Log Management Tools Java Developers Should Know – read more


CI – Know when your code slowed down after deploying a new version – read more 

Using Reflection To Look Inside The JVM at Run-time


Developers: Takipi tells you when new code breaks in production –  Learn more

We’re all used to employing reflection in our everyday work, either directly or through frameworks that leverage it. It’s a core aspect of Java and Scala programming that enables the libraries we use to interact with our code without hard-coded knowledge of it. But our use of reflection is limited to Java and Scala code that runs inside the JVM. What if we could use reflection to look not only into our own code at run-time, but into the JVM’s code as well?

When we began building Takipi, we looked for a way to efficiently analyze JVM heap memory to enable some low-level optimizations, such as scanning the address space of a managed heap block. We came across many interesting tools and capabilities to examine various aspects of the JVM state, and one of them does just that.

It’s one of Java’s strongest and most low-level debugging tools – the Java Serviceability Agent. This powerful tool comes with the HotSpot JDK and enables us not only to see the Java objects inside the heap, but also to look into the internal C++ objects comprising the JVM itself – and that’s where the real magic begins.

Reflection ingredients. When dealing with any form of reflection to dynamically inspect and modify objects at runtime, two essential ingredients are required. The first one is a reference (or address) to the object you want to inspect. The second one is a description of the object’s structure, which includes the offsets in which its fields reside and their type information. If dynamic method invocation is supported, the structure would also contain a reference to the class’s method table (e.g. vtable) along with the parameters each one expects.

Java reflection itself is pretty straightforward. You obtain a reference to a target object just like you would any other. Its field and method structures are available to you via the universal Object.getClass method (originally loaded from the class’s bytecode). The real question is: how do you reflect the JVM itself?

The keys to the castle. Wonderfully enough, the JVM exposes its internal type system through a set of publicly exported symbols. These symbols provide the Serviceability agent (or anyone else for that matter) with access to the structures and addresses of the internal JVM class system. Through these one can inspect almost all aspects of the internal workings of the JVM at the lowest level, including things like raw heap addresses, thread/stack addresses and internal compiler states.

Reflection in action. To get a sense of the possibilities, you can see some of these capabilities in action by launching the Serviceability Agent’s HotSpot Debugger UI. You can do this by launching sa-jdi.jar with sun.jvm.hotspot.HSDB as the main class argument. The capabilities you’ll see are the same ones that help power some of the JVM’s most powerful debugging tools such as jmap, jinfo and jstack.
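On JDK 7 and 8, for example, HSDB can typically be launched with a command along these lines (the exact location of sa-jdi.jar varies between JDK builds, and from JDK 9 onward the same functionality is bundled into the jhsdb tool):

java -cp $JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.HSDB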


* HSDB and some of the extremely low-level inspection capabilities it offers into a target JVM.

How it’s done. Let’s take a closer look to understand how these capabilities are actually provided by the JVM. The cornerstone of this approach is the gHotSpotVMStructs struct, which is publicly exported by the jvm library. This struct exposes both the internal JVM type system and the addresses of the root objects from which we can begin reflecting. The symbol can be accessed just as you would dynamically link to any publicly exported OS library symbol, via JNI or JNA.
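As a rough illustration of the linking step only, here’s how the symbol could be resolved from Java with JNA (assuming the JNA library is on the classpath and that the jvm library can be found on the library search path; parsing what the address points to is the part the Serviceability Agent does for you):

import com.sun.jna.NativeLibrary;
import com.sun.jna.Pointer;

public class VMStructsPeek {
    public static void main(String[] args) {
        // "jvm" resolves to jvm.dll / libjvm.so / libjvm.dylib depending on the OS;
        // you may need to point jna.library.path at the JDK's lib directory
        NativeLibrary jvm = NativeLibrary.getInstance("jvm");
        Pointer structs = jvm.getGlobalVariableAddress("gHotSpotVMStructs");
        System.out.println("gHotSpotVMStructs exported at: " + structs);
    }
}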

The question then becomes how do you parse the data in the address exposed by the gHotSpotVMStructs symbol? As you can see in the table below, the JVM exposes not only the address of its type system and root addresses, but also additional symbols and values that provide you with the values needed to parse the data. These include the class descriptors and binary offsets in which every field within a class is located.


* A dependency walker screenshot of the symbols exposed by jvm.dll

The manifest. The gHotSpotVMStructs structure points to a list of classes and their fields. Each class provides a list of fields. For each field the structure provides its name, type and whether it’s a static or non-static field. If it’s a static field, the structure also provides access to its value. In the case of a static object-type field, the structure provides the address of the target object. This address serves as a root from which we can begin reflecting a specific component of the internal JVM system – things like the compiler, threading or collected-heap systems.

You can check out the actual algorithm used by the Serviceability Agent to parse the structure in the HotSpot JDK code here.

Practical examples. Now that we’ve got a broad sense of what these capabilities can do, let’s take a look at some concrete examples of the types of data exposed by this interface. The folks who built the SA agent went to a lot of trouble to create Java wrappers around most of the classes provided by the gHotSpotVMStructs table. These provide a very clean and simple API to access most parts of the internal system in a way that is both type safe and hides most of the binary work required to access and parse the data.

To give you a sense of some of the powerful capabilities exposed by this API, here are some references to the low-level classes it provides –

VM is the singleton class which exposes many of the JVM’s internal systems such as the thread system, memory management and collection capabilities. It serves as an entry point into many of the JVM subsystems and is a good place to start when exploring this API.

JavaThread gives you an inside look at how the JVM sees a Java thread from the inside, with deep information into frames locations and types (compiled, interpreted, native…) as well as actual native stack and CPU register information.

CollectedHeap lets you explore the raw contents of the collected heap. Since HotSpot contains multiple GC implementations, this is an abstract class from which concrete implementations such as ParallelScavengeHeap inherit. Each provides a set of memory regions containing the actual addresses in which Java objects reside.

As you look at the implementation of each class you’ll see it’s essentially just a hard coded wrapper using the reflection-like API to look into the JVM’s memory.

Reflection in C++. Each of these Java wrappers is designed as an almost complete mirror of an internal C++ class within the JVM. As we know C++ doesn’t have a native reflection capability, which raises the question of how that bridge is created.

The answer lies in something very unique which the JVM developers did. Through a series of C++ macros, and a lot of painstaking work, the HotSpot team manually mapped and loaded the field structures of dozens of internal C++ classes into the global gHotSpotVMStructs. This process is what makes them available for reflection from the outside. The actual field offset values and layouts are generated at the JVM compile time, helping to ensure the exported structures are compatible with the JVM’s target OS.

Out-of-process connections. There’s one more powerful aspect to the Serviceability agent that’s worth taking a look at. One of the coolest capabilities the SA framework provides is the ability to reflect an external live JVM from out-of-process. This is done by attaching the Serviceability agent to the target JVM as an OS level debugger. Since this is OS dependent, for Linux the SA agent framework will leverage a gdb debugger connection. For Windows it will use winDbg (which means Windows Debugging Tools will be needed). The debugger framework is extensible, which means one could use another debugger by extending the abstract DebuggerBase class.

Once a debugger connection is made, the address of gHotSpotVMStructs is passed back to the debugger process, which can (by virtue of the OS) begin inspecting and even modifying the internal object system of the target JVM. This is exactly how HSDB lets you connect to and debug a target JVM – both the Java and the JVM code.
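For a rough sense of what that looks like in code, here’s a hedged sketch of attaching to a target JVM by pid and walking its thread list through the SA wrapper classes (class and method names are taken from the JDK 7/8 sa-jdi.jar and may differ between JDK versions – treat them as assumptions, not a reference):

import sun.jvm.hotspot.HotSpotAgent;
import sun.jvm.hotspot.runtime.JavaThread;
import sun.jvm.hotspot.runtime.VM;

public class SAPeek {
    public static void main(String[] args) {
        int pid = Integer.parseInt(args[0]);
        HotSpotAgent agent = new HotSpotAgent();
        agent.attach(pid);            // attaches to the target JVM as an OS-level debugger
        try {
            System.out.println("Target VM: " + VM.getVM().getVMRelease());
            for (JavaThread t = VM.getVM().getThreads().first(); t != null; t = t.next()) {
                System.out.println("JavaThread @ " + t.getAddress());
            }
        } finally {
            agent.detach();           // the target JVM is suspended while we're attached
        }
    }
}

Note that this needs sa-jdi.jar on the classpath, the same JDK version as the target process, and sufficient OS privileges to attach.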


* HSDB’s interface exposes the SA agent’s capability of reflecting a target JVM process

I hope this piqued your interest. From my own personal perspective, this architecture is one of my favorite pieces of the JVM. Its elegance and openness are, in my view, definitely things to marvel at. It was also super helpful to us when we were building some of Takipi’s real-time encoding parts, so a big tip of the hat to the good folks who designed it.

Ever used one of these APIs in your code? I’d love to hear about it, or answer any questions you may have in the comments below.

More stuff from Takipi:


GitHub’s 10,000 most Popular Java Projects – Here are The Top Libraries They Use – read more


How to tell if an exception is coming from new code – read more


Analyze and debug exceptions – directly from New Relic – read more

Compiling Lambda Expressions: Scala vs. Java 8


Lambda expressions have taken the programming world by storm in the last few years. Most modern languages have adopted them as a fundamental part of functional programming. JVM-based languages such as Scala, Groovy and Clojure have integrated them as a key part of the language, and now Java 8 is (finally) joining in on the fun.

What’s interesting about Lambda expressions is that from the JVM’s perspective they’re completely invisible. It has no notion of what an anonymous function or a Lambda expression is. It only knows bytecode, which is a strict OO specification. It’s up to the makers of the language and its compiler to work within these constraints to create newer, more advanced language elements.

We first encountered this when we were working on adding Scala support to Takipi and had to dive deep into the Scala compiler. With Java 8 right around the corner, I thought it would be interesting to see how the Scala and Java compilers implement Lambda expressions. The results were pretty surprising.

To get things going I took a simple Lambda expression that converts a list of Strings to a list of their lengths.

In Java –


List<String> names = Arrays.asList("1", "2", "3");
Stream<Integer> lengths = names.stream().map(name -> name.length());

In Scala –


val names = List("1", "2", "3")
val lengths = names.map(name => name.length)

Don’t be tricked by its simplicity – behind the scenes some complex stuff is going on.

Let’s start with Scala


The Code

I used javap to view the bytecode contents of the .class files produced by the Scala compiler. Let’s look at the resulting bytecode (this is what the JVM will actually execute).
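If you’d like to reproduce this kind of output yourself, javap’s -c flag prints the disassembled bytecode and -p includes private and synthetic members. The class file name below is the synthetic class from this example and will differ in your own project:

javap -c -p 'myLambdas/Lambda1$$anonfun$1.class'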


// this loads the names var into the stack (the JVM thinks
// of it as variable #2).
// It’s going to stay there for a while till it gets used
// by the .map function.

aload_2

Next, things get more interesting – a new instance of a synthetic class generated by the compiler is created and initialized. This is the object that, from the JVM’s perspective, holds the Lambda method. It’s funny that while the Lambda is defined as an integral part of our method, in reality it lives completely outside of our class.

new myLambdas/Lambda1$$anonfun$1 //instantiate the Lambda object
dup //put it into the stack again

// finally, invoke the c’tor. Remember - it’s just a plain object
// from the JVM’s perspective.
invokespecial myLambdas/Lambda1$$anonfun$1/()V

// these two (long) lines load the immutable.List CanBuildFrom factory
// which will create the new list. This factory pattern is a part of
// Scala’s collections architecture
getstatic scala/collection/immutable/List$/MODULE$
Lscala/collection/immutable/List$;
invokevirtual scala/collection/immutable/List$/canBuildFrom()
Lscala/collection/generic/CanBuildFrom;

// Now we have on the stack the Lambda object and the factory.
// The next phase is to call the .map() function.
// If you remember, we loaded the names var onto
// the stack in the beginning. Now it’ll get used as the
// this for the .map() call, which will also
// accept the Lambda object and the factory to produce the
// new list of lengths.

invokevirtual scala/collection/immutable/List/map(Lscala/Function1;
Lscala/collection/generic/CanBuildFrom;)Ljava/lang/Object;

But hold on – what’s going on inside that Lambda object?

The Lambda object

The Lambda class is derived from scala.runtime.AbstractFunction1. Through this the map() function can polymorphically invoke the overridden apply() whose code is below –

// this code loads this and the target object on which to act,
// checks that it’s a String, and then calls another apply overload
// to do the actual work and boxes its return value.
aload_0 //load this
aload_1 //load the string arg
checkcast java/lang/String //make sure it’s a String - we got an Object

// call another apply() method in the synthetic class
invokevirtual myLambdas/Lambda1$$anonfun$1/apply(Ljava/lang/String;)I

//box the result
invokestatic scala/runtime/BoxesRunTime/boxToInteger(I)Ljava/lang/Integer;
areturn

The actual code to perform the .length() operation is nested in that additional apply method which simply returns the length of the String as we expected.

Phew.. it was quite a long way to get here :)

aload_1
invokevirtual java/lang/String/length()I
ireturn

For a line as simple as the one we wrote above, quite a lot of bytecode is generated – an additional class and a bunch of new methods. This, of course, isn’t meant to dissuade us from using Lambdas (we’re writing in Scala, not C). It just goes to show the complexity behind these constructs. Just think of the amount of code and complexity that goes into compiling complex chains of Lambda expressions!

I was expecting Java 8 to implement this the same way, but was quite surprised to see that they took a completely different approach.

Java 8 – A new approach


The bytecode here is a bit shorter, but does something rather surprising. It begins quite simply by loading the names var and invoking its .stream() method, but then it does something quite elegant. Instead of creating a new object that will wrap the Lambda function, it uses the new invokedynamic instruction, which was added in Java 7, to dynamically link this call site to the actual Lambda function.

aload_1 //load the names var

// call its stream() func
invokeinterface java/util/List.stream:()Ljava/util/stream/Stream;

//invokeDynamic magic!
invokedynamic #0:apply:()Ljava/util/function/Function;

//call the map() func
invokeinterface java/util/stream/Stream.map:
(Ljava/util/function/Function;)Ljava/util/stream/Stream;

InvokeDynamic magic. This JVM instruction was added in Java 7 to make the JVM less strict, and allows dynamic languages to bind symbols at run-time, vs. doing all the linkage statically at compile time.

Dynamic Linking. If you look at the actual invokedynamic instruction you’ll see there’s no reference to the actual Lambda function (called lambda$0). The answer lies in the way invokedynamic is designed (which in itself deserves a full post), but the short answer is that the name and signature of the Lambda, which in our case is –

// a function named lamda$0 that gets a String and returns an Integer
lambdas/Lambda1.lambda$0:(Ljava/lang/String;)Ljava/lang/Integer;

are stored in an entry in a separate table in the .class to which the #0 parameter passed to the instruction points. This new table actually changed the structure of the bytecode specification for the first time after a good few years, requiring us to adapt Takipi’s error analysis engine to it as well.
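You can inspect that table yourself: javap’s -v flag dumps the constant pool along with the BootstrapMethods attribute that the #0 argument of invokedynamic refers to (the class name here is from this example):

javap -v -p Lambda1.class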

The Lambda code

This is the code for the actual Lambda expression. It’s very cookie-cutter – simply load the String parameter, call length() and box the result. Notice it was compiled as a static function to avoid having to pass an additional this object to it like we saw in Scala.

aload_0
invokevirtual java/lang/String.length:()I
invokestatic java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
areturn

This is another advantage of the invokedynamic approach, as it allows us to invoke the method in a way which is polymorphic from the .map() function’s perspective, but without having to allocate a wrapper object or invoke a virtual override method. Pretty cool!

Summary. It’s fascinating to see how Java, the most “strict” of modern languages is now using dynamic linkage to power its new Lambda expressions. It’s also an efficient approach, as no additional class loading and compilation is needed – the Lambda method is simply another private method in our class.

Java 8 has really done a very elegant job in using new technology introduced in Java 7 to implement Lambda expressions in what is a very straightforward way. It’s pleasant in a way to see that even a “venerable” lady such as Java can teach us all some new tricks :)

This post is also available in German, Spanish, French and Portuguese

 


Takipi detects all your exceptions and errors and tells you why they happen. Even across multiple threads and machines. Installs in 1min. Less than 2% overhead – Try it free


Java developers – Deploy Takipi now and get a free T-shirt


Java 8 Exceptions have never been so beautiful – Try Takipi for Java 8

Java Puzzle: Can You Change One Word to Fix This Bug? Win $250



As big fans of the book “Java Puzzlers” by Joshua Bloch and Neal Gafter, we’re always looking for new Java puzzles to solve. Since we haven’t seen good ones in a while, and as a tribute to their Java brainteasers, we’ve decided to add a puzzle of our own. In their own words: “By working through the puzzles, you will become less likely to fall prey to these dangers in your code and more likely to spot them in code that you are reviewing or revising”. And besides, it’s also really fun. To spice it up a bit more, we’ve added a $250 Amazon gift card for the one developer who manages to solve it. In one move. And gives the right explanation as to what exactly happened there. We hope your kung fu is strong today.

The basics

Two teams are playing basketball, each is scoring points against the other and then suddenly – boom and game over. The project structure is pretty simple:


.
+-- Main.java
+-- simulator
|   +-- MatchSimulator.java
|   +-- SimulationVerifier.java
+-- data
|   +-- BoxScore.java
|   +-- BasketballMatch.java

Two game simulator threads invoked by the main class, a simple data structure to hold the scores and the rest is up to you.

Instructions

1. Head over to: https://github.com/takipi/puzzle-bballsim.
2. Clone, compile, run and see it crash and burn.
3. Solve the bug by changing ONE single keyword/token/identifier: it could be by replacing it, adding one, or deleting one.
4. Explain what caused the bug and how your solution fixes it.
5. Send us the solution and explanation to hello@takipi.com by Friday September 12, 12pm (PDT)

Update (12/9): We’d like to thank everyone who participated and attempted to solve the puzzle – hope you had a good time! You can find the correct solution here – follow us on Twitter to get updates on new puzzles.

** Please note that the explanation will be a crucial part of the answer. A wrong or an incomplete explanation may disqualify you even if you fixed the bug.

In case there’s more than one developer who solves the puzzle correctly, we’ll randomly choose one winner. In any case, the right answer will be published the following week and all those who got it right will receive their recognition.

If you have any questions, please post them in the comments section below. Oh, and if you reach an answer, please don’t spoil it for everyone. At least not before the deadline. Good luck! :)

Java Developers – Know when and why production code breaks – read more


Already using Java 8? Try Takipi for Java 8

A New (and Beautiful) Way to View Server Errors


Here at Takipi we’re in the error tracking business. Each day we track more than 500,000 errors coming from hundreds of different companies. Errors across different machines, multithreaded errors, errors involving 3rd party libraries, you name it. Our goal building Takipi was to track code that led to errors and slowdowns. We dreamed of a place where we could view information that was not available before, even in log files.

Dreams Do Come True – Our New Errors Viewer

Takipi records variable values as they were in each frame of the stack. We were looking for a way to allow users to easily scan the stack and spot the variables that caused the error. How cool would it be, we thought, if we could scan the code that led to an error with a single scroll? Pretty cool. So we built it – A new way to view server errors:


Looking Inside the Monster

On the left of this new screen you can see the call stack that led to an exception or a logged error (screenshot below). Frames with a glasses icon are frames with recorded variables. To reduce noise, Takipi only records variable values that might have led to the error. Frames marked with a pencil icon contain code that was recently modified. Since new bugs often crawl out of new code, highlighting these parts can help track them down.

If one machine invoked an API call on another machine which caused an error, you’ll be able to see the entire path. The cross-machine analysis works if you’re sending an HTTP request using Java’s standard APIs or any popular 3rd party library.

The stack trace

The Prime Suspects

Scooting over to the right – the prime suspects. Here you can see your code and get the exact variable values that led to each error as they were when it happened. Quickly scroll over all the methods that led to the error. Recorded variables are highlighted and when hovering you can see the variable value as it was at the moment of the error. If it’s a complex object with fields, you can view them as well.

How do we display code? To provide maximum security for source code, Takipi encrypts the user’s code using a 256-bit secret AES key that can be generated locally.

The code and variables

The Rest is History

If this exception happened more than once you can toggle between different recordings and compare the stack and variable values. For exceptions that happen dozens, hundreds, thousands or millions of times, we limit the recordings to a few dozen in order to keep the overhead minimal.

History

This post is now in Spanish.

You can sign up for a free trial here.

What about slowdowns? Good thing you asked. We’re currently designing Takipi slowdown analysis: building a new concept for highlighting code which led to slowdowns. If you’d like to review it in a closed beta and give us some feedback, contact us at hello@takipi.com. It’s super interesting – promise!

 Try the new UI now:

Send Us Your Best Java Hacks and Win a Kano Computer


What are the most extraordinary pieces of Java code you’ve had the chance to write? Share your story and win a Kano – A Raspberry pi DIY computer kit

Having so much Java code written every day can produce quite a lot of quirks and trouble. It can make you laugh, it can make you cry, but when it finally works it sometimes feels like magic. In this post we’re looking to hear your stories from the trenches of your IDE: What’s the most useful debugging trick you use? What are some of the things you do that most developers aren’t aware of? How did you manage to solve that issue that was bugging you for way too long?

Share your Java war stories with us and win a Kano computer kit

– We’ll go over the stories and select the ones we feel are the most creative, interesting and smart.
– The submissions we like most will be published on the Takipi blog, reaching tens of thousands of developers.
– If we decide to publish your story, you will be contacted to provide a few extra details for the post – and win a Kano kit worth $150.
– Code samples are strongly encouraged (please make sure it’s ok for us to use them in the post).
– Submit your story using the form below or through email: hacks@takipi.com.

The Kano DIY computer kit

Kano is a Raspberry Pi based computer that you make yourself (a Raspberry Pi can also run Java by the way, so you can keep hacking on other devices as well). We’ll contact a few selected participants and keep it rolling from there – please elaborate as much as you can and don’t forget to include relevant code samples.

What makes a good submission?

In previous posts we’ve shared some of the things we learned at Takipi, like reducing GC overhead, Logback tweaks, and most recently, debugging with jstack. jstack is a lightweight tool that shows the stack traces of all Java threads running within a target JVM just by pointing it at a process via its pid. With a few more tweaks described in the post, you can add extra thread state data, like its purpose and transaction IDs for example. In addition, although traditionally it’s a standalone tool, we describe how you can invoke it programmatically whenever a custom condition you define happens (like a throughput decrease, for example):

// Assumed fields not shown in the original snippet: a ScheduledExecutorService named
// scheduler, a counter named adder (presumably a java.util.concurrent.atomic.LongAdder
// or similar), and the APP_WARMUP, POLLING_CYCLE and MIN_THROUGHPUT constants.
public void startScheduleTask() {

    scheduler.scheduleAtFixedRate(new Runnable() {
        public void run() {

            checkThroughput();

        }
    }, APP_WARMUP, POLLING_CYCLE, TimeUnit.SECONDS);
}

private void checkThroughput()
{
    int throughput = adder.intValue(); //the adder is inc’d when a message is processed

    if (throughput < MIN_THROUGHPUT) {
        Thread.currentThread().setName("Throughput jstack thread: " + throughput);
        System.err.println("Minimal throughput failed: executing jstack");
        executeJstack(); //see the code on GitHub to see how this is done
    }

    adder.reset();
}

The full code samples are available on GitHub.

Another cool example we ran into comes from Siddharth Anand on Quora. Have you ever heard about thread migration from machine to machine? Because that’s exactly what Siddharth was experiencing when he implemented a server using RMI (Remote Method Invocation). Due to a bug, the RMI server returned the thread itself instead of returning the result of its computation. This meant the thread started on the server machine, was serialized over the network, and completed its execution on the client machine – resulting in threads that do machine travel, whoa. The bug was identified and fixed by printing out some debug messages, which were then printed both on the server and on the client after the thread jumped over the network.

(edit: Just had to squeeze this one in, a Java multiline string implementation using comments & annotations. Mind. Blown: https://github.com/benelog/multiline/)

Here’s a template to make your submission easier, you can either use the form or send us an email to hacks@takipi.com:

** Can’t see the questions? Click here

** Answers can also be sent by email to hacks@takipi.com

 

5 Error Tracking Tools Java Developers Should Know
Read more


 

Takipi detects all errors in production and shows the variable values as if you were there when it happened
Deploy now and get a free T-shirt


15 Tools Java Developers Should Use After a Major Release


The ultimate survival kit for new deployments

Unlike toying around with zombie apocalypse scenarios and debating the machete versus the shotgun, trouble in Java production environments is quite real, especially after new deployments (but it’s good to be ready for zombies as well). Taking this a step further, it’s much easier to get into trouble today than ever before, with code shipping cycles cut down to weeks, sometimes days, or even multiple deployments a day. To avoid being run down by the zombies, here’s the survival kit setup you need to fully understand the impact of new code on your system. Did anything break? Is it slowing you down? And how do you fix it? Here’s the tool set and architecture to crack it once and for all.

Logging

Other than shrinking release cycles, another property of the modern development lifecycle is ever-expanding log files that can reach GBs per day. Let’s say some issue arises after a new deployment: if you’d like to produce a timely response, dealing with GBs of unstructured data from multiple sources and machines is close to impossible without the proper tooling. In this space we can essentially divide the tools into the heavy-duty enterprise on-premise Splunk, and its SaaS competitors like Sumo Logic, Loggly and others. There are many choices available with a similar offering, so we wrote a more in-depth analysis of log management that you can read right here.

Takeaway #1: Set up a sound log management strategy to help you see beyond the pale lines of bare logfiles and react fast after new deployments.

One logging architecture we’ve found to be super useful after deploying new code is the ELK stack. It’s also worth mentioning that it’s open-source and free.

 


The ELK Stack: ElasticSearch, Logstash and Kibana

So what is this ELK we’re talking about? A combination of Elasticsearch’s search and analytics capabilities, Logstash as the log aggregator and Kibana for the fancy dashboard visualization. We’ve been using it for a while, feeding it from Java through our logs and Redis, and it’s in use both by developers and for BI. Today, Elasticsearch pretty much comes built in with Logstash, and Kibana is an Elasticsearch product as well, making integration and setup easy peasy.

When a new deployment rolls out, the dashboards follow custom indicators that we’ve set up about our app’s health. These indicators update in real time, allowing close monitoring when freshly delivered code takes its first steps after being uploaded to production.

Takeaway #2: Search, visualization and the ease of aggregating logs from multiple sources are key factors in determining your log management strategy.

Takeaway #3: From a developer perspective, evaluating the impact of a new deployment can include BI aspects as well.

Tools to check:

1. On-premise: Splunk
2. SaaS: Sumo Logic
3. SaaS: Loggly
4. Open source: Graylog2
5. Open source: Fluentd
6. The ELK stack (Open source): Elasticsearch + Logstash + Kibana

Performance Monitoring

So the release cycles are getting shorter and log files are becoming larger, but that’s not all: the number of user requests grows exponentially and they all expect peak performance. Unless you work hard on optimizing it, simple logging will only take you so far. With that said, dedicated Application Performance Management tools are no longer considered a luxury and are rapidly becoming a standard. At its essence, APM means timing how long it takes to execute different areas in the code and complete transactions – this is done either by instrumenting the code, monitoring logs, or including network / hardware metrics – both at your backend and on the users’ devices. The first two modern APM tools that come to mind are New Relic, who just recently filed their IPO, and AppDynamics.

AppDynamics on the left, New Relic on the right – Main dashboard screen

Each traditionally targeted a different type of developer, from enterprises to startups, but as both are stepping toward their IPOs and after experiencing huge growth, the lines are getting blurred. The choice is not clear-cut, but you can’t really go wrong: on-premise = AppDynamics; otherwise it’s an individual call depending on which better fits your stack (and which of the features they offer you actually think you’re going to use). Check out the analysis we recently released that compares these two head to head right here.

Two additional interesting tools that were recently released are Ruxit (by Compuware) and DripStat (by Chronon Systems), each coming from larger companies with their own attempt to address the SaaS monitoring market pioneered by New Relic. Looking into hardcore JVM internals, jClarity and Plumbr are definitely worth checking out as well.

Takeaway #4: New deployments may affect your application’s performance and slow it down. APM tools can provide an all-around overview of your application’s health.

Tools to check:

7. AppDynamics
8. New Relic

New players:

9. jClarity
10. Plumbr
11. Ruxit
12. Dripstat

Debugging in Production

Release cycles are down, log files grow large, user requests explode, and… the margin for error simply doesn’t exist. When an error does come – you need to be able to solve it right away. Large-scale production environments can produce millions of errors a day from hundreds of different locations in the code. While some errors may be trivial, others break critical application features and affect end-users without you knowing it. Traditionally, to identify and solve these errors you’d have to rely on your log files or a log management tool to even know an error occurred, let alone how to fix it.

With Takipi, you’re able to know which errors pose the highest risk and should be prioritized, and receive actionable information on how to fix each error.


Looking at errors arising after new deployments, Takipi addresses 3 major concerns:

1. Know which errors affect you the most – Detect 100% of code errors in production, including JVM exceptions and log errors. Use smart filtering to cut through the noise and focus on the most important errors. Over 90% of Takipi users report finding at least one critical bug in production during their first day of use.

2. Spend less time and energy debugging – Takipi automatically reproduces each error and displays the code and variables that led to it – even across servers. This eliminates the need to manually reproduce errors, saves engineering time, and dramatically reduces time to resolution.

3. Deploying without risk – Takipi notifies you when errors are introduced by a new version, and when fixed errors come back to haunt you.

Takeaway #5: With Takipi you’re able to act quickly to resolve any issue, and you’re no longer in the dark after a new release.

Tools to check:

13. Takipi

Alerting and tracking

Release cycles, log files, user requests, no margin for error and… how are you going to follow up on it all? You might think this category overlaps with the others, and the truth is that you’re probably right. BUT, when all of these tools have their own pipelines for letting you know what went wrong, it gets quite cluttered. Especially in the soft spot after a new deployment, when all kinds of unexpected things are prone to happen (which are gentler words for… all hell breaks loose).

One of the leading incident management tools that tackles this is PagerDuty: collecting alerts from your monitoring tools, creating schedules to coordinate your team, and delivering each alert to the right person through texts, emails, SMS or push notifications.

Takeaway #6: Consider using an incident management system to handle information overload.

A specialized tool we really like using here is Pingdom (which also integrates with PagerDuty). What it does is quite simple and just works: tracking and alerting on our website’s response times 24/7, answering a crucial question that seems trivial – is the website available? – by probing it from different locations all over the globe.

All systems are go!

Another angle to tackle information overload is error tracking that goes beyond the features of log analyzers: smart dashboards to manage your exceptions and log errors, aggregating data from all your servers and machines into one single place, either through your log events or other plugs coming from your code. For a deeper dive into the error tracking tools landscape, check out this post that covers the most popular options.

Takeaway #7: Code errors come in all shapes and sizes; it’s worth giving them some special treatment with an error tracking tool (and smashing some bugs while we’re at it, muhaha).

Tools to check:

14. PagerDuty
15. Pingdom

Conclusion

We’ve experienced first hand how modern software development affects the release lifecycle and zoomed in on how you can assess the impact of new rapid deployments – when new code can come in before you’ve even fully understood the last update’s impact. In the grand scheme of things, any tool you consider should address these 5 characteristics:

  1. Shrinking release cycles
  2. Expanding log files
  3. Growing user requests
  4. Smaller margins for error
  5. Information overload

Most importantly, think of how you’re handling these today and which takes up too much of your time. Chances are that there’s a tool to solve it.

This post is also available in Spanish

Know great Java hacks? Always wanted to share coding insights with fellow developers?
If you want to contribute to Takipi’s blog, and reach more than 80,000 developers a month, we’re looking for guest writers! Shoot us an email with your idea for a guest post: hello@takipi.com.


Takipi detects all errors in production and shows the variable values as if you were there when it happened
Deploy now and get a free T-shirt


The sneaky bug that got an entire production environment down
Read more



Deploying Code Fast? Here’s How to Tell If You Broke Something



Here at Takipi, we’re in the error tracking business. Each day, Takipi is used to track more than 500,000 errors across hundreds of different companies. The most critical and fragile stage for many apps is just after a new deployment – when code changes are tested for the first time under a high stress load and with full production settings. Takipi detects your code changes automatically, and 87% of users report finding new unknown bugs in production via Takipi within the first hour after deploying a new version.

Takipi analyzes your code at the JVM level, and doesn’t rely on pulling log files from your machine. This helps companies collect more data on problems in production with minimal added CPU and IO overhead.

How can you deploy more safely?

  • Track all new server errors – uncaught and caught exceptions, logged errors and HTTP errors. Easily see a list of all errors that occurred for the first time after a new deployment. Takipi automatically identifies new deployments (based on changes to existing code or new code added) and tells you if exceptions were thrown from modified code.
  • Real-time analytics give you all the stats you need to decide whether an error is critical or not.
  • See how new deployments affect your code. Know if an exception that used to happen 10 times a day is now occurring 1000 times a day.
  • Make sure an error you patched was indeed fixed and doesn’t spring up again.

How does it work?

Takipi is a Java agent that monitors all production errors and shows all the code and variable values that led to them. Right after an error is detected, Takipi will display all the data you need to prioritize it and fix it.

Error Analysis in Takipi

See a list of errors that started after a new deployment

For each error, get the following stats:

  • Error location and root location.
  • How many times it happened and the fail rate.
  • Does it involve recently modified code?
  • Which server threw the error

View a sample error analysis right here.

See the exact stack and variable values as they were when the error occurred

Takipi’s core technology is around creating a full “replay” of each exception. You can see the entire call stack, including 3rd party methods if desired, and click on each method to view the variable values as they were when the exception happened. Takipi records all variable types and captures them up to 5 levels deep in the heap.

Compare an error between different code versions. Know if it happens more often than it used to

Takipi shows you error trends – allowing you to compare the number of errors and fail rates across different deployments. If an error happened multiple times (the usual case), you can view historical records of the same error and compare values, even between different code versions.

Get daily trends summary – know if something bad started

Once a day (or at a frequency you decide) Takipi produces a summary of server errors that might indicate a critical problem for you to review.

These summaries show you a list of new errors that started today and where they’re coming from. Get highlights on errors that have increased dramatically and are now happening more frequently than before.


View errors coming from 3rd party code.

Your code doesn’t exist in a vacuum. Sometimes, the reason for your code breaking lies with changes to 3rd party libraries. In these cases, it can take a long time to understand what happened. In Takipi, you can add monitoring for 3rd party libraries and install it on code bases like Hadoop/Spark/Kafka to discover exceptions coming from there. If an API becomes unavailable or slows down, you’ll know right away.

You’re 10 minutes away from discovering unknown errors. Install Takipi now to find them:

Takipi Blog: Top 10 Posts of All Time



The top posts on the journey from 0 to 100k unique monthly visitors on the Takipi blog

It has been just over 2 years since the launch of the Takipi blog, and recently we celebrated a nice milestone – crossing 100k readers a month.

When publishing new posts, we always try to guess whether they’ll get traction, become popular, or start a live discussion between industry experts and the people behind the tools and technologies we’re writing about. In some cases we were able to anticipate the traction a post would receive; in other cases the success was quite surprising. We thought this would be a good chance to celebrate this milestone, reflect, and plan the road ahead by revealing the list of our most popular posts so far:

1. We Analyzed 30,000 GitHub Projects – Here Are The Top 100 Libraries in Java, JS and Ruby

Based on some data crunching over GitHub’s data, we were able to extract the most popular libraries that Java, JS and Ruby projects use. Since then we’ve re-run the Java benchmark and also reported on more recent results.

2. Compiling Lambda Expressions: Scala vs Java 8

One of the hottest recent additions to Java is lambda expressions, which were already available for a while in functional programming languages like Scala. In this post we took a look at the two from the JVM’s perspective, back when we added Scala support to Takipi – and survived to share the findings.

3. 5 Features in Java 8 That WILL Change How You Code

Lambda expressions were definitely the most hyped feature in Java 8. But not the only one of course. In this post we ran through some of the top features that most likely already changed the way Java developers code: With the streams and parallel operations, and the date and time API among others.

4. Sublime vs Atom: Text Editor Battles

Woah, this is a thorny one. With GitHub’s Atom recently reaching version 1.0, this dilemma got back into the spotlight (btw, if you haven’t already – you just have to watch this video). And please don’t get us started on vim vs. emacs.

5. 15 Tools to Use When Deploying New Code to Production

The production quality tooling landscape is going through some major changes. If not too long ago you’d pretty much be left in the dark after deploying your code to production, today the situation is completely different. In this post we took a look at the tools modern teams use to keep an eye on what’s going on in their production environment.

6. AppDynamics vs New Relic – Which Tool is Right For You? The Complete Guide

The Application Performance Monitoring (APM) toolkit has also been going through some major changes since the entrance of the SaaS model, and many teams struggle with making the right decisions about the tools they choose to use. Here we discuss the differences between AppDynamics and New Relic and hope to help you reach the right decision for your environment.

7. The Dark Side Of Lambda Expressions in Java 8

Back to our beloved lambdas – not everything is so shiny and bright. In this post we examined some of the less talked about downsides to using lambdas in Java 8. When you look at the stack traces that errors in lambdas produce, things get a bit darker.

8. The 7 Log Management Tools You Need to Know

High scale systems produce tons of log data and it’s getting super hard to manage it without some log management tool in place. In this post we’ve showcased some of the more popular choices that you have in front of you today when making the call to keep your log files in check.

9. Garbage Collectors – Serial vs. Parallel vs. CMS vs. G1 (and what’s new in Java 8)

Garbage collection is definitely one of the more interesting topics of the JVM. With recent news around the debate over making G1 the default garbage collector in Java 9, it’s a great opportunity to catch up on the available garbage collectors and their features (yes, contrary to popular belief, there’s more than one).

10. Java Bootstrap: Dropwizard vs. Spring Boot

This post takes on the question of how to get a production-ready Java application off the ground in the shortest time possible, favoring convention over configuration and taking off through the decisions made by the 2 most popular lightweight Java frameworks today: Dropwizard and Spring Boot.

Conclusion

We’d like to give a huge shout out to the community – all the developers, authors, and industry experts that help us along the way. We’re happy to be able to help you make sense of the software development landscape from a hands-on developer perspective. Be it core Java, Scala, JVM internals, news, benchmarks, devops practices or tools – we’re hungry to keep digging for insights and look forward to sharing our research with you.

Happy reading!

InfoQ Interview: Making Your Java Application Production Ready


Tal Weiss at InfoQ

InfoQ interviews Tal Weiss: Debugging strategies for release cycles on steroids

Developer teams want to push code much faster and see its impact in production. Many times it’s also simply the only way for their companies to stay relevant and keep up with the competition. In order to be able to do that, you need to go beyond implementing Continuous Delivery / Continuous Integration practices and create a production debugging strategy to keep your application afloat.

Fast and furious debugging

In a recent interview with InfoQ’s lead Java editor Victor Grazi, Tal Weiss, Takipi’s CEO, shares some of his insights about the state of these new demands, and what has to be done in order to make your fast changing application – production proof:

“The main challenge we’ve seen is that companies don’t prepare in advance as to how they’re going to debug and monitor their situation. They ship; issues start happening and then it becomes this really kind of wild chase of trying to get the right tooling in, trying to understand what’s broken with the architecture, how to optimize things at the software level, the GC level or the JVM level.”

“The biggest advice that I can give to any company is make sure that while you’re building out the software, while you’re designing the new application, the new system, already have a monitoring, a debugging, a logging, a devops strategy in place that you build in tandem with the actual code that you’re writing. So come launch day, you will have a very good level of visibility into what’s going on. That will really help you and save you a lot of time when things break down, when you’re pushing your code and things happen, and you have the tooling and the methodologies in place.”

“These will also change and scale and mutate as the application grows and matures. But just having that frame of thought in from day one I think is something that is really a good practice to adopt.”

Logging is not what it used to be

Moreover, Weiss goes into the future of logging, and how this rather static area, which hadn’t changed much until recent years, is going through a revival:

“One of the biggest changes that are happening especially within that sphere is that if up until a few years ago, the main consumer for a log file was an engineer, essentially who is just grepping through a log file, just looking for something, looking for a pattern, looking for bits and pieces of information that will help him or her debug something, troubleshoot something. Now the primary consumer for logs has changed. The primary consumer for a log file nowadays for the most part is a machine who will essentially take that logging stream, that stream of logging messages that are being put into the file and begin to glean meaning from that; visualize trend information, anomalies.”

“There’s been a lot of movement within a lot of companies moving towards metric-driven development. So now we can automatically pull metrics instead of putting a lot of information to the log file with the hope that somebody can process that later and then visualize it for us. Tools like Grafana, DataDog and Librato, are now encouraging us to send metrics directly from our application into the devops dashboard via industry standard protocols.”

The full interview and transcript are available on InfoQ – Play video

To dive deeper and go beyond the 10,000 foot view, you can also check out this hands-on session on how to make your own code production-ready. And to top it up, also be sure to check out this roundup of the most advanced Java debugging techniques you can (and should) use today.

We hope you’ll find these resources useful and if you have more questions that you’d like to be covered, please feel free to use the comments section below!

Duke on logs

Say goodbye to digging through logs to solve Java errors – Read more

Stackifier: Make Sense of Your Stack Trace


 

Stackifier

How to make sense of your stack trace and solve errors faster

Developer experience, as in User Experience where the user is a developer, is often neglected. Many of the workflows and experiences around crafting code, debugging, testing, monitoring and the whole deployment process are really rough around the edges to say the least. This mainly comes from the need to get full control of what’s going on, at the expense of making the experience smoother and enjoyable. But it doesn’t have to be like this. In fact, it shouldn’t be like this.

That’s why we decided to build Stackifier, a free online tool where you can paste your stack trace and make sense of it. It creates a new representation with a readable design that highlights the information you need to draw out – without compromising on seeing the full picture – and compacts a never-ending stack trace that can reach hundreds of lines. This is actually a very small part of Takipi, so we thought it would be interesting to focus on it.

Before:

Stack trace

After:

Stackifier

Minimal information loss, maximal value

Your first task when seeing a raw stack trace is to trace where it touches your code for the first time – where the exception was thrown from in this case. The 3rd party libraries along the way aren’t really interesting, so we grouped them by their top packages and differentiated them by color.

Still, sometimes you wouldn’t want to completely lose track of what’s going on in these 3rd party libraries, so we kept that information available. Each grouping can be expanded to view the frames inside the stack trace:

As to the actual frame where the exception is thrown from, we marked it with a flame. In addition, you’ll notice that there’s much less text. Stackifier only shows the SimpleName of the class and method that represents that frame in the stack trace. And the line number is highlighted on the right of each line. Much more fun, useful, and fast than making sense of mountains of text.

Check out Stackifier and visualize your stack trace

Logs remained exactly the same for over 20 years

Think of all the time that’s wasted on understanding or doing things that only exist because they made sense once, and were the right solution then, but haven’t evolved since. You’re used to seeing stack traces in your logs, so it’s easy to “forget” they’re produced by the application you’re running and simply printed out to be stored on file.

Stack trace inception

And what else are you looking for when you grep for that sequence of frames you need to be looking at? Whatever information you’ve decided to print out in log messages that would help you understand the state of those frames. That information is also drawn out from your application. So what if we could skip the log altogether, not write to file at all, and have a smart stack trace data structure that already contains the state of each frame? We could stop using logs, and that’s exactly why we’ve built Takipi for Java. Takipi is a Java agent that knows how to track uncaught exceptions, caught exceptions and log errors on servers in production. It lets you see the variable values that cause errors, all across the stack, and overlays them on your code.

Conclusion

Developer experience is a top priority and the future of developer tools is looking both smart and beautiful. And there’s no reason to compromise on any of these parameters. In fact, when these compromises do get made, precious time can be lost on tedious tasks instead of focusing on innovating and building new features.

Takipi 2.0: Trends and DevOps Integrations


Takipi 2.0

Visualize all log events and exceptions in production directly from your JVMs – send them as metrics into more than 20 graphing and alerting tools.

We’re introducing two major features that we thought were worth sharing – Trends and DevOps Integrations. We began building Takipi because we were tired of using traditional logs to understand what was going on inside our software in production. We wanted to be able to “see into” each error in staging or production when it happened.

We also wanted to have much better detection capabilities to know when things broke down, either because of code or environment changes – without continuously scanning and analyzing massive log files. This new set of features brings us much closer to that vision, which is why we’re pretty excited about it. So let’s get to it :)

Trends

Trends enables you to visualize any combination of log errors, warnings and exceptions in production JVMs without modifying your code, and without sending log files for analysis. The JVM agent detects and tracks all log events and exceptions, regardless of if and how they’re logged to file. Events can be visualized and alerted on using Takipi’s front-end, or through any graphing or alerting tool (e.g Graphite, PagerDuty) that you use. More on that below.

As these metrics are already tracked by the JVM agent today, there’s no additional overhead to this feature (learn more about how Takipi works here). What’s new is that this information is now being reported out of the agent into Takipi’s front-end (or any front-end for that matter, more on that below).

Takipi Dashboard

Events are automatically reduced into metrics at the JVM level. You can see exactly how many times an event has happened, and out of how many calls into the code. You can filter events based on their type – caught / uncaught exceptions, logged errors, warnings and more. You can further filter events based on their location in the code, when they started, or their frequency.

Another handy capability lets you correlate related events. This is really useful if you’re looking at a spike in the number of log errors or warnings. You can select the top 5 errors during the spike and add them to the graph and see each one’s contribution. This provides you with a powerful form of operational analysis, without having to adjust your logging or analyze massive logs.

Takipi trends

You can narrow down information to focus on events from specific machines or JVMs. You can for example compare the number of errors experienced by JVMs serving different customers in a multi-tenant architecture, or different versions of your application.

Once you’ve focused on a target event you can drill into its root cause in the code. For each event, you can see its most recent error analysis – complete with stack, source and variable state that led to the error.

Integrations

This new capability opens the data captured by the JVM agent to any graphing and alerting tool. This is done via StatsD, an open-source protocol and implementation developed by Etsy which enables applications to transmit metrics in a non-blocking way into any monitoring tool. We chose StatsD because it’s widely used, open-source, and has backends for almost any monitoring tool on the market.
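To make the protocol concrete, here’s a minimal sketch of the StatsD wire format itself – an illustration only, not Takipi’s agent internals. It assumes a StatsD daemon listening on its default UDP port 8125 on localhost, and increments a hypothetical counter:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal illustration of the StatsD plain-text protocol: a counter increment
// sent over UDP. The host, port and metric name here are hypothetical examples.
public class StatsDExample {
    public static void main(String[] args) throws Exception {
        String metric = "myapp.errors.uncaught:1|c"; // "<name>:<value>|c" marks a counter
        byte[] payload = metric.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    new InetSocketAddress("localhost", 8125)));
        }
    }
}

Because the metric is fire-and-forget UDP, the sending application never blocks on the monitoring backend – which is the property that makes StatsD a good fit for high-volume error metrics.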

This enables you to track any series of critical errors in your application such as “All uncaught exceptions”, “DB errors In background tasks”, or “Logged errors in Spark RDDs”, and have the JVM agents transmit these as metrics into Graphite, or receive alerts on them through PagerDuty or Zabbix if they exceed a target threshold.

Let’s look at a real-world example:

In this configuration we’re tracking business metrics in Grafana against logged warnings and errors. You can see how the increase in log.warn begins to reduce throughput. Past a certain threshold, the application begins breaking, with a sudden decrease in the number of processed RPMs, and an increase in the number of errors. We can visualize millions of log events from the JVM in real-time, overlaid over the critical business metric, without having to continuously parse massive logs. Connecting these metrics into an alerting system such as Zabbix or Nagios lets us respond in real-time.

Takipi StatsD

Takipi provides close to 20 back-end integrations out of the box, and you can easily add your own. We look forward to seeing how you’ll use these new features, and to hearing your questions and feedback in the comments section below.


Try the new integrations now:

 

What Our $15M Series B Funding Really Means for Takipi


Takipi

Today we announced our $15M series B funding, which was led by Arif Janmohamed at Lightspeed Venture Partners, with participation from our series A investor Venky Ganesan from Menlo Ventures – two great people who will have a huge impact on our ability to execute and grow.

First and foremost, we’d like to say thanks to our early customers and users for believing in the product and team from the early days of Takipi. Your continuous feedback and the personal stories you share motivate us to keep going and delivering the best possible solution for real-time event intelligence.

This funding was significant for several reasons. The first was that it validated to us, our customers and the investment community that Takipi had achieved product market fit. It’s critical for any startup to have their product, technology and use cases validated before any investment is made in sales and marketing. In just one year since GA Takipi has captured over 120 customers who are changing the way they troubleshoot production applications. These customers range from small startups to large enterprises like Samsung, RMS, HP Enterprise and Amdocs. I’m immensely proud of the whole Takipi team for getting us to this exciting milestone.

In our previous company VisualTao (which was acquired by AutoDesk) we experienced first hand the pain of developing and troubleshooting software for large-scale production systems that were dynamic, distributed and complex. I’ve likened this experience to performing open heart surgery on a train that’s running 100mph – somewhat time-consuming and scary.

While modern software architectures are built for speed and scale, logging and log files have remained pretty much the same for over 20 years. No matter how you look at it, logs are still too verbose and complex to analyze. Unstructured text files by nature are limited by the type of data they hold, and are growing at an exponential rate. They have also become expensive to parse, index, store and manage.

We built Takipi to solve the limitations, cost and pain associated with analyzing log files. Takipi does this through monitoring applications at the lowest level, and extracting rich smart data in real-time. Unlike plain text log files, Takipi dynamically captures the exact application state at the moment of each error, including the complete application stack trace, source code, objects, and variables that caused it.

Our approach is radically different from (and more exciting than) what other vendors are doing in this space. Our mission is to tell you When, Where and Why your application code breaks in production. We think we can help you do that 10X faster than how you’re currently doing it, and we’ve got over 120 customers in production who’ve told us the same thing. We believe in collecting the right data (smart) versus collecting the commodity data (big) which lacks context, granularity and insight. Remember – root cause analysis is only as good as the data you collect and analyze.

Now is the time for us to step on the gas and invest in our sales and marketing efforts. The response so far from candidates has been overwhelming; if you’d like to join our sales or marketing team, feel free to reach out to me at tal@takipi.com.

Tal.

Where Does Takipi Fit Into My Ecosystem of Tools?


Takipi Ecosystem

There are three common ways to monitor your application:

  • Application Performance Monitoring (APM)
  • Log analyzers
  • Error tracking tools

Most companies will use several tools to get the full picture of their production environment.

Whether you’re already using a certain tool (or 3 different ones), Takipi can seamlessly coexist and integrate with your existing ecosystem, providing unique data and insight to figure out when and why your application code breaks in production. We can even inject links to this insight into your existing log files.

In this post, we’ll see how Takipi fits in with APMs, log analyzers and error tracking tools.

Where Does Takipi Fit Into My Ecosystem of Tools

What is APM and what is it good for?

Takipi APM

APM tools provide us with analytics around our application’s performance. At the core this means timing how long it takes to execute different areas in the code, and how long it takes for certain transactions to complete. This is typically done using bytecode instrumentation (BCI) to tag and trace transactions and instrument call stacks, so that you know which method or service request is slow.

AppDynamics Dashboard

APM enables you to see what the end users are experiencing inside your application. You’ll be able to monitor production environments, application loads (transactions, requests and pages per second), calculate response time and pinpoint what might cause a delayed response.

The top names in the APM space are New Relic, AppDynamics and Dynatrace (Ruxit). New Relic focuses on SMBs and startups, while AppDynamics and Dynatrace are more enterprise-driven, with higher pricing rates. They all give pretty much similar monitoring options, each with its own benefits depending on your specific use case. You can read our comparison posts about AppDynamics vs New Relic or AppDynamics vs Dynatrace.

There are other names in this category, such as AppNeta, Stackify, Scout, DripStat, InfluxData, perfino and Correlsense, and the list goes on and on. Smaller companies might prefer them due to pricing or more basic expectations from a monitoring tool.

Takipi + APM = See Everything (Stack, Source Code & State)

Wait a minute, doesn’t my APM solution already show me stack traces, source code and variables? I’m afraid it doesn’t, and if it does, you’re looking at incomplete data.

APM will show you a stack trace for every error and exception that your application throws. At the top of every exception it’s also going to tell you the class and method where it originated. This is identical information to what log analyzers also collect from the application run-time. With APM you know when and where the problem is.

Takipi will also show you a stack trace for every error and exception. However, it will also show you the complete source code, objects, variables and values which caused that error or exception to be thrown. At Takipi we call this identifying the root cause of when your code breaks.

You can also integrate Takipi alongside your APM and keep your current workflow, inside your existing dashboard. For example, you can use our New Relic plug-in that puts a Takipi link next to every error and exception shown in New Relic (see below screenshot).

Takipi injects a hyperlink next to each exception, and you’ll be able to jump directly into the source code and actual variable state which caused it. Pretty cool, huh?

Takipi Code Snippet

Takipi can co-exist in production alongside all major APM agents and profilers, allowing you to enjoy maximum visibility and productivity, as each tool provides its own unique capabilities. Using Takipi with your APM lets you monitor server slowdowns and errors, along with the ability to drill down into the real root cause of each issue.

Log Management to The Rescue

Log-Management

Your log files contain a large amount of data you have to sift through if you want to find out what happened inside your application. This includes machine data, business metrics such as sales transactions and user behavior, and information about product-related issues.

Splunk and ELK (Elasticsearch-Logstash-Kibana) are dominating the log management space with the most comprehensive and customizable solutions. You’ll find other names such as Sumo Logic, Papertrail, Graylog, Loggly, Logentries, XpoLog and Logscape, who offer competitive pricing but don’t compare to the 2 bigger tools.

The strongest use case for logs is troubleshooting, when your log files include logged errors, warnings, caught exceptions and, if you’re lucky, notorious uncaught exceptions. In most cases, you’ll have to dissect the information in order to understand what went wrong in the execution of your code in production.

How Takipi Enhances Log Management

Takipi in Logs

The main challenge when using logs is that they often contain an unmanageable number of entries and require you to manually find the needle in the haystack. Usually you’re left with the error message you’ve attached, and the stack trace at the moment of error. That’s just not enough to understand the real root cause of the error, and you’re sent off to an iterative process of trying to replicate the same error, adding more info to the log, and so on.

Takipi helps in the debugging process, through inserting hyperlinks into your existing log files, so ops and developers can instantly see the stack, source and state behind each event.

Takipi also de-duplicates the contents of log files to reduce the operational noise and time associated with analyzing them. Detecting exceptions and logging errors happens in real-time, at the JVM level without relying on parsing logs.

Error Tracking for Better Insights

Error Tracking Companies

Following and analyzing your errors can give you better insights about your application. Using it, you’re able to see which exceptions were thrown, get real-time crash reporting and find important issues even if you have a huge code base.

Among the names relevant to this category, you’ll find airbrake.io, bugsnag, Raygun, Sentry and Rollbar. These tools capture and track your application’s exceptions, giving you a friendly dashboard instead of the long and messy log. Most tools allow you to filter the relevant exceptions and errors, giving you a clean look at the state of your application.

Making Error Tracking Better

Takipi is the best choice when it comes to error tracking. While all tools can tell you when your code breaks in production, Takipi can tell you why it happened. Takipi’s dashboard shows you the code and the variable state of errors right when they occur, and automatically detects code deployments and alerts you when they introduce new errors.

If you want to be able to get the full path and cause of your errors, along with the actionable information you need to fix them – you should use Takipi.

Takipi can supercharge your monitoring ecosystem

It doesn’t matter if you’re using an APM, log management tool, an error tracker or all 3 tools – Takipi will complement and not replace what you already have.

Takipi runs as a native Java agent and does not require code changes, binary dependencies, or build configurations. The installation process takes less than five minutes and you’ll be able to see the root cause of every application error within a few minutes.

You can sign up for your free trial right here.


Introducing Takipi’s New Alerting Engine: Get Alerted on Critical Errors When They Matter the Most


Takipi Alerts

New Takipi feature: Critical error alerts via email, Slack, HipChat, and PagerDuty

We’re pretty excited this week, as we just launched the new alerts engine we’ve been working on in the last couple of months. The goal behind this is to allow you to get notified on the most critical errors in real-time, and fix them before they affect your users.

In this post, I wanted to share some of the rationale that guided us in building this feature, and show you some of the things it can do.

Why Alert?

Listening to our customers, developers and ops teams alike, one of the most requested features was related to alerts. Folks wanted to be alerted on errors in context – “Let me know if a new deployment introduced new errors, or alert us when the number of logged errors or uncaught exceptions exceeds a certain threshold, so we can fix it in real time”. So we went ahead and built it.

The main idea behind building this feature was allowing each user to customize when and how they receive alerts. Naturally, a Dev team lead and a Devops engineer might have different requirements and different things they want to be alerted on. We wanted to allow for maximal personalization, while keeping the UI simple and intuitive, building on top of our new views dashboard.

Getting Started

The new views pane in the dashboard allows users to cut through the noise, and focus only on the events that they’re interested in. For example, if you’ve deployed a new version, and want to see whether the new deployment introduced new errors that need to be dealt with, just click on “new today”.

If you’re only interested in critical uncaught exceptions, there’s a view for that as well, and you can add customized views on top of that. Our team, for example, created a dedicated view for NullPointerExceptions, as we hate those, and want to be able to zoom in on them without noise.

The alerting mechanism uses these views to send contextual alerts in each use case. If you only want to see errors on new deployments, on NullPointerExceptions, or on Uncaught exceptions, you can just click on the bell icon next to that view. You can set alerts and thresholds for any view, and equally important – Decide how you want to be notified.

Common Use Cases

The alert settings dialog consists of two sections – Setting alerts, and configuring how to receive them. Takipi integrates out-of-the-box with a variety of messaging and alerting tools, including Slack, Hipchat and PagerDuty.

For each of the predefined or custom views in my dashboard, Takipi can be set to send two types of notifications: alert on any new error, or when the total number of errors in the view exceeds a target threshold within a rolling hour window.

Let’s say we’re deploying a major version. We’d probably want to get notified if this deployment creates any new errors (Scenario 1).

On the other hand, even if that deployment has not introduced any new types of errors, it’s possible that something went wrong, and the log errors and warnings that already existed in the system (part of the control flow) have spiked, or significantly increased in rate (Scenario 2). In that case, we’d also want to take a deeper look into what caused the spike – something that looks like this graph:

Graph

Let’s examine how we can address both of these scenarios:

1. To set up an alert for any new errors introduced today, I can click on the bell icon next to the “new today” view, and enable it:

Presets
This will open the new alerts dialog. Here I can choose how I want to receive this alert. Since new errors are critical for us, I choose to have them sent via email, Slack, Hipchat and PagerDuty, so there’s no way we’d miss them.

Alert Settings Screen

Now, we’ll be alerted on each new error introduced by a new deployment. We can also group these alerts into email digests to avoid multiple alerts:

Error digest

2. To support our second scenario, we can use the new thresholds feature. Let’s say our baseline for logged warnings is 1,000 messages per hour, but if it passes 2,500 messages in a rolling hour window, it means something went wrong.

In this case we can simply go to the “Log warnings” view in the alerts dialog, and set the hourly threshold for 2,500. Since this alert may not be critical for us, we only chose to receive it on our shared Slack channel:

Alert Settings

Here’s what the alert would look like if that threshold is exceeded –

Takipi Slack Alert

Clicking on it will take me to the exact timeframe in which these errors spiked, so I can zoom in on exactly what caused it. I can use the Split function to see the actual errors and transactions contributing to the spike.

In the same manner, I can create my own customized views and be alerted whenever new errors are introduced into them or exceed a certain threshold within rolling hour windows. For example, we added a dedicated view only for NullPointerExceptions, since we have a zero tolerance policy for those. Then we set a threshold of 1, so that every time a NullPointerException is thrown in our production environment, we’ll get alerted:

Takipi Alerts Threshold

Final Thoughts

This feature has a lot of neat little tricks and customizations. For further reading click here. Have any feedback? Think we should add some more capabilities to this feature? Feel free to leave us a note here, or email me at ophir.primat@takipi.com.

The Top 10 Exception Types in Production Java Applications – Based on 1B Events


Duke-Bug

The Pareto logging principle: 97% of logged error statements are caused by 3% of unique errors

We received a lot of feedback and questions following the latest data crunching post, where we showed that 97% of logged errors are caused by 10 unique errors. By popular demand, we’ll go a step deeper into the top exception types in the over 1,000 applications that were included in this research.

Let’s roll.

(btw, this is our first post with a recommended soundtrack, check yo’ self)

Without Further Ado: The Top Exceptions by Types

Top10Exceptions

To pull out the data, we crunched anonymized stats from over 1,000 applications monitored by Takipi’s error analysis micro-agent, and checked what the top 10 exception types were for each company. Then we combined all the data and came up with the overall top 10 list.

Every production environment is different, R&D teams use different 3rd party libraries, and also have custom exception types of their own. Looking at the bigger picture, the standard exceptions stand out and some interesting patterns become visible.

Check Your Self
True dat

1. NullPointerException – 70% of Production Environments

Yes. The infamous NullPointerException is in at #1. Sir Charles Antony Richard Hoare, inventor of the null reference, was not mistaken when he said:

“I call it my billion-dollar mistake. It was the invention of the null reference in 1965… This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years”.

With a top 10 spot at 70% of the production environments that we looked at, NPEs take the first place. At Takipi, we actually have a special alert that lets us know whenever a new NullPointerException is introduced on our system, this is how you can set it up yourself.

Takipi NPE Monster

2. NumberFormatException – 55% of Production Environments

In at #2 is the NumberFormatException which happens when you try to convert a string to a numeric value and the String is not formatted correctly. It extends IllegalArgumentException which also makes an appearance here at #3.

One easy fix is to make sure that the input you’re passing to the parse method matches one of these regular expressions:

  1. For integer values: “-?\\d+”
  2. For float values: “-?\\d+\\.\\d+”
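Here’s a minimal sketch of that idea in practice – a hypothetical helper that validates the string before calling parse, so malformed input is handled explicitly instead of surfacing as a NumberFormatException:

public class SafeParse {
    // Returns the parsed integer, or the supplied default if the input
    // isn't a well-formed integer string.
    public static int parseIntOrDefault(String input, int defaultValue) {
        if (input != null && input.matches("-?\\d+")) {
            return Integer.parseInt(input);
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        System.out.println(parseIntOrDefault("42", 0));  // 42
        System.out.println(parseIntOrDefault("4.2", 0)); // 0 – not an integer
    }
}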

3. IllegalArgumentException – 50% of Production Environments

Next up at #3, IllegalArgumentException, appearing at the top 10 exceptions in 50% of the production environments in this survey.

An IllegalArgumentException actually saves you from trouble; it’s thrown when you pass an unexpected argument to a method – for example, a method expects type X and you’re calling it with type Y as an argument. Once again, this is an error that’s caused by not checking what you’re sending out as input to other methods.
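Here’s a minimal, hypothetical sketch of the general pattern – validating an argument at the method boundary and failing fast with IllegalArgumentException (the class and field names are illustrative only):

public class HttpClientConfig {
    private long timeoutMillis;

    public void setTimeoutMillis(long timeoutMillis) {
        // Reject bad input at the boundary instead of letting it propagate
        if (timeoutMillis < 0) {
            throw new IllegalArgumentException(
                    "timeoutMillis must be non-negative, got: " + timeoutMillis);
        }
        this.timeoutMillis = timeoutMillis;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }
}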

IllegalArgumentException Takipi Monster

4. RuntimeException – 23% of Production Environments

All exception objects in the top 10 list (apart from Exception) are unchecked and extend RuntimeException. However, at #4 we’re facing a “pure” RuntimeException, which the Java language itself never actually throws. So what’s going on here?

There are 2 main use cases to explicitly throw a RuntimeException from your code:

  1. Throwing a new “generic” unchecked exception
  2. Rethrows:
    • “Wrapping” a general unchecked exception around another exception that extends RuntimeException
    • Making a checked exception unchecked

One famous story around checked vs. unchecked, and the last use case we described here, comes from Amazon’s AWS SDK, which ONLY throws unchecked exceptions and refuses to use checked ones.
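Here’s a minimal sketch of that second use case – a checked IOException caught and rethrown wrapped in a RuntimeException, so callers aren’t forced to declare or catch it (the class name and path are hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ConfigLoader {
    public static String readConfig(String path) {
        try {
            return new String(Files.readAllBytes(Paths.get(path)));
        } catch (IOException e) {
            // The original checked exception is kept as the cause
            throw new RuntimeException("Failed to read config file: " + path, e);
        }
    }
}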

Takipi RuntimeExceptionMonster

5. IllegalStateException – 22% of Production Environments

In at #5, featured in the top 10 exceptions of 22% of the over 1,000 applications covered in this post, is the IllegalStateException.

An IllegalStateException is thrown when you’re trying to use a method in an inappropriate time, like… this scene with Ted and Robin in the first episode of How I Met Your Mother.

A more realistic Java example would be using a URLConnection and trying to change its settings while assuming you’re not connected yet – when in fact you are – and getting “IllegalStateException: Already connected”.
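A minimal sketch of that case – once connect() has been called, changing request settings throws the exception:

import java.net.URL;
import java.net.URLConnection;

public class AlreadyConnectedExample {
    public static void main(String[] args) throws Exception {
        URLConnection connection = new URL("http://example.com").openConnection();
        connection.connect();
        // Throws java.lang.IllegalStateException: Already connected
        connection.setRequestProperty("Accept", "application/json");
    }
}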

6. NoSuchMethodException – 16% of Production Environments

Such Method, Much Confusion. 16% of the production environments in this data crunch had NoSuchMethodException in their top 10.

Since most of us don’t write code while drunk, at least during daytime, this doesn’t necessarily mean that we’re delirious enough to think we’re seeing something that’s not there – in that case the compiler would have caught it much earlier in the process.

This exception is thrown when you’re trying to use a method that doesn’t exist, which happens when you’re using reflection and getting the method name from some variable, or when you’re building against one version of a class and using a different one in production (thanks @braxuss).
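Here’s a minimal sketch of the reflection case – the method name comes from a variable, so a typo (or a class version mismatch) only fails at runtime, where the compiler can’t help:

public class ReflectionLookup {
    public static void main(String[] args) {
        String methodName = "lenght"; // typo for "length"
        try {
            String.class.getMethod(methodName);
        } catch (NoSuchMethodException e) {
            System.err.println("No such method on String: " + methodName);
        }
    }
}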

7. ClassCastException – 15% of Production Environments

A ClassCastException occurs when we’re trying to cast a class to another class of which it is not an instance. 15% of production environments have it in their top 10 exceptions, quite troublesome.

The rule is that you can’t cast an object to a different class which it doesn’t inherit from. Nature did it once, when no one was looking, and that’s how we got the… Java mouse-deer. Yep, that’s a real creature.
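A minimal sketch of the rule in action – the stored object is really an Integer, so casting it to String blows up at runtime:

import java.util.ArrayList;
import java.util.List;

public class CastExample {
    public static void main(String[] args) {
        List<Object> values = new ArrayList<>();
        values.add(42);
        // Integer and String share no inheritance relationship,
        // so this cast throws ClassCastException
        String s = (String) values.get(0);
        System.out.println(s);
    }
}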

8. Exception – 15% of Production Environments

In at #8 is the mother of all exceptions, Exception, DUN DUN DUUUUN (grandmother is Throwable).

Java never throws plain Exceptions, so this is another case like RuntimeException where it must be… you, or 3rd party code, that throws it explicitly because:

  1. You need an exception and you’re just too lazy to specify what it actually is.
  2. Or… more specifically, you need a checked exception to be thrown for some reason.

9. ParseException – 13% of Production Environments

Parsing errors strike again! Whenever we’re passing a string to be parsed into something else, and it’s not formatted the way it’s supposed to be, we’re hit by a ParseException. Bummer.

It’s more common than you might have thought, with 13% of the production environments tested in this post featuring this exception in their top 10.

The solution is… yet again, check yo’ self.
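For example, here’s a minimal sketch of what that looks like with SimpleDateFormat, one of the usual suspects – parse() throws a checked ParseException when the input doesn’t match the expected pattern:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateParsing {
    public static void main(String[] args) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        try {
            Date date = format.parse("not a date"); // doesn't match the pattern
            System.out.println(date);
        } catch (ParseException e) {
            System.err.println("Unparseable date: " + e.getMessage());
        }
    }
}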

10. InvocationTargetException – 13% of Production Environments

Another exception that’s thrown at us from the world of Java reflection is the InvocationTargetException. This one is actually a wrapper: if something goes wrong in an invoked method, that exception is wrapped with an InvocationTargetException.

To get the original exception, you’d have to use the getTargetException method.
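Here’s a minimal sketch – the invoked method throws, reflection wraps the exception, and getTargetException() recovers the original:

import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class InvocationExample {
    public static void explode() {
        throw new IllegalStateException("boom");
    }

    public static void main(String[] args) throws Exception {
        Method method = InvocationExample.class.getMethod("explode");
        try {
            method.invoke(null); // static method, so no target instance needed
        } catch (InvocationTargetException e) {
            System.err.println("Original exception: " + e.getTargetException());
        }
    }
}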

We see that 13% of the production environments tested in this post had it in their list of top 10 exceptions. It’s the second exception type here that’s directly related to Java’s reflection features.

Final Thoughts

The world of Java exceptions is indeed quite colorful, and it’s amazing to see how much impact the top 10 exceptions have on our logs. 97% of all logged errors come from 10 unique exceptions.

Try Takipi and find out what the top 10 exceptions are in your own production environment. It only takes a few minutes to get started, and you’ll also get all the data you need in order to fix them. Source, Stack, State.

Introducing Takipi’s Log View: Source, Stack, State and the Logs that Really Matter


Takipi Log View

Meet MDC and Log View and how they add context to solving errors in production

Takipi is used to troubleshoot production errors and pinpoint their root cause down to the variable values that caused them. To provide even more context to the application’s state at the moment of error, one of the biggest requests we’ve received was the ability to see the thread-local state of the transaction.

In this post, I’d like to share the story behind 2 new features that originated from user feedback, and how they push the boundaries on what you can expect from solving Java application errors in production.

Let’s start with MDC

MDC, short for Mapped Diagnostic Context, is the thread-local state that Java logging frameworks such as slf4j and logback associate with an executing thread. This information is loaded by the application into the log framework at the beginning of a transaction, and emitted by the log framework into the log file to describe the application / business context in which the code was executing.

This can be a customer ID, username or unique transaction identifier maintained across a distributed processing chain.

When troubleshooting code, having access to this context is paramount – it’s with this information you can better understand why (and for whom) the code was executing, or search across your log files for statements with that same identifier to see the full story that befell a failed transaction across multiple machines or services.
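As a point of reference, here’s a minimal sketch of how an application typically loads this state with slf4j’s MDC at the start of a transaction – the keys, values and class are hypothetical, and a layout pattern such as %X{transactionId} would emit them with every log statement on this thread:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderHandler {
    private static final Logger logger = LoggerFactory.getLogger(OrderHandler.class);

    public void handle(String userId, String transactionId) {
        MDC.put("userId", userId);
        MDC.put("transactionId", transactionId);
        try {
            logger.info("Processing order"); // MDC values ride along with this line
        } finally {
            MDC.clear(); // don't leak context to the next task on this thread
        }
    }
}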

The good news is Takipi now captures the full MDC state at the moment of error for every source, stack and state analysis. This is powerful as it provides you with the full context in which code failed. You can see an example of MDC state in the following screenshot from our actual production environment – it holds a wealth of information related to the executing code, which is now a part of every Takipi data capture.

The Takipi error analysis variable table

Going beyond MDC (and into something very cool)

Once we added the MDC, a sneaky thought crept into our minds – why capture just the log MDC? Why not capture the log statements themselves?

So this is exactly what we’ve just added – the ability to see the last 250 log statements within the thread leading up to an error. The cool thing about it is we don’t capture those statements from the log file, but directly in-memory as they are logged by the code.

log-view
Takipi’s new Log View: The last 250 log statements leading to an error

What can you do with Log View?

1. DEBUG and TRACE in production

Using log view you can see DEBUG and TRACE statements leading to an error, even if they were not written into the log file in production. This is because we capture log statements as they happen inside the application in real-time, without being dependent on whether or not the log verbosity level allows for them to be persisted.

Since these statements are only captured from memory when Takipi takes a snapshot, without relying on log files, there is no additional overhead to the size of your logs. You can have your cake and eat it too! 🙂

This is a huge win for devs, as this information is almost never accessible in production and is beneficial in troubleshooting issues. Getting to it usually requires turning verbosity on and recreating the issue to get to those statements, lengthening the resolution process and requiring involvement from Ops, and at times even the customer.

2. Access logs directly from JIRA, Slack and Hipchat

As log statements are now a part of any state captured by Takipi, alerts about new errors become much more powerful as they not only include the source, stack and state, but also the log data most relevant to the error. So once you receive an alert about a new error that was just deployed, it will provide you with the most relevant log data (and MDC) without having to pull and grep for the log data out of the production system.

3. Focus on the right log events

We capture the log statements related to the thread in which the error happened (vs. statements from the 300+ threads you may have running concurrently), so you can focus immediately on the data relevant to the error.

Notice how the beginning of the transaction is marked, as well as the existence of DEBUG statements

Takipi automatically highlights the log statement in which the transaction started, so you can immediately know which log statements are relevant to the transaction vs. ones that are pure noise.

4. Diskless logging

As log statements are persisted as part of the snapshot Takipi captures, without reliance on physical logs, you’re not dependent on having access to the log files. This is very powerful in elastic environments, in which by the time you’ve been made aware of an error the log files may have already been lost because the machine was taken down.

You can access snapshots for up to 90 days in our SaaS version, or for even longer periods of time in our on-premises version. As these are periodic snapshots, they can be retained for long periods without straining your storage infrastructure. You’re not limited to a short log retention policy before logs are recycled.

Final Thoughts

We’re super excited about this feature and would love for you to take a look at it and tell us what you think in the comments section below.

Introducing Takipi’s Dashboard Views: See the Log Events That Matter


Takipi

New Takipi feature: Public and private views

Yogi Berra once said – “You can observe a lot by watching”, and what’s true for Major League Baseball is more than certainly true for DevOps nowadays. Our main challenge in building and operating complex applications is to watch the right things – be it something that we broke in a recent release, or a web service that’s no longer playing well with our application and causing us some major heartburn.

Takipi’s goal is to help you see the cause behind issues that impact your application in production, with the full source code and variable state leading to them, as quickly as possible. The challenge, with applications firing thousands or millions of errors over the course of a day, is to distill and classify them into meaningful numbers. Things become especially challenging when classifying these errors requires parsing them out of our log files, which takes a tremendous amount of work and configuration to deduplicate into something you can consume.

Real-time log deduplication

To overcome this, Takipi uses JVM micro-agents to detect and deduplicate all errors and exceptions as the code executes in real-time, instead of parsing them from logs as an infinite stream of unstructured log messages. This has the effect of reducing millions of events into dozens of contextual, code-related analytics that can be easily visualized to see exactly which errors are impacting your applications at any time.

Reducing over 300,000 errors from our production app over the last day into 50 discrete analytics, and visualizing the impact of the top 5 errors without parsing one log line

Creating your Own Public / Private Views

But if that’s not neat enough, you can now go one step further and use the dashboard’s filters system to save, reuse and share views. In essence this enables you to ask the dashboard any question, such as show me errors only from specific code transactions or locations, or target specific types of log errors or exceptions, and visualize them directly. This creates a powerful way to isolate and inspect any aspect of your environment to know whether they’re functioning correctly or breaking.

You can filter based on almost any aspect – code transaction and location, content, type, applications impacted and more

A few examples of views you can easily define:

  • See what broke today in app X – you can create a view to show new errors introduced over the last day in application X by limiting the built-in “New today” view to a specific application and save it as a new view.
  • Whack all the NullPointers – another easy view to setup is filter the error type to NullPointerExceptions and setup an alert whenever one is introduced, so you can fix it right away. These little moles can have a huge impact on both the stability of your application and the size of your logs.
  • Visualize business critical errors – you can choose to focus on a set of specific log errors (e.g. “Failed to complete transaction”, “Salesforce error”, …) within the context of your app and save that as a view to visualize and be alerted on.
Creating a view for any error or exception containing the word “Salesforce”, without parsing logs

A view can be set to be private only to you, or shared and visible between all teammates accessing that environment. Once you’ve set up a view, you can also receive alerts via Slack, PagerDuty and more whenever a new error starts, or the volume of errors exceeds a target threshold.

Saving a filter is easy – simply filter the events list to show the events you’re interested in over the desired timeframe, select whether to share this view and hit “Save view”

Click here to learn more about Dashboard views, or add your questions and comments in the section below – we’d love to hear them! 🙂

The OverOps Story: It’s the End of Log Files as We Know It


overops_story

One click and one second to root cause analysis. See how it all started.

OverOps was built to help developers know when, where and why code breaks in production. After an amazing journey with 300% growth in new customers and a brand new name, it’s time to see how it all started – from the basic idea up to the OverOps we know today.

Pains and Challenges of Scaling

OverOps was founded in 2011, based on an idea that originated from the team’s first company, VisualTao (acquired by Autodesk in 2009).

VisualTao enabled designers and engineers to edit, share, and collaborate over 2D and 3D designs. After its acquisition, VisualTao was relaunched as AutoCAD web and mobile, which became the biggest launch in well over a decade for Autodesk. It was the company’s flagship $1.3B product line, servicing over 20M professional designers and engineers worldwide.

AutoCAD used the best APM and log analyzers in the market in order to find errors and exceptions in its own app. Despite having those tools, the team still had to sift through mass amounts of log files whenever the application broke, in search of an answer as to why it broke.

Through the pains and challenges of scaling AutoCAD, the OverOps team knew what they were looking for when facing an error in production.

Making Logs Better

APM tools are great at telling you when web pages render or mobile apps respond slowly, but they don’t show you the actual root cause. When an application breaks, in 9 out of 10 issues, it’s the logs that give dev and ops the actual root cause of an error.

The inherent problem with logs is that they contain millions or billions of text events, and require you to search through them. Logs assume that you know what to look for, and that it’s in fact in the log. It’s a highly reactive process.

Errors might lie dormant in log files for weeks, impact the user’s experience and only surface after the damage is done.

That’s why the team wasn’t looking for a better log analyzer, but for better logs. Instead of getting a massive amount of information in one giant file, we wanted a one-click zoom-in view that includes the actual cause of any error.

Zooming in would provide the complete source code and variable state that caused each error. Zooming out would give a broader look that includes data about when new errors were introduced, and when critical ones increase. The team wanted to enable this without having to parse and regex through mass amounts of unstructured log data.

Some developers considered it science fiction.

Building OverOps

Most tools nowadays are parsing issues out of TBs of log files downstream, where the intelligence of the code has been lost, and try to reconstruct them from text. OverOps uses a completely different technology to detect and analyze issues in real-time within the application.

OverOps uses a micro-agent that operates between the software VM (e.g. the JVM or CLR) and the processor. That gives it two “superpowers”:

  • It can see all events in the application, regardless of whether they originate from the application code, 3rd party or JVM code
  • Dev and ops can react 10X quicker than through classic techniques such as bytecode instrumentation or logging

This enables OverOps to “fingerprint” any event as it occurs in real-time in production. You can know exactly whether or not it’s new, when it was introduced, how often it’s been happening, and out of how many calls into the code.

Meet the Micro-Agent

When the micro-agent detects a critical event, it zooms in and captures the actual source code that was executing within the JVM, and the complete variable state across the entire call stack (even DEBUG log statements that don’t appear in the logs). That information provides you with the full root cause analysis needed to fix an issue, without having to spend hours or days reproducing it.

This information has literally 100X more variable state data than you would get in a log file, without having to change and redeploy code to capture more state. The agents operate at under 3% CPU.

The tool alerts developers on new or increasing errors within a second of them happening in production. The alerts can be sent out directly through Slack, HipChat, PagerDuty or JIRA, and include a link directly to the complete source code, variables and debug-level log statements.

This enables customers to move from issues lying dormant for days or weeks and then erupting, to detecting them in seconds. The dashboard gives developers the information they need to fix these errors in one click, without having to go through the back-and-forth with ops to get logs and reproduce an error. One second, one click – it’s that easy.

OverOps runs as SaaS, Hybrid or fully on-premises, and requires no changes to code or build.

We have 150+ customers such as Amdocs, TripAdvisor, RMS and others.

If this sounds interesting to you, sign up for a free 14-day trial.
