Cassandra Garbage Collector Tuning, Find and Fix long GC Pauses

Cassandra garbage collector tuning has been a hot topic of discussion on the Cassandra user mailing list from time to time. Users and administrators need to gain some skill in tuning the garbage collector for their Cassandra nodes so that they can avoid long stop the world pauses that can bring their cluster to a halt. Although the out of the box garbage collector settings work for most use cases, as the data volume and request traffic per Cassandra node increase, so does the need for garbage collector tuning. In this article I’ll explain how you can analyze your Java Virtual Machine and optimize its heap size and settings for your use case. Some of the techniques described here can be applied to tuning any application running on the JVM, but the majority of them only apply to the way Cassandra works.

Background

Since AWS launched the beefy hi1.4xlarge instances with SSDs, and Netflix published a great benchmark of Cassandra performance on them, these instances have become very popular in the Cassandra community, and I decided to shrink our Cassandra cluster to fewer nodes to utilize their power. The initial 24 node cluster was pretty much running with the vanilla GC settings that ship inside cassandra-env.sh. Because the 6 remaining nodes had to handle 4 times as much workload, Cassandra garbage collector tuning became important to me, and I had to learn and change some settings so that the nodes could utilize their beefed up hardware. These changes included several JVM settings, and in the end I had to increase the JVM heap from 8Gb to a mighty 24Gb to maintain our SLAs and keep 95th percentile latencies below 200ms.

Problem

Your clients are timing out for a period of time. If you are lucky, this period is no longer than a few hundred milliseconds; however, by the time people usually complain, it has gone beyond seconds, which is really bad behavior for the database layer. The only assumption I am making here is that these timeouts are due to garbage collector stop the world pauses and not other issues such as CPU or I/O saturation. In my case I was able to observe a handful of multi-second stop the world GCs during a day of load testing. Hence Cassandra garbage collector tuning became a new topic of interest to me.

When are Garbage Collector Stop the World pauses the culprit?

Garbage collectors like Concurrent Mark and Sweep (CMS) and Parallel New (ParNew) will always have stop the world pauses. Usually the pause is small enough that it is not noticed by the application. However, the pause can be long enough to make the application unresponsive. As a first step, we need to implement adequate logging and monitoring for our system, because you want to be able to reference the history of events when issues occur; watching metrics with tools that only show the current state of the application may not be sufficient. Java provides a nice facility that can log very detailed information about heap statistics and garbage collector performance. Out of the box, Cassandra ships with a set of GC logging parameters that are disabled by default in cassandra-env.sh. I highly recommend enabling those on your production clusters. In addition to the default logging options, I recommend enabling these flags as well:

JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogramBeforeFullGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogramAfterFullGC"

The class histograms in particular are very helpful: when you hit a long pause, you want to see which objects are consuming most of the heap, and the histograms give you exactly that answer.

There are other tools you can use to monitor the heap space, such as jstat, and I recommend using them. If you are fancy, you can parse, collect, and plot the data from these tools and logs in your monitoring application. In my case, I plot both the stop the world GC times from the logs, which are the most accurate stop the world metric, as well as how the generations inside the heap are utilized, taken from jstat:

jstat -gcutil `cat /var/run/cassandra/cassandra.pid`
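
As run above, this prints a single snapshot. jstat can also sample continuously, which is handy if you want to feed the numbers into a collector; for example (the interval and sample count here are arbitrary choices):

# sample GC utilization every 10 seconds for an hour (360 samples)
jstat -gcutil `cat /var/run/cassandra/cassandra.pid` 10s 360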

Cassandra Heap Utilization | Stop The World GC Times

In the top graph we see that a little before 16:00 there is a sharp cliff in the heap space used. This means that the CMS collector kicked in. You can see that it kicked in at around 6Gb of heap usage, which is 75% of the total 8Gb. You also see another, deeper cliff before 21:00. The 75% comes from the GC flag set inside cassandra-env.sh as CMSInitiatingOccupancyFraction=75, which tells the JVM to start a CMS cycle once the old generation is at least 75% full. This process doesn’t always take too long. As you see in the bottom graph, only the first cliff caused a stop the world GC of about 15 seconds, which is terrible, while the second one didn’t, even though it freed up more space. Long pauses are usually caused by full GCs, and full GCs happen when CMS cannot collect enough objects concurrently and falls back to a stop the world full collection that pauses the application and scans all objects in the heap. I’ll talk more about this later in the heap crash-course.
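
For reference, the CMS-related flags live in cassandra-env.sh and in a stock install look roughly like the following (verify against your own file before changing anything):

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"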

No graphs? Don’t worry, the logs have this information

If you don’t have fancy graphing tools, or you forgot to turn on the GC logging options in cassandra-env.sh, Cassandra’s system.log will log every stop the world GC that took more than 250ms, as long as you did not reduce the logging severity below INFO in log4j-server.xml. You will see log lines like this:

INFO [ScheduledTasks:1] 2013-08-09 15:54:41,380 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 15132 ms for 2 collections, 4229845696 used; max is 25125584896

In the GC log you will find something like this:

Total time for which application threads were stopped: 15.1324200 seconds

Congratulations, you have just found out that you have a problem with super long GC pauses. Hence the rest of this article becomes relevant to you.

Understanding Java Garbage Collection

I highly recommend reading Understanding Java Garbage Collection, which is really well written in my opinion, as I am not going to duplicate that great article here. It will help you understand how the JVM heap is set up and how the different GC algorithms work. Then head back here and I will walk you through scenarios that can cause Cassandra to have long GC pauses.

Cassandra Garbage Collector Tuning

The basic takeaway from Understanding Java Garbage Collection is that the garbage collection process works best under two assumptions: most objects become unreachable soon after they are created, and only a small number of references exist between the old generation and the new generation. If these two assumptions are violated, we end up with slow GC performance. So, for Cassandra, we think about how objects end up being long lived in the heap, leading to this scenario. It is more efficient for objects to expire in the young generation than to get promoted to the old generation and cleaned up there by a full GC cycle. There are usually two scenarios in which the GC algorithms fail and fall back to a full stop the world GC, which is what causes the super long pauses:

Premature Tenuring leading to ParNew Promotion Failures

Premature promotion is when objects do not stay in the young generation long enough to expire there. The ParNew cycles end up promoting them to the old generation. The more objects get promoted to the old generation like this, the more fragmented the old generation gets. As a result, at some point there isn’t enough room in the old generation for newly promoted objects, and ParNew fails with a promotion failure.

The following graph shows the percentage utilization of the different generations inside the heap space. The blue saw-toothed line is the Eden space. On average 51% of Eden is utilized, which is normal. But if you look at the survivor space utilization, you notice the survivors are only about 5% utilized on average. There is a cliff in the old space utilization before 16:00 which correlates with our slow GC. In other words, although the survivor spaces still have plenty of room, the old generation keeps filling up before they do. This means there is an opportunity to tune the young generation, as we have premature tenuring. Heavily loaded servers, especially ones serving thousands of read requests, can easily fall into this scenario.

Heap Generation Utilization

This can be observed from the GC logs as well:

3109400.009: [ParNew (0: promotion failure size = 9564) (2: promotion failure size = 180075) (3: promotion failure size = 8194) (5: promotion failure size = 8194) (8: promotion failure size = 180075) (9: promotion failure size = 8194) (promotion failed)
Desired survivor size 41943040 bytes, new threshold 1 (max 1)
- age 1: 78953120 bytes, 78953120 total
: 736942K->737280K(737280K), 0.4781500 secs] 5904031K->5961105K(8306688K)After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 254740045
Max Chunk Size: 5168
Number of Blocks: 180395
Av. Block Size: 1412
Tree Height: 183
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 283136
Max Chunk Size: 283136
Number of Blocks: 1
Av. Block Size: 283136
Tree Height: 1
, 0.4798500 secs] [Times: user=1.13 sys=0.00, real=0.48 secs]
Heap after GC invocations=152713 (full 89):
par new generation total 737280K, used 737280K [0x00000005fae00000, 0x000000062ce00000, 0x000000062ce00000)
eden space 655360K, 100% used [0x00000005fae00000, 0x0000000622e00000, 0x0000000622e00000)
from space 81920K, 100% used [0x0000000622e00000, 0x0000000627e00000, 0x0000000627e00000)
to space 81920K, 99% used [0x0000000627e00000, 0x000000062cdab9f8, 0x000000062ce00000)
concurrent mark-sweep generation total 7569408K, used 5223825K [0x000000062ce00000, 0x00000007fae00000, 0x00000007fae00000)
concurrent-mark-sweep perm gen total 60640K, used 36329K [0x00000007fae00000, 0x00000007fe938000, 0x0000000800000000)
}
GC locker: Trying a full collection because scavenge failed

CMS Concurrent Mode Failure

There are several reasons for CMS to report a concurrent mode failure. Read Understanding CMS GC logs to familiarize yourself with all of them. The most common one is when the application is allocating large objects and CMS gets interrupted in the middle of its work because the tenured generation fills up faster than it is cleaned up. A few bad designs and uses of Cassandra may lead to this. If you have a concurrent mode failure, it will be logged in your GC log:

2013-08-03T12:31:23.696+0000: 1245799.608: [CMS-concurrent-sweep: 4.524/6.539 secs] [Times: user=10.08 sys=0.19, real=6.54 secs]
(concurrent mode failure)CMS: Large block 0x00000006992e3d78

The large block specifically is a sign that there is a large object in the heap.

Cassandra Heap Pressure Scenarios

There are several cases in which you will have heap pressure. I have personally experienced all of them. In this section I will describe each scenario, in order of how frequently Cassandra users report them. It is advised that you troubleshoot and get to the root cause before trying to implement solutions in a trial and error fashion. The following assumptions are made based on the default Heap and GC settings in cassandra-env.sh (a quick sketch of this arithmetic follows the list):

  • Maximum heap size is 8Gb;
  • Young generation (NewGen) is going to be calculated by cassandra-env.sh to be 1/4 of the heap;
    • Therefore, the young generation size will be 2Gb;
  • SurvivorRatio is 8 which makes the ratio of each survivor space to eden to be 1:8. Hence each survivor space will be 1/10 of the NewGen which will be 200Mb;
    • This concludes Eden space to be 1600Mb;
  • TenuringThreshold is 1 which means after one young generation GC cycle, objects will be promoted to tenured space;
  • Memtables are using 1/3 of the heap which is 2.6Gb;
  • Row Cache is off; or if it is enabled, you will be using SerializingCacheProvider which stores information outside of heap;
  • Key Cache is using 100Mb;
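
To make the arithmetic above concrete, here is a minimal sketch of how those numbers fall out, assuming the 8Gb heap and the 1/4 new generation rule. The real cassandra-env.sh derives its values from the machine’s memory and core count, so this is illustrative only:

# illustrative only -- mirrors the assumptions above, not the exact cassandra-env.sh logic
MAX_HEAP_MB=8192                           # 8Gb total heap
NEW_GEN_MB=$((MAX_HEAP_MB / 4))            # 2048Mb young generation
SURVIVOR_MB=$((NEW_GEN_MB / 10))           # SurvivorRatio=8 -> eden:survivor:survivor = 8:1:1, ~204Mb each
EDEN_MB=$((NEW_GEN_MB - 2 * SURVIVOR_MB))  # ~1640Mb eden (the list above rounds to 1600Mb)
OLD_GEN_MB=$((MAX_HEAP_MB - NEW_GEN_MB))   # 6144Mb old generation
echo "new=${NEW_GEN_MB}Mb eden=${EDEN_MB}Mb survivor=${SURVIVOR_MB}Mb old=${OLD_GEN_MB}Mb"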

Aggressive Compactions

By default Cassandra will run one compaction thread per available CPU core. During the compaction process, Cassandra reads each row, one at a time, from all the SSTables it belongs to, cleans up tombstones that have expired beyond gc_grace_seconds, sorts the columns, and writes the row back into a new SSTable. It does all of this in memory, up to the value of in_memory_compaction_limit_in_mb in cassandra.yaml, which is 64Mb by default. Now, if your NewGen size is 2Gb on a 16 core machine, you can have up to 64Mb x 16 = 1Gb of heap filled up with columns from compaction threads. Depending on how your compaction is throttled, these objects can live longer in the heap and quickly get promoted to the old space, causing it to fill up. If you are running a high CPU machine with the default heap settings, this can easily hit you. It is hard to absolutely pin the problem on compaction, but with instrumentation in the Cassandra code you can reach that conclusion.

Let’s say a slow stop the world GC has happened and you have a histogram of objects inside the heap from GC logs:

Histogram Before GC:

num #instances #bytes class name
----------------------------------------------
1: 13760719 4137773696 [B
2: 7745592 371788416 java.nio.HeapByteBuffer
3: 445843 149064744 [J
4: 2237102 139165560 [Ljava.lang.Object;
5: 3709467 118702944 org.apache.cassandra.db.Column
6: 1225225 117621600 edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch
7: 1616600 77596800 edu.stanford.ppl.concurrent.SnapTreeMap$Node
8: 1519971 67141976 [I
9: 1576078 63043120 java.math.BigInteger
10: 2528181 60676344 java.lang.Long
11: 2494975 59879400 java.util.concurrent.ConcurrentSkipListMap$Node
12: 1235423 59300304 edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder
13: 1938545 46525080 java.util.ArrayList
14: 1912334 45896016 java.lang.Double
15: 1225225 39207200 edu.stanford.ppl.concurrent.CopyOnWriteManager$Latch
16: 13951 37558032 [[B
17: 912106 36484240 java.util.TreeMap$Entry
18: 1139680 36469760 java.util.ArrayList$Itr
19: 222470 35978688 [C

Histogram after GC:

num #instances #bytes class name
----------------------------------------------
1: 7728112 1142432184 [B
2: 48827 100400768 [J
3: 1072172 51464256 java.nio.HeapByteBuffer
4: 8801 30603624 [[B
5: 425428 20420544 edu.stanford.ppl.concurrent.SnapTreeMap$Node
6: 483982 19359280 java.util.TreeMap$Entry
7: 104987 14104448 [C
8: 523857 12572568 java.lang.Double
9: 250248 10009920 org.apache.cassandra.db.ExpiringColumn
10: 50021 7251040 <constMethodKlass>
11: 221730 7095360 java.util.HashMap$Entry
12: 50021 6814168 <methodKlass>
13: 5086 5767408 <constantPoolKlass>
14: 49322 4734912 edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch
15: 140638 4500416 org.apache.cassandra.db.Column
16: 86506 4170536 [I
17: 5086 3636632 <instanceKlassKlass>
18: 4532 3520640 <constantPoolCacheKlass>
19: 31370 3396504 [Ljava.util.HashMap$Entry;
20: 211417 3382672 java.lang.Integer
21: 77307 3092280 java.math.BigInteger
22: 119683 2872392 java.util.concurrent.ConcurrentSkipListMap$Node
23: 53716 2578368 edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder
24: 5819 2574720 <methodDataKlass>

The histograms are much longer; I am pasting the top part, which is usually the most useful. You can see that before the GC ran, the heap had 113Mb of Column objects. That seems like a large number for an entity that should rapidly go in and out of the heap. Columns are the bare bones of data storage in the Cassandra code, so there could be many sources:

  • Wide row reads with lots of columns (more on this later);
  • Compaction reading rows to compact;
  • Repair (Validation Compaction) reading columns to form Merkle Tree;

First you want to rule out the wide rows, which are usually an application issue. Then you can figure out if it is compaction. If you have coding skills, you can add instrumentation to sample Column object instantiations and see which process is creating so many Column objects:

INFO [CompactionExecutor:630] 2013-08-21 12:16:00,010 CassandraDaemon.java (line 471) Allocated a column
java.lang.Exception: Sample stacktrace
at org.apache.cassandra.service.CassandraDaemon$3.sample(CassandraDaemon.java:471)
at org.apache.cassandra.service.CassandraDaemon$3.sample(CassandraDaemon.java:468)
at com.google.monitoring.runtime.instrumentation.ConstructorInstrumenter.invokeSamplers(Unknown Source)
at org.apache.cassandra.db.Column.<init>(Column.java:78)
at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:109)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:92)
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumnsFromSSTable(ColumnFamilySerializer.java:149)
at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:234)
at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:114)
at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:98)
at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:160)
at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:76)
at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:57)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:114)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:97)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:145)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:211)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

In this case we saw tons of these stack traces, which shows that compactions were creating lots of columns.

Once you have confirmed that compaction is the issue, try throttling it down, and on a high CPU machine even reduce concurrent_compactors from the default to something like half the number of cores.
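
Both knobs live in cassandra.yaml. On a hypothetical 16 core box the result might look like the lines below; the exact numbers are workload dependent, so treat them as a starting point rather than a recommendation:

# cassandra.yaml -- illustrative values, tune for your own workload
compaction_throughput_mb_per_sec: 16   # total compaction I/O throttle; lower it to throttle harder
concurrent_compactors: 8               # roughly half the cores on a 16 core machine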

Reading Large Slices of Wide Rows

Some people criticize wide rows because they introduce a hotspot into the system. This is true if the key is very active. Still, for implementing global secondary indexes for the purpose of pagination, wide rows can be a simple design; they just need to be handled with care. If your application reads hundreds of columns from a single wide row, it will fill up the new generation fast, causing promotion failures and sometimes concurrent mode failures, leading to super long GC pauses. The first step to figure out if you have this problem is to look at your column family statistics with the ‘nodetool cfstats’ command and check whether there is a column family whose maximum compacted row size is above 10Mb:

Column Family: MySexyCF
SSTable count: 15
SSTables in each level: [1, 4, 10, 0, 0, 0, 0, 0, 0]
Space used (live): 3787762351
Space used (total): 3802583362
SSTable Compression Ratio: 0.34535868204500436
Number of Keys (estimate): 3508992
Memtable Columns Count: 3229
Memtable Data Size: 1506100
Memtable Switch Count: 30
Read Count: 12124
Read Latency: NaN ms.
Write Count: 811644
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 53
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 2158960
Compacted row minimum size: 150
Compacted row maximum size: 223875792
Compacted row mean size: 3103
Average live cells per slice (last five minutes): 51.0
Average tombstones per slice (last five minutes): 0.0

In this case, this example CF could be the source of a long row: the compacted row maximum size is 223Mb. If a client attempts to read this row in its entirety, there will definitely be a GC hiccup. The last two lines provide useful information about how your clients are reading from this column family. Average live cells per slice tells us that queries are usually reading 51 columns at a time, which is good. If this number is above a hundred, then depending on the size of the data stored inside your columns, you may have this problem and need to change your application to read a smaller number of columns at a time and paginate through the row.
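
If you are on CQL3 (Cassandra 1.2+), a paginated read of such a row could look roughly like the sketch below, assuming a hypothetical table with a partition key and a clustering column; Thrift-based clients have equivalent slice pagination options using a start column and a count:

-- illustrative sketch: page through a wide row 100 cells at a time,
-- resuming from the last column seen on the previous page
SELECT col_name, col_value
  FROM my_ks.my_wide_cf
 WHERE partition_key = 'index-bucket-42'
   AND col_name > 'last-column-from-previous-page'
 LIMIT 100;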

Rows with lots of Tombstones

When you have rows with dynamic columns and your application deletes those columns, Cassandra places a tombstone on each deleted column until the next compaction runs and cleans up tombstones that have expired past gc_grace_seconds. When your client does a slice query to read a few columns, even if the majority of the columns were previously deleted, Cassandra has to read and scan all those tombstones in order to find as many non-deleted columns as you requested in your slice query. This also causes major GC pauses, especially when the number of tombstones is in the hundreds. You can troubleshoot this by looking at ‘nodetool cfstats’. Taking the example in the previous section, Average tombstones per slice gives you that information. If it is a large value, then that is most likely one source of your problem. The following snippet from the GC logs shows about 66Mb of DeletedColumn objects, which could also be a clue:

num #instances #bytes class name
----------------------------------------------
1: 13816038 2089895152 [B
2: 17368263 833676624 java.nio.HeapByteBuffer
3: 4677858 449074368 edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch
4: 7836709 376162032 edu.stanford.ppl.concurrent.SnapTreeMap$Node
5: 8167750 261368000 org.apache.cassandra.db.Column
6: 4736849 227368752 edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder
7: 4677859 149691488 edu.stanford.ppl.concurrent.CopyOnWriteManager$Latch
8: 1720376 117984464 [Ljava.lang.Object;
9: 7023535 112376560 java.util.concurrent.atomic.AtomicReference
10: 2706529 108261160 java.math.BigInteger
11: 2614011 88889152 [I
12: 482884 86773680 [J
13: 3540985 84983640 java.util.concurrent.ConcurrentSkipListMap$Node
14: 2466985 78943520 edu.stanford.ppl.concurrent.SnapTreeMap
15: 3277799 78667176 org.apache.cassandra.db.ColumnFamily
16: 2169005 69408160 org.apache.cassandra.db.DeletedColumn
17: 2586956 62086944 org.apache.cassandra.db.DecoratedKey
18: 2466980 59207520 org.apache.cassandra.db.AtomicSortedColumns$Holder
19: 1627838 52090816 java.util.ArrayList$Itr
20: 2584943 41359088 org.apache.cassandra.dht.BigIntegerToken

To clean up tombstones, you can lower gc_grace_seconds and force user defined compactions on the SSTables in question. (I plan to cover this in a separate article.)
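
As a rough illustration, the keyspace and table names below are hypothetical, and you should think through the consistency implications (deleted data resurrecting if a node misses the deletes) before lowering gc_grace_seconds in production:

ALTER TABLE my_ks.my_cf WITH gc_grace_seconds = 86400;

followed by a manual compaction of the affected column family:

nodetool compact my_ks my_cf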

Key Cache

Key caches are very useful, so sometimes people think that if they increase the portion of the heap given to the key cache, they will help their application. This is both true and false. You can increase the key cache size to the point that it occupies a significant portion of the heap. Since key cache entries can be long lived if you have hot rows, they get promoted to the old generation fast. A large key cache also limits the amount of heap available to other operations like reads, repairs, compactions, and memtables.

Problem with Key Cache Algorithm in Pre Cassandra 1.2

If you are running Cassandra 1.1 or earlier, whatever key cache size you set is effectively a lie. In this case the cluster had its key cache size set to 1Gb. The problem exposes itself quickly if you have a use case like a map-reduce job that reads a whole bunch of keys at once. Let’s look at the following histogram:

num #instances #bytes Class description
--------------------------------------------------------------------------
1: 18411450 1004061680 byte[]
2: 42400 773751784 long[]
3: 15604278 499336896 java.util.concurrent.ConcurrentHashMap$HashEntry
4: 15603972 499327104 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
5: 17395823 417499752 java.lang.Long
6: 15604024 374496576 org.apache.cassandra.cache.KeyCacheKey
7: 15603973 374495352 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
8: 2825682 135632736 java.nio.HeapByteBuffer

There are 773Mb of long arrays, over 1Gb of byte arrays, and over 15 million key cache objects, which is in line with the number of keys the cluster had. If you look at the Cassandra code, there is an assumption that keys are 48 bytes (AVERAGE_KEY_CACHE_ROW_SIZE). This assumption is used to compute how much heap the key cache consumes. You can quickly guess that if your developers have made keys longer than 48 bytes, your key cache can easily use more memory than it should, without you knowing it.

The best solution for this is to not have such a large key cache anyway. The default of 100Mb can hold a lot of keys if your keys are not longer than 48 bytes. I usually advise my developers not to make keys longer than 32 bytes. Alternatively, you can disable the key cache entirely for the column families with large keys, or upgrade to Cassandra 1.2, where this issue (CASSANDRA-4315) is fixed.
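
For reference, the global key cache size is controlled in cassandra.yaml, and on 1.2+ caching can also be adjusted per column family; the table name below is hypothetical and the exact property syntax varies between versions, so double check against the documentation for your release:

# cassandra.yaml -- keep the global key cache modest
key_cache_size_in_mb: 100

-- CQL3 (1.2+): turn the key cache off for a column family with very large keys
ALTER TABLE my_ks.big_key_cf WITH caching = 'rows_only';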

Row Cache

If you are using the row cache, and especially if you are using the old on-heap ConcurrentLinkedHashCacheProvider, row caches are stored on the heap. For the very same reasons as in the previous section, you may run into heap pressure and super long GCs. Therefore, use SerializingCacheProvider or turn off the row cache completely.
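
In the cassandra.yaml of that era the relevant settings look roughly like this (illustrative; check the comments in your own cassandra.yaml):

# cassandra.yaml -- keep the row cache off, or at least off-heap
row_cache_size_in_mb: 0                        # 0 disables the row cache
row_cache_provider: SerializingCacheProvider   # off-heap provider, if you must enable the row cache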

Memtables

If you have a write heavy workload, it is usually a good idea to increase the memtable portion of the heap, because it absorbs more writes and leads to more efficient compactions when flushed to disk. However, on a workload with more reads than writes, a large memtable space takes heap away from other purposes like serving reads. Moreover, if memtables aren’t flushed because the flush threshold is large, they have a higher chance of being promoted to the tenured space. Usually when long GCs happen and Cassandra is under heap pressure, MemoryMeter.java will output useful logs into system.log with details about how many cells were in memtables and from which column family. From that output you can decide whether memtables are wasting your heap.

In this scenario, you can reduce memtable_total_space_in_mb in cassandra.yaml. If you have a high write and read workload and you need to optimize for both, you may have to increase the heap size.
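
For example (an illustrative value, not a recommendation; the right number depends on your write volume and heap size):

# cassandra.yaml -- cap memtables at a fixed size instead of the default 1/3 of the heap
memtable_total_space_in_mb: 2048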

Heavy Read/Write Workload

If you have a mix of reads and writes on the order of a few thousand of each per node, it is usually the case that the default heap settings will not work for you. Due to the high load that is the nature of your cluster (in my case, shrinking the load of 26 nodes onto 6), the ParNew cycles will happen more frequently. This means that the objects in the young generation, which could be anything (columns being read, columns being compacted, key cache entries, memtables, etc.), will quickly get promoted to the old generation after one GC cycle. In this case the old generation fills up faster and you will potentially have more full GCs. Looking at this example:

3109400.009: [ParNew (0: promotion failure size = 9564) (2: promotion failure size = 180075) (3: promotion failure size = 8194) (5: promotion failure size = 8194) (8: promotion failure size = 180075) (9: promotion failure size = 8194) (promotion failed)
Desired survivor size 41943040 bytes, new threshold 1 (max 1)
- age 1: 78953120 bytes, 78953120 total

Notice that all objects have an age of 1. This means they were promoted after surviving only one ParNew round.

In this case, try increasing the value of MaxTenuringThreshold to something bigger. In my case I had to increase it up to 20. This increases the amount of work ParNew has to do, but it prevents objects from getting promoted too quickly. Then, in the GC logs, you can see objects living longer in the young generation:

3300904.569: [ParNew
Desired survivor size 322109440 bytes, new threshold 16 (max 20)
- age 1: 21130648 bytes, 21130648 total
- age 2: 14694568 bytes, 35825216 total
- age 3: 16107080 bytes, 51932296 total
- age 4: 14677584 bytes, 66609880 total
- age 5: 22870824 bytes, 89480704 total
- age 6: 15780112 bytes, 105260816 total
- age 7: 10447608 bytes, 115708424 total
- age 8: 14478280 bytes, 130186704 total
- age 9: 10581832 bytes, 140768536 total
- age 10: 20488448 bytes, 161256984 total
- age 11: 16537720 bytes, 177794704 total
- age 12: 8377088 bytes, 186171792 total
- age 13: 8877224 bytes, 195049016 total
- age 14: 9803376 bytes, 204852392 total
- age 15: 96226928 bytes, 301079320 total

ParNew auto-tunes the threshold. After each run it recomputes the tenuring threshold from how full the survivor space is: if objects are dying quickly and the survivors have room, the threshold grows toward MaxTenuringThreshold; if many objects survive and the survivor space is filling up, the threshold is reduced so that objects are promoted sooner.
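
The override goes in cassandra-env.sh. A sketch of what I mean (the value 20 worked for my workload and is not a universal recommendation):

# cassandra-env.sh -- the stock file ships with MaxTenuringThreshold=1
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20"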

All of this has failed and you are still seeing promotion failures? You need a larger heap

If you are unlucky like me, you have to go to the final resort, which some people in the Cassandra community hate. You will need more room for your young generation, but it is not advised to increase the young generation to more than 1/4 of the total heap size. In fact, Java HotSpot won’t start if the young generation is set beyond 1/3 of the heap space. You need a larger old generation for healthy promotions. In this case, gradually start increasing your heap size and young generation size, without going beyond 1/2 of your available physical memory. I recommend increasing the heap size by 2Gb at a time and adjusting the young generation size accordingly. Keep in mind to preserve your high value of MaxTenuringThreshold.
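
In cassandra-env.sh this amounts to overriding the auto-calculated sizes. A sketch of the kind of override I mean (24G is what I ended up with; the HEAP_NEWSIZE value here is simply 1/4 of the heap, per the guidance above):

# cassandra-env.sh -- override the auto-calculated heap sizes
MAX_HEAP_SIZE="24G"
HEAP_NEWSIZE="6G"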

The drawback of running Cassandra with a larger heap is that your ParNew cycles may take longer, but if you have more CPU this will not be a problem. The benefit is that the majority of your expired objects will be cleaned up in the young generation, and at the same time you will not see super long stop the world pauses.

Conclusion:

In Cassandra garbage collector tuning we learned that it is a tedious task to find the sources of long stop the world garbage collector pauses. Every setup is different, and caution is advised if troubleshooting is to be done on a production system. We reviewed the following topics:

  • What are stop the world GCs and when are they problematic;
  • How the JVM garbage collection works;
  • What are signs of stop the world GCs in Cassandra;
  • How to troubleshoot and collect information about stop the world pauses;
  • What issues in Cassandra use cases will cause long stop the world pauses and how to address each of them;

Now let’s look at the final outcome and compare how I did after the tuning exercises:

Heap Utilization Plot After Cassandra Garbage Collector Tuning | Stop the World GC Times After Cassandra Garbage Collector Tuning | Heap Generation Utilization After Cassandra Garbage Collector Tuning

The improvement can be observed quickly. From the heap utilization graph you can see that the maximum heap size was increased to 24Gb and that it utilizes 10Gb on average. You can still see the cliff in the graph, which is a sign of a CMS GC execution, but in the stop the world GC times graph there is no significant GC pause before 20:00. However, you can see that the average GC pauses have gone up a bit, since ParNew is cleaning more space. The heap generation utilization graph shows that the survivors are utilized 28% on average, compared to 5% before.

References:

If you would like to read more about garbage collector tuning, there are hundreds of articles on the web; I have cherry picked a few of my favorites here:

Unlocking the Power of Data Strategy

Introduction

In a startup’s early days, you should hire full-stack engineers who are willing to learn and do anything, to keep your costs low. This means that at some point an engineer is asked to set up data ingestion pipelines and BI tools so that your product team can make data-driven decisions. As your organization scales, the need to make data-driven decisions is no longer limited to product decisions. Other stakeholders such as the CFO and CEO want to look at financial metrics regularly. A mistake I have seen a CEO make early on is to segregate the analysis of financial data into silos, or to ask each executive running a functional organization to do it their own way; I have also worked at organizations that hire Excel crunchers rather than BI experts. This creates fragmentation in the organization at the cost of increased headcount, fragmentation in tools that increases maintenance cost, and debates about what the source of truth is. And when you want to ask questions that require correlating product metrics, financial data, and other sources, these solutions fall short. For example, the financial analysts have to do rigorous, repeated work each month in Excel, pulling data from various sources such as ERP, LMS, and other systems into presentation slides, which are then presented while an army of executives and middle managers sits in a room. Most of the questions that arise could be answered if the charts in the presentation were interactive, but instead they cause the chaos of having to go back, recreate, and re-present the deck. This is an absolute waste of time in an era of tools and systems that can give you a bird’s eye view of your business as well as the ability to drill down without having to reproduce anything. Meanwhile, the engineer wants to set up data ingestion pipelines into a common data warehouse and build BI dashboards on top of it, and the job of that full-stack engineer quickly becomes difficult and goes beyond their original expertise.

Additionally, in an age of AI where data is heavily used to train models, which in turn produce data to train models even further, the semantics of data and how it is collected and quality controlled become essential. For example, if a company wants to use images to train deep learning models that detect and classify objects in an image, they need to first collect high resolution images, and secondly have a way to contextually tag images for model training, which involves human judgment at first. If the company didn’t specify standards for required image quality and how images are going to be tagged as part of their data strategy, it will affect whether this feature can be built at all. Therefore, it is prudent to think about how to solve these problems early on in your organization.

In today’s data-driven world, organizations across industries are realizing the paramount importance of a well-defined data strategy. A robust data strategy can drive business growth, enhance decision-making, and ensure a competitive edge for the long term. In this article, we will explore what a data strategy is, delve into various data strategies, examine metrics for different teams and stakeholders, explore data team structures, tools, roles, and provide solutions for implementing a successful data strategy.

Understanding Data Strategy

A data strategy is a comprehensive plan that outlines how an organization will collect, store, manage, and leverage data to achieve its objectives. It encompasses a variety of components, including data governance, data architecture, data management, and analytics. An effective data strategy is tailored to an organization’s specific needs, aligns with its goals, and treats data as a valuable asset.

In the examples above, the chaos caused by team and tool fragmentation is an outcome of not having a clear data strategy. A clear data strategy would have defined what data needs to be collected and in what way, and which roles need to be involved in making it happen. For instance, it would have been wiser to hire a data engineer, or at least a scrappy analytics engineer with experience working on data stacks, to build a unified data warehouse. Additionally, instead of hiring Excel experts, data analysts with BI expertise would have been more valuable assets.

Moreover, if the example company was interested in leveraging data for machine learning and AI, a data science consultant could have advised on how to set the stage properly for future machine learning and AI needs, if senior leadership lacked those skills in their decision making.

These are some of the problems that should have been solved as part of a clearly defined data strategy.

Different Data Strategies

Now that you have learned about the importance of data strategy through some examples, in this section we delve into which type of data strategy suits different problems best, with some examples that already exist out there.

Data Innovation Strategy

This strategy focuses on fostering a culture of innovation within the organization. It encourages experimentation, data-driven decision-making, and the development of data-driven products or solutions. Tech startups and forward-thinking enterprises often adopt this approach.

Data Driven Decision-Making

To implement this strategy, you should instill a culture of data-driven decision making throughout the organization, and employees should be trained in how to understand data, tell stories with it, and use it for decision making effectively. It is easy to tell whether your organization lacks data-driven decision making by looking at the bottom line. If the overall performance of the organization is poor and declining, that is a clear sign that data is not being used properly to make decisions.

Experimentation

Another way to instill innovation with data is to encourage your employees to run experiments. If you are a software organization building a software product, A/B testing is the de facto standard, and successful companies like Google, Facebook, Airbnb, Houzz, and Dropbox have all instilled a culture of experimentation. Even if experimentation leads to failure, know that innovation arises from learning and adapting; failure is an opportunity to learn. When you observe in your team’s meetings that people debate ideas and implementation choices based on gut feeling and want to apply only their past experience, that is the best place to raise awareness and push for experimentation rather than gut feeling and past experience, which can certainly be biased. Experimentation is not limited to software; it applies to hardware as well, much as when scientists run lab tests on various pieces of hardware to understand their behavior.

Cross-Functional Collaboration

In developing a data strategy for innovation, the key element is cross-functional collaboration. Teams that operate in silos will miss opportunities that could lead to solutions improving the overall quality of their product for their customers. Imagine the very first example in this article, where the finance team is siloed from product. Assume the product team thinks building a certain feature would improve their product’s visibility and customer adoption. They have done the qualitative research to prove that what they want to build is something customers want. However, they don’t have access to operational cost metrics such as CAC or COGS. If senior leadership does not see this shortcoming, the result could be catastrophic: the company could end up making a product that users may like to use, but whose cost of operation doesn’t improve the bottom line and in fact burns money.

Tesla’s Data-Driven Autopilot system is an excellent example of an innovative data strategy. It collects data from their vehicles, including sensor readings, camera images, and driver behavior, to improve its self-driving capabilities. This data not only helps Tesla refine its autonomous driving systems, but also keeps their vehicles up-to-date with software.

Some lessons learned from Tesla’s story are:

  • Innovation often relies on continuous data collection and analysis
  • Real-world data is crucial for training machine learning models
  • Safety and compliance are paramount when implementing data-driven innovations

Offensive Data Strategy

In building a data strategy, offensive data strategy is all about leveraging data to gain a competitive advantage. It includes using data analytics and insights to enhance customer experiences, develop new products, and improve operational efficiency. E-commerce, tech, and marketing companies often employ this approach.

Customer Insights

Every company must analyze its customers’ data to understand their behavior and preferences. These insights should then be used to improve customer service, personalize marketing, and develop tailored products. Almost every service we use today has some way of giving us recommendations and information personalized to our interests; this is the most common pattern.

Operational Efficiency

Data is not only used to improve customer-facing products. It can also be used to optimize internal operations by identifying bottlenecks, inefficiencies, and areas for improvement. Large companies like Google and Facebook have dedicated teams built around creating custom solutions to optimize their operations. At their scale, if a solution slightly optimizes the cost of operation for a single user, then across a billion users it can add up to millions in savings.

Competitive Intelligence

Data is used to gain insights into competitors’ strategies, market trends, and emerging opportunities. When I worked at a consumables company, we used third-party data providers that gave us aggregated reports on which product categories and types customers purchased most. They also gave us insights into competitors’ market share. Using these two data points, we could identify patterns of customer interest that other competitors may not have captured, and have the company focus on producing products that met those consumer needs.


I am a huge fan of Netflix and how they use their data to improve their products and services. Netflix’s data-driven content recommendation is a prime example of a company employing various offensive data strategy tactics. They initially gathered data by renting DVDs and learned what genres users like to watch and what kinds of actors they like. This process got easier when they started streaming. They harnessed data to personalize content recommendations, and eventually to make their own movies and series, which ultimately increased user engagement and retention. This strategy gave them a competitive edge over traditional cable TV network providers.

Some lessons that Netflix learned are:

  • Personalization enhances user experience and keeps customers engaged;
  • Investment in machine learning and analytics is key for understanding user behavior;
  • Data-driven decision making can drive business growth;

Defensive Data Strategy

In this approach, the primary goal is to protect and secure data. It focuses on compliance, data privacy, and minimizing risks associated with data breaches. Industries like healthcare and finance often adopt this strategy due to stringent regulations.

Compliance and Data Privacy

Any organization that deals with any sort of user data has to employ defensive strategies to comply with data privacy and protection laws such as GDPR, HIPAA, PIPEDA, or CCPA. There is a laundry list of criteria for each of these compliance regimes, which is outside the scope of this article. However, a good example is that in these situations you should care about how and where user data is stored, accessed, and deleted. For instance, GDPR requires that data belonging to users in the European Union be stored on servers in the EU or in a jurisdiction that abides by European laws.

Risk Management

Data-related risks should be identified and mitigated by implementing security measures, encryption, access controls, and data audits. When I worked in the cannabis industry, we gathered usage data from users’ vaporizers. A common risk we had to deal with was how to store the user behavior data in our data warehouse so that nobody could trace it back to the user, and only the user had control over their actual data. The approach was to use a special kind of encryption whose key the user controls. We stored all the data anonymized and could still use it to study user behavior and create personalized features, but only the user could see and attribute the raw data to themselves. And when a user requested data deletion, all of their data in our data warehouse would become orphaned.

Incident Response

As part of a defensive data strategy, it is important to develop clear protocols and response plans for data breaches or security incidents to minimize potential damage. I believe every organization needs an incident response plan once they store any user data that could be abused.

Because trust is paramount in the sharing economy, Airbnb employs a defensive data strategy to ensure the safety and privacy of its users. They use data to verify identities, monitor user reviews, and detect fraudulent listings. In addition, they comply with regulations and data privacy laws in various countries, earning the trust of their hosts and guests.

Some lessons learned from Airbnb’s story are:

  • Trust and data security are crucial for marketplace platforms;
  • Compliance with local data protection regulations is vital;
  • Data-driven security measures can prevent fraud and build trust;

Enterprise Data Strategy

In my opinion, enterprise data strategy is just another term for data strategy, bloated with the word “enterprise” because big Fortune 500 companies like to call themselves enterprises; there is not that much difference from simply having a solid data strategy. In a nutshell, it is a comprehensive framework designed to ensure that an organization’s data-related initiatives are aligned with its overall business objectives. Unified methodologies are used to collect, manage, and process data across the entire organization while maintaining data governance and security.

In some organizations defining an enterprise data strategy would require a lot of leg work across different organizations, hence some like to hire a Chief Information Officer to be responsible for this.

Implementing a Successful Data Strategy

We discussed earlier that a robust data strategy can drive business growth, enhance decision-making, and ensure a competitive edge for the long term. With this as our goal, we can gauge the effectiveness of our data strategy throughout the lifecycle of our product and business development. Here is a basic plan for how to come up with a data strategy:

Define Clear Objectives

Whether we want to improve customer satisfaction, increase revenue, or optimize operating costs, we first need a prioritized list of our goals.

Data Governance

Some may delay or omit this item from their plans, but in my experience it is essential to have a sense of a governance framework early on, because it is going to be orders of magnitude more difficult and time consuming to do this once large amounts of data have already been collected. For example, if you never thought about how to clean up user-identifiable information from the user activity data you collected in your data warehouse, you will not be compliant with regulations like GDPR or CCPA, increasing your risk of lawsuits.

Invest in Data Quality

This is likely the most critical aspect of any data strategy implementation. A report by IDC (International Data Corporation) states that companies lose 20-30% of their revenue due to poor data quality. This speaks to the scale of the inefficiency and rework involved.

Adopt Advanced Analytics

Say goodbye to spreadsheets (sorry, finance people, I can do everything in BI), and employ advanced BI tools such as Tableau, Looker, Superset, or Power BI.

Data Culture

It is time to say goodbye to decisions based purely on experience and gut feeling. Everyone in the organization should value data and have data to support their arguments. Make this culture firm, and don’t let anyone slide by it.

Continuous Evaluation

The Japanese coined a term for this: kaizen. Like anything humans invent, a data strategy becomes obsolete at some point, so you must keep an eye on the latest trends and technologies to stay on the edge with your data strategy.

If you would like to read more about a real-world data strategy scenario for a business, read my other article, “Implementing Effective Data Strategy – A Cannabis Company’s Journey”.

Managing Technical Debt for Vitality of Successful Software Teams in 2024

Introduction

Technical debt is a concept that has gained significant prominence in the software development industry over the past ten years. However, the concept was introduced and debated as early as the late 2000s. It refers to the cost a software project incurs when developers take shortcuts or make suboptimal design and implementation decisions during the development process. Over time, this debt accumulates, making the codebase increasingly difficult to maintain and extend. This article explores the definition of technical debt, its types, and best practices for managing and mitigating this risk.

Definition of Technical Debt

Technical debt is akin to financial debt. In the realm of software development, it represents the long-term costs associated with hurried or subpar development decisions. These shortcuts often seem expedient in the short term, allowing developers to meet deadlines or deliver features quickly. However, they typically result in future challenges and increased costs, both in terms of time and resources.

For example, assume you are developing a simple API to serve a mobile application. Your development team has skills in mobile and backend development; however, the team lacks DevOps skills.

The CTO is told by the CEO that the deadline to launch the application has moved forward by two weeks because of client demand. In this situation everyone wants a happy customer, so your development team may do a good job of pushing for the deadline, but they have to cut corners as a result. They don’t have enough time to ramp up their DevOps skills and learn how to launch the application with a proper development process in mind. So they rapidly containerize the backend code and launch it on AWS ECS.

Notice that in this example, we didn’t evaluate how to establish a proper software development lifecycle using unit testing, integration testing, and CI/CD tools. The team leaves all of that for the future. In the short term the launch goes well, but then, as customer requests come in, the team cannot develop rapidly.

The development cycle is impacted negatively. They have to go back and do manual testing, and their release cycles fall out of a regular cadence. It takes more time to build features, and this is akin to financial debt. At some point the CTO needs to decide to fix this process, because otherwise the prolonged impact could be catastrophic to the business.

Ways Technical Debt is Created

There are different ways technical debt is introduced into software. Steve McConnell hypothesized in 2007 that there are two categories of technical debt: intentional and unintentional. In 2009, Martin Fowler expanded upon this to create the “Technical Debt Quadrant”, based on whether the debt is deliberate or inadvertent, and whether it is reckless or prudent.

technical-debt-quadrant
Source: Martin Fowler, 2009

Intentional Technical Debt

In the example above, the opportunity cost of not shipping the product and missing customer expectations would be detrimental to the company’s success, hence in the first iteration testing and DevOps best practices were ignored. Perhaps in the first few iterations features can be tested manually, but as the pace picks up, the cost of not implementing DevOps best practices outweighs the benefit of deferring them.

Developers sometimes decide that the technical debt is worth the risk to launch their software in the market. This is acceptable if the debt is monitored and watched from the time of implementation.

In this case the tech debt needs to be backlogged, tracked, and repaid before it becomes a burden for the team.

Unintentional Technical Debt

Developers might have designed the software carefully and with a lot of thought. However, after implementation and launch, they may realize they could have designed the components of the application better.

This debt comes from bugs and a lack of attention or focus from developers, and the issues keep adding up.

Accidental Technical Debt

This is where the foundation of the software starts off on the wrong path. Imagine developers have to build on legacy software whose codebase nobody is familiar with. They build upon that broken foundation, not knowing that somewhere a poor decision led to the poor design. So they have to stop, assess the risk, trace back to where the debt started, and address it.

Impact of Technical Debt

Prevalence

According to a survey conducted by CAST, a software analytics company, technical debt is present in 76% of software projects. This prevalence underscores the fact that many development teams encounter technical debt at some point in their projects.

Impact on Productivity

The Consortium for IT Software Quality (CISQ) estimated that technical debt can cost an organization up to $1 million per month in lost productivity and rework. And based on its 2022 survey, the cost of poor software quality in the US alone has grown to at least $2.41 trillion, with technical debt one of the major pillars contributing to this. These statistics underscore the severe financial impact of unchecked technical debt.

Maintenance Cost

technical-debt-maintenance-cost
Source: Maurice Dawson from IBM Systems Sciences Institute

The IBM Systems Sciences Institute found that maintenance accounts for 40-80% of software cost. A significant portion of these costs can be attributed to addressing technical debt.

Types of Technical Debt

Design Debt

This occurs when a software project lacks proper architectural design, resulting in systems that are inflexible, difficult to modify, and prone to errors. Developers must invest substantial effort in re-architecting to address this type of debt.

A prominent example of this is the choice between a monolithic architecture and a service oriented architecture. CTOs know that at some point their monolithic application, built on a single codebase, is not going to scale with their rapidly growing team size.

In this situation, a decision needs to be made to re-architect the application into separate services that talk to each other via APIs as the contract between them. That way, if team A wants to change a part of the system, its codebase will not be heavily coupled with other parts of the system, which significantly reduces the risk of errors. Importantly, each team can have its own cadence in developing its services.

For example, AI/ML projects take more time to accomplish. Assume that you are integrating a generative model into your application's support module to aid with customer responses. By the time your AI team has finished their work, the monolithic codebase has evolved so much in other parts that they feel they always have to play catch-up to keep their work properly integrated with the rest of the application.

Code Debt

Code debt arises from poorly written, unoptimized, or excessively complex code. It hampers readability, maintainability, and the ability to add new features without causing unintended side effects.

An example of code debt is when developers don't use design patterns to address repeatable solutions to commonly occurring problems in software design. This causes each developer to, perhaps unknowingly, implement a different solution for the same problem. Recall the good old mantra: "there are more solutions than problems!" The result is what we know as spaghetti code.

Let's assume there is a common object in your codebase that is used by multiple consumers. As the object evolves to meet different consumer needs, developers have to add more class variables, methods, and constructor parameters. But not all consumers use this object the same way. Hence, each developer creates their own subclass of this object and overrides the constructor parameters. Although this sounds like a good use case for inheritance, we now have many children of this object whose sole purpose is to override how the object is built, and a lot more code to maintain across different files.

This is where a design pattern like Builder comes into play. Its sole purpose is to abstract the way an object is built in an elegant and easy to maintain fashion, as sketched below. Developers are therefore encouraged to master design patterns, and this is why a lot of interview questions cover this topic.
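
As an illustration, here is a minimal sketch of the Builder pattern in PHP. The Report class, its fields, and the consumer code are hypothetical; a real builder would add validation and defaults appropriate to your domain.

// The object being built; constructed in exactly one place.
class Report
{
    public function __construct(
        public string $title,
        public array $columns,
        public ?string $filter = null,
        public bool $paginated = false
    ) {}
}

// The builder abstracts how a Report is assembled, so consumers no longer
// need subclasses whose only purpose is to override construction.
class ReportBuilder
{
    private array $columns = [];
    private ?string $filter = null;
    private bool $paginated = false;

    public function __construct(private string $title) {}

    public function addColumn(string $name): self
    {
        $this->columns[] = $name;
        return $this;
    }

    public function withFilter(string $filter): self
    {
        $this->filter = $filter;
        return $this;
    }

    public function paginated(): self
    {
        $this->paginated = true;
        return $this;
    }

    public function build(): Report
    {
        return new Report($this->title, $this->columns, $this->filter, $this->paginated);
    }
}

// Each consumer describes only the options it cares about:
$report = (new ReportBuilder('Monthly Sales'))
    ->addColumn('region')
    ->addColumn('revenue')
    ->paginated()
    ->build();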

Testing Debt

Inadequate testing or the absence of automated testing can lead to testing debt. This makes it challenging to identify and resolve issues early in the development process, causing defects to accumulate over time.

Take our first example in this article about the mobile app with a backend API. The developers ignored the need for automated testing, and it impacted their development cycle speed. It is crucial for any software team to have automated test coverage that gives the team confidence to make changes without having to know every piece of functionality that might break as a result.

Documentation Debt

Documentation debt occurs when project documentation is insufficient or out of date. This hinders the onboarding of new team members and creates confusion during maintenance and future development.

A lot of us have heard the mantra: "the code is the documentation." One can assert that their codebase is cleanly written and all object and variable names are meaningful, so it should be easy to understand. However, it is extremely challenging for a new employee to set up their development environment and figure out how services talk to each other without even a bare-minimum automated setup script or documentation showing the architecture.

Unfortunately, examples of this exist even in mature teams, where even simple API documentation that could be automatically generated by tools like Swagger is nonexistent.

Dependency Management

Regularly update and manage third-party dependencies to avoid dependency debt. This also helps ensure the security and stability of your software.

Nobody these days writes every single line of code from scratch. There are open-source libraries that make the job of a developer easier. However, developers must be conscious in choosing dependencies for their project that are actively developed and not dormant. Otherwise, there could be security issues that become a nightmare to fix if a dependent library is no longer supported, and you may have to rewrite a significant portion of your application to incorporate a different library, or patch and own the existing library yourself.

A rule of thumb is to avoid any library that has had no activity on its code repository for the past 3 months. Aside from proprietary libraries, which should be supported by their vendors at all times, this can be checked easily on GitHub.

Measuring Technical Debt

Now that we have talked at length about different kinds of technical debt with some examples, it is important to understand how to look at some technical debt metrics.

New Bugs vs. Closed Bugs

Every bug can be counted toward a tiny piece of technical debt. Software teams need to keep a tally of the number of bugs created vs. the number of bugs being closed. If new bugs are outnumbering closed bugs, it is a signal that some changes need to be made.

Code Complexity Metrics

Complex code is an absolute sign of accumulated technical debt. At some point, someone will need to go and open the can of worms and take on some major refactoring.

There are several factors that go into measuring the complexity of code:

Lines of Code (LOC)

The total number of lines in a codebase. While not a direct measure of technical debt, a large LOC can indicate a more extensive codebase that may have accrued technical debt.

Nesting Depth

Tracks how deeply code structures (loops, conditionals, etc.) are nested. Excessive nesting can make code harder to understand and maintain.

Class Coupling

This measures how many objects in the code depend on one another. If too many objects are coupled with each other, changes become error prone. The best way to address this issue is to incorporate proper design patterns and inheritance.

Inheritance Depth

This metric shows the depth of inheritance of classes. If the depth of inheritance is high, it is likely a sign that a lazy approach was taken when developing the code; some of the child-class logic could probably live in the parent classes, and more elegant code can result from incorporating design patterns rather than reaching for inheritance each time.

Cyclomatic Complexity

Measures the number of independent paths through the code. Higher complexity may indicate more challenging code to maintain.
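
As a rough illustration, counted as decision points plus one, the hypothetical PHP function below has four decision points (the foreach, the if, the elseif, and the country check), giving it a cyclomatic complexity of 5:

function shippingCost(array $items, string $country): float
{
    $cost = 0.0;
    foreach ($items as $item) {            // decision point 1
        if ($item['weight'] > 10) {        // decision point 2
            $cost += 25;
        } elseif ($item['fragile']) {      // decision point 3
            $cost += 15;
        } else {
            $cost += 5;
        }
    }
    if ($country !== 'US') {               // decision point 4
        $cost *= 1.5;
    }
    return $cost;
}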

There are tools, which we will talk about later, that can help teams measure these factors, but the general rule of thumb is to aim for the lowest possible score on each one.

Code Smells

Qualitative issues in the code that suggest potential problems. Common code smells include duplicated code, long methods, and excessive comments. Tools like SonarQube can help identify code smells.

Code Churn

Code churn is measured by counting the number of times a line of code has been deleted, replaced, or rewritten. The reason this speaks to debt is that if the original code was thoughtfully designed with reusability and modularity in mind, it shouldn't change very often. Some churn is always inevitable, but after a feature release and its bug fixes are done, this metric should settle down.

If teams see high churn over a long period of time in a particular area of the code, it likely means mistakes were made and quick fixes are being applied.

Test Coverage

This measures how much of your code is executed when automated tests are run. Large amounts of unexercised code are likely a sign of gaps in your test suite, or of dead code. The rule of thumb is to have at least 80% coverage in your automated tests; anything less means bugs can arise from the lack of tests, and some changes need to be made.

Static Code Analysis

Each programming language has some lint tools. Tools such as ESLint, TSLint, and PyLint provide insights into code quality, potential issues, and adherence to coding standards.

Regression Rate

Measures how often code changes lead to regression issues, i.e., previously fixed issues reappearing. High regression rates can indicate testing and code quality problems that can be a sign of technical debt.

Code Ownership

As a CTO you want to ensure that you have enough people working on each project so that if one person is on vacation and another gets hit by the proverbial bus, you still have a third person who can work on the code; otherwise, the show stops there. Having at least three people on a team who work on the same codebase or service is ideal code ownership.

Of course, hardly any company ever has enough resources to dedicate three people per project, and some people may work on multiple projects. It is not the exact segregation that matters here, but the average spread of codebase knowledge across your team. Otherwise, you cannot delegate properly, and asking someone unfamiliar with a codebase to change it, especially without any of the former contributors around, is likely to result in poor code quality.

Technical Debt Ratio (TDR)

Calculated as the ratio of the estimated effort required to fix the debt to the overall effort invested in the project. The higher the ratio, the greater the technical debt.

(Remediation Cost / Development Cost) x 100 = TDR

Remediation cost can be calculated as a function of the code quality metrics mentioned above.

The effort, or development cost, is measured by taking the number of lines of code that need to be written for a product feature, multiplied by the average resources expended per line.
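
As a hypothetical example, if static analysis estimates 40 engineer-days of remediation work for a feature that took 400 engineer-days to build, then TDR = (40 / 400) x 100 = 10%.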

Development Velocity

This is the time that elapses from the first code commit to a successful deployment. If the time it takes to implement new features grows because of bug fixing over time, it is likely a sign that quick fixes are being made in each iteration.

Dependency Analysis

Assess your dependence on third-party libraries and components, checking for outdated or vulnerable dependencies. If there are unaddressed issues, then technical debt has accumulated. Nobody likes getting hacked because some package was vulnerable.

Application Performance

If your application or front-end performance is not optimal, it is not a direct sign of technical debt, but it is a warning sign that some technologies are likely outdated or your developers are not paying attention to performance metrics.

Tools to Measure and Track Technical Debt

So far we have talked about what technical debt is, the different types of debt, and how they are measured, with some examples. Now let's explore some of the modern tools you can use to measure and track your technical debt.

Stepsize

Stepsize is favored by a lot of developers because it integrates seamlessly into the most popular IDE, VSCode. It is designed to help developers flag areas of the code as they are working in their IDE and, with a click of a button, create issues for poor design, needed refactoring, and more. It can practically replace the backlog in your issue tracker, or sync with it, and it replaces documentation scattered in the code to keep the code clean. It is free to try for a week, but it comes at a price after that.

stepsize-vscode-integration
Source: Visual Studio Marketplace

SonarQube

SonarQube has been one of my personal favorites for years; its purpose is to measure and improve code quality. It highlights potential bugs and code smells, and can also find potential security issues caused by bad coding practices using static code analysis. It has nice APIs that integrate seamlessly with many CI/CD platforms so that you can give your developers visibility into their code as part of your development lifecycle. Lastly, it is a well-respected, enterprise-grade code quality tool for any organization seeking compliance as part of its development cycle.

sonarqube-dashboard
Source: Sonarsource.com

Velocity by CodeClimate

Velocity by CodeClimate integrates seamlessly with your source control, like GitHub, and monitors your development process based on pull requests, issues resolved, code reviews, and how many lines of code have changed. It is a great tool to measure code churn and gauge the quality of the code your team is delivering.

For example, its charts show how much of the code written is new vs. reworked.

Linter Tools

Linter tools are available for almost all programming languages: Checkstyle for Java, PyLint and Flake8 for Python, ESLint for JavaScript and TypeScript, and CSSLint for CSS. These tools perform static code analysis and have robust command line interfaces that you can wire into Git pre-commit hooks. The advantage of doing it this way is that you bake a strict coding style and the quality checks provided by these tools right into your codebase and prevent developers from committing code unless all checks pass. These tools are free.

Dependabot by Github

Dependabot scans your code repositories for outdated package dependencies. This is good for catching all those packages that need upgrading due to security vulnerabilities.

Conclusion

Technical debt is an inevitable part of the software development process, but it is manageable. Recognizing its existence, categorizing it, and implementing effective strategies to address it are essential steps in maintaining a healthy codebase. By proactively managing technical debt, development teams can reduce the long-term costs and risks associated with suboptimal development decisions, ultimately delivering more robust, efficient, and maintainable software products.

The post Managing Technical Debt for Vitality of Successful Software Teams in 2024 appeared first on AryaNet.

]]>
https://aryanet.com/blog/managing-technical-debt-in-2023/feed 0
Web Scale Image Service https://aryanet.com/blog/web-scale-image-service https://aryanet.com/blog/web-scale-image-service#respond Fri, 19 Sep 2014 20:52:00 +0000 https://aryanet.com/?p=1051 6 minutes readIn this article I’ll describe the journey to architect and build a web scale image service. This could be the way some very large services like Pinterest, and Instagram host and serve their images. Recall the previous article where we discussed scaling a sharded image service in a data center? We walked away with a […]

The post Web Scale Image Service appeared first on AryaNet.

]]>
6 minutes read

In this article I’ll describe the journey to architect and build a web scale image service. This could be the way some very large services like Pinterest and Instagram host and serve their images.

Recall the previous article where we discussed scaling a sharded image service in a data center? We walked away with a solution that was low risk, as time was the limiting factor in our decision. However, it really didn’t solve the core architectural problems; the solution was not at web scale. Let’s review those problems:

  • Not Durable: durability of storage didn’t exist beyond the copies of images within the single data center.
  • Not Performant: the performance of the service was only good enough to serve North America and not elsewhere. Moreover, the performance depended on the availability of large cache nodes.
  • Not Scalable: there is no way to easily set up POPs and scale the infrastructure for global traffic.
  • Expensive: operational costs added up quickly.
  • Hard to Maintain: we learned from our past exercise that scaling the cluster would take a significant amount of effort. Additionally, the execution time to rebalance the cluster is directly proportional to the size of storage.
Web Scale Image Service Architecture

Storage

Amazon S3 is a good choice for storing images in the cloud. It has a durability of 99.999999999% and availability of 99.99%. However, we need to prefix our objects in S3 so that we can achieve the maximum read throughput from S3. Read the AWS documentation on organizing objects using prefixes to learn more.

Since the image file names are hashes, we take the first 6 characters to form object prefixes in S3.


On local disk:
/home/images/3972b16287e595a88a50624587bccc66ece0f4c7.jpg

On S3:
s3://my-image-bucket/3972/b1/3972b16287e595a88a50624587bccc66ece0f4c7.jpg

This has a couple of advantages. Because S3 shards data by prefix, prefixing objects distributes them across different S3 clusters based on the first 6 characters of the object name. When we have a large cache miss ratio due to cold caches, the load will not hit a single S3 cluster, which solves the thundering herd problem with S3 requests. Additionally, if we want to scan the images in the bucket in a particular range, we don’t have to scan the entire bucket; scanning with prefixes is a much faster option.
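
As an illustration, a tiny PHP helper (hypothetical function name; the exact prefix split is whatever scheme you register with AWS) can derive the prefixed key from the file hash:

// Derive the S3 object key from the image file hash, mirroring the layout above.
function s3KeyForImage(string $fileNameHash, string $extension = 'jpg'): string
{
    $prefix = substr($fileNameHash, 0, 4) . '/' . substr($fileNameHash, 4, 2);
    return $prefix . '/' . $fileNameHash . '.' . $extension;
}

// s3KeyForImage('3972b16287e595a88a50624587bccc66ece0f4c7')
// returns "3972/b1/3972b16287e595a88a50624587bccc66ece0f4c7.jpg"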

Please note that for any prefix scheme you design, you should file a support request with AWS informing them of your scheme and desired throughput. The AWS team will do some magic in the backend to ensure their service can handle your requests with millisecond latency at that throughput.

The application code was set up to write to both the image shards and S3, and we set up a nightly sync process between the image servers and S3. Additionally, we set up a bucket in another region to have multi-region image storage. Hence, we would have even higher durability and better performance in case of a cache miss.

Dynamic Image Resizing

We used the ImageMagick library with its PHP bindings to write a simple API script that reads images from S3, performs resizing on the fly, and returns the output to the cache servers. There are many ways to host this API.
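
For reference, a stripped-down version of such an API could look like the sketch below. This is hedged: the bucket name and query parameters are illustrative, no validation or error handling is shown, and it assumes the AWS SDK for PHP (v3) and the Imagick extension are installed.

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['region' => 'us-west-2', 'version' => 'latest']);

$key    = $_GET['key'];                                   // e.g. 3972/b1/3972b1...jpg
$width  = isset($_GET['w']) ? (int) $_GET['w'] : 640;
$height = isset($_GET['h']) ? (int) $_GET['h'] : 480;

// Fetch the original from S3, resize it in memory, and return a JPEG.
$object = $s3->getObject(['Bucket' => 'my-image-bucket', 'Key' => $key]);

$image = new Imagick();
$image->readImageBlob((string) $object['Body']);
$image->thumbnailImage($width, $height, true);            // bestfit keeps aspect ratio
$image->setImageFormat('jpeg');

header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=31536000');        // let Varnish/CDN cache it
echo $image->getImageBlob();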

Dynamic Resizing at the Edge with Akamai

We were exploring CDN options to replace Varnish cache. Akamai at the time had launched a solution within its Image Manager product suite that did image resizing at the edge. We performed several load tests, but we couldn’t achieve the desired performance in the cache miss scenario. This could be due to several factors involving the dynamic routing of Akamai’s SureRoute and where our us-west-2 Oregon origin was relative to Akamai’s POPs. We worked closely with the Akamai team, but the root cause of the performance issues was never 100% remedied, so we moved on from testing this solution further.

Host EC2 Instances

One solution would be to host EC2 instances with Apache and PHP FastCGI. This is the classic solution of hosting machines with auto-scaling to run the API code. Additionally, our existing configuration management already had packages to set up this machine role, so we were only decoupling the storage to S3 and modifying our PHP script a tiny bit to read objects from S3.

Load testing this solution revealed that the desired performance could be achieved in both cold start and warm cache operations.


When we worked on the above solution in the summer of 2014, AWS had neither a managed Kubernetes service nor Lambda. Ideally, even more cost effective solutions could exist today. I added these sections to this article for better reference on the future evolution of this service.

Kubernetes

At the time of writing this article, Kubernetes had just started to gain momentum in the software community. The team didn’t have any prior experience running things on K8S, so evaluating this solution and settling on it would likely have taken longer. We simply skipped it due to time constraints.

AWS Lambda

AWS Lambda was released in November 2014, just 2 months after we launched the solution with EC2. The simplest way to host this API in a serverless fashion is to use API Gateway with a Lambda function. This is also the most cost effective way. However, when load testing this solution, the performance can be spotty, which is attributed to how Lambda functions are provisioned and persisted in AWS. We had the request load metrics from our last cold start and warm operation of the image cache, so we could eventually follow AWS Lambda performance optimization tips to tune the Lambda configuration to keep enough concurrency warm for the cold start scenario before launch, and then tune it down later.

This would be the ideal cost effective solution upon proof of performance. It is serverless and has a very low maintenance cost.

CDN Caching

We ran several proofs of concept with different CDN providers, starting with Akamai. Since we couldn’t beat our baseline performance with Akamai, we knew we had to set up our dynamic resizing solution ourselves as explained above. This opened up the opportunity to explore a wide variety of CDN providers that didn’t have image resizing in their offering. We chose Fastly because our load testing revealed the highest performance compared to other CDN providers, and we significantly beat our baseline too. Another plus for using Fastly was its ease of setup: since its backbone runs on Varnish and we were very familiar with Varnish, it significantly simplified our deployment, and we could move several of our Varnish VCLs and security configurations over to Fastly in a breeze.

Load Testing

We performed load testing via Apache JMeter. We designed our test plan by downloading a list of image URLs from requests in the server logs; to capture a good range, we took the data for one week. Then we used the JMeter UI to formulate a test plan, which saved the plan in XML format. We took the XML as our base test plan, and since the CDN domain was parametrized, we could simply switch it in place via a script. A CloudFormation stack would spin up EC2 instances in various AWS regions to simulate live traffic from multiple locations. In each EC2 instance’s startup user-data script we made all the changes we needed to the XML, fired up JMeter to run its test, and then uploaded the resulting JTL files to S3. Later we loaded the JTL files into JMeter and looked at the plots to get averages and standard deviations for request latencies. At the same time, we monitored cache warm-up by looking at metrics like hit ratio.

Launch Plan

The cutover plan to the new image service was straightforward. To pre-warm the cache, we set up a worker that listened to the existing infrastructure’s requests and fired a new request to the CDN URL, meaning that in the background we mirrored live traffic against the CDN. This helps warm up the LRU cache in Fastly’s Varnish: the more requests sent for the same image, the longer it persists in cache memory. Once our cache hit ratio was above 80%, we decided it was a good time to cut over, and within a week after cutover we were able to achieve a 92+% cache hit ratio from Fastly, which was excellent. Our cache misses had a latency of at most 150ms, and cache hits were all served under 20ms.
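
As a rough sketch, the mirroring worker amounted to little more than replaying request paths against the CDN. The log path, CDN hostname, and log format below are hypothetical:

// Replay image request paths from the live access log against the CDN
// to warm Fastly's LRU cache.
$cdnBase = 'https://images.example-cdn.com';

$log = popen('tail -F /var/log/nginx/access.log', 'r');
while (($line = fgets($log)) !== false) {
    // Pull the request path out of a combined-format log line.
    if (preg_match('/"GET (\/[^ ]+\.(?:jpg|jpeg|png|gif)) HTTP/', $line, $m)) {
        $ch = curl_init($cdnBase . $m[1]);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // fetch and discard the body
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        curl_exec($ch);
        curl_close($ch);
    }
}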

User Experience Impact

The user experience improved as part of this transition. We noticed a 15% uplift in DAUs and active session length. This speaks to the long-held belief that performance does indeed impact user experience.

Conclusion

fastly-cache-performance
Fastly Cache Performance

In conclusion, our new solution resolved all of the outstanding pain points in this service architecture. We achieved durability by using S3, and we set up multi-region S3 syncing in case we want multi-region availability later on. Our infrastructure cost for the image service was reduced to a fraction of the bills we used to pay for renting servers and bandwidth in various data centers. It scales seamlessly without manual human intervention. Overall, we built a web scale image service that is fault tolerant, performant, and built to last.

The post Web Scale Image Service appeared first on AryaNet.

]]>
https://aryanet.com/blog/web-scale-image-service/feed 0
Scaling a Sharded Image Service https://aryanet.com/blog/scaling-a-sharded-image-service https://aryanet.com/blog/scaling-a-sharded-image-service#respond Tue, 20 May 2014 20:38:00 +0000 https://aryanet.com/?p=1019 12 minutes readI once helped a company that its core product was relying on high resolution images, and each page load on its website and mobile apps loaded 60-100 images easily in a single page request. Thinking like Pinterest or Instagram as users browsed the service, the volume of these requests would be staggering and having a […]

The post Scaling a Sharded Image Service appeared first on AryaNet.

]]>
12 minutes read

I once helped a company whose core product relied on high resolution images; each page load on its website and mobile apps easily requested 60-100 images in a single page view. Thinking of services like Pinterest or Instagram, as users browsed the service the volume of these requests was staggering, and having a performant infrastructure to serve images directly impacted user experience and revenue. In this article I will show you my first journey with scaling the existing architecture, the lessons learned, and the path laid out for a future design that would auto-scale for good.

Please note that as a security conscious leader, I have left out details pertaining to the security controls in this infrastructure architecture, and nothing I share here reveals any secret sauce relating to that.

Background

Since the bread and butter of this business is the images its users upload to the service, durability of storage is a must-have criterion. Second to that are performance and availability, because even a single thumbnail loading slowly, or not loading at all, negatively impacts the user experience. Upon taking over this project, I quickly realized that at the current upload rate, the existing image cluster would run out of space in less than 2 months. It was 2014 and nobody should have been worried about space issues any more with the advancements in public cloud technologies like Amazon S3, but unfortunately this infrastructure was not hosted in the cloud; it was hosted by a data center managed service provider, and hence came with its own scaling challenges.

Current Architecture

Before we dive into problems and solutions, let’s understand how the existing infrastructure architecture works and what problems we are facing.

sharded-image-service-v1
Figure 1 – Original Sharded Image Service

There are two managed data center colocations in this architecture, as depicted in Figure 1. The one on the left is the primary data center that hosts everything: application servers, backend databases, caching servers, search indexes, and image servers. To keep the focus on the image service, the diagram shows only the components involved in storing and serving images; all other components have been left out. The data center on the right hosts only a caching layer which uses the primary data center as its origin backend. You can think of it as a CDN point of presence, or POP. The reason for this multi-region architecture is performance, as this company served all of North America at the time.

Multi-Tiered Service Architecture

Since browsers can handle a limited number of simultaneous requests to a domain, the domain that served the web page was the primary domain, like www.mywebsite.com, and a separate static domain was used to serve the images, like st.mywebsite.com. The static domain is also a cookie-less domain, meaning requests and responses don’t carry any cookie payload. This is a micro-optimization tactic and a best practice for static services to keep request payloads small, since cookies can be as large as 4KB and usually trigger server-side processing.

The client browser first sends the page load request to the primary domain, which is hosted by the US-East colocation only, and receives the HTML back. The browser then parses the HTML, detects images coming from the separate static domain, and starts to fetch them asynchronously, 10 at a time. The static domain is configured via a Geo DNS service like Dyn, which resolves the domain to an IP address pointing at either the US-West or US-East colocation based on the user’s proximity to each location, for faster connection establishment and data transfer.

Inside each colocation there is an active-passive pair of Nginx+HAProxy load balancers that terminates the connection and routes the request to the appropriate cache backend. At the time HAProxy didn’t have TLS support, and a common load balancing stack used Nginx for TLS termination and HAProxy for its core competency, which is load balancing. Varnish was used for caching images and assets like JS/CSS; there were two clusters of two very high memory machines in each colocation (four boxes total), and each cluster was responsible for either static-sized or dynamically-sized images, with failover set up between the two clusters. Unfortunately, US-West was the only colocation with the source images residing on separate image servers, so US-East didn’t have its own image source origin and instead used US-West as its origin backend. This is actually how most CDNs work when the source origin lives in one place; however, that one place is a huge risk if the colocation gets hit by a disaster.

The cache servers used a cluster of machines with large disks in RAID1 and RAID0 configurations as their backend origin. This is where the actual image files were stored. Since there were tens of terabytes of images, vertically scaling the image servers was not a scalable option, so a sharding strategy was implemented to distribute the images horizontally among multiple active-active pairs of boxes. I will discuss how this logic was done in the next section. The Varnish configuration was programmed to serve images from its cache upon a hit, and was able to detect which image shard to use as its backend origin in case of a miss.

Consistent Hashing Strategy for a Horizontally Distributed Storage

Images are uploaded from the client browser through the load balancer to the application server. The application server uses a consistent hashing strategy to determine which image cluster shard the image falls into, and also creates multiple resized thumbnails of the image.

Here is an example:

An image file is uploaded. To avoid collisions when hashing based on the image file name alone, the owner’s username and the upload timestamp were appended to the file name, and then a SHA1 checksum of the resulting string was computed (we are not going to talk about SHA1 collision probability here as it is too low; you can nerd out about it in this Google blog post). To determine which shard the image falls into, the first 4 characters of the resulting hash were taken and used as the shard token for that image.

We call this a consistent hashing strategy because the resulting shard token is derived from the file name and is consistently reproducible by the same algorithm.


shard_count = 32
file_name_hash = sha1(original_file_name + username + time_microsecond)
shard_token = file_name_hash.substr(0, 4)
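
In plain PHP, the same computation is just a pair of calls (the variable names here mirror the pseudocode and are illustrative):

$file_name_hash = sha1($original_file_name . $username . $time_microsecond);
$shard_token = substr($file_name_hash, 0, 4);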

The file_name_hash was used as the image metadata in the application database, and that is what was rendered in the image URL along with some metadata appended for SEO optimization; the shard_token was used to map the file for upload to a specific pair of image machines. Before we get to the shard token, let’s see how the token range is defined and assigned to each machine. Since SHA1 produces a string of hexadecimal characters, taking the first 4 characters of the string as the token gives us the following total token range:


total_token_range = [0000, ffff]
bucket_size = ffff / 32 = 51e

Now assume we have 32 pairs of image machines, which translates to 32 image shards. To evenly distribute the image load, we need to divide this range into 32 equal buckets and assign each bucket to a pair of machines. The upper bound of each bucket’s range was stored in the application server configuration as the hash key, so that when an image is uploaded, the application server can take the shard token and iterate over the token ranges, performing an arithmetic comparison against each token’s upper bound to determine which image server to copy it to:


image_cluster_token_map = {
  "051e" => Array("images-m-01.internal_dns", "images-s-01.internal_dns"),
  "0a3c" => Array("images-m-02.internal_dns", "images-s-02.internal_dns"),
  "0f5a" => Array("images-m-03.internal_dns", "images-s-03.internal_dns"),
  ...
  ...
  "fae1" => Array("images-m-31.internal_dns", "images-s-31.internal_dns"),
  "ffff" => Array("images-m-32.internal_dns", "images-s-32.internal_dns")
}

foreach(image_cluster_token_map.keys() as token) {
  if (shard_token <= token) {
    return image_cluster_token_map[token];
  }
}

The same consistent hashing strategy can easily be implemented on the Varnish cache servers, where the input is the image URL containing the file_name_hash; the origin image cluster can therefore be identified upon a cache miss by performing the same arithmetic logic in Varnish VCL.

Problems

As you have digested the above architecture, you can easily spot multiple challenges with it:

  • Durability: even though there are pairs of image servers, and durability exists against a single machine failing, they all reside in one colocation. If this colocation gets hit by a disaster, the entire business is gone, as there was no business continuity plan or backup in place.
  • Performance: if either Varnish node went down, there would be a hotspot on the other Varnish node, likely causing the LRU cache to evict objects more quickly, lowering the cache hit rate and reducing the performance of the service as a result.
  • Scalability: this architecture is definitely not at web scale. The company was planning a large international expansion at the time, and there was no way to move fast and set up more POPs to serve international traffic within the desired performance SLA.
  • Expensive Infrastructure: each POP cost thousands of dollars per month to maintain, especially the 10G uplinks, which aren’t cheap.
  • High Maintenance Cost: an image shard running out of space is a big issue, especially when it is tied to user upload volume, which could grow quickly and out of our control. So you always have to think ahead, do capacity planning, and keep enough headroom to allow for unprecedented growth. And as you can see from the architecture, to avoid hotspots, each time we want to scale the image shards we have to double the capacity to evenly distribute the load while avoiding re-tokenization and rebalancing, which is an expensive operation in distributed systems like this.

Solutions

Time was a huge factor since the cluster would run out of space in 2 months. So, I had to evaluate multiple options with their pros and cons:

  • Move to the cloud: the ideal option would be to replace this entire setup with AWS, leveraging S3, a few EC2 machines in an auto-scaling group to perform the dynamic resizing, and a CDN in front. This would be my ultimate solution for good. At the time, there was a belief in the team that S3 and cloud solutions aren’t performant or cost effective. I was the only cloud guy on the team, so I had no time to experiment with a POC and do performance testing to demystify these beliefs.
  • Replace the architecture with a scalable storage solution: instead of scaling the existing setup, I could have replaced it with another system like Ceph. This could have been a good solution for keeping things running in the data center, abstracting away the storage, and eliminating all of this manual sharding. However, at this time Ceph had just been open sourced by the Dreamhost team that built it, so there was risk in adopting a new open source component, and since this required architecture changes that were not battle tested with our workload, I would have had to undergo a massive re-architecture and load testing effort to make sure it could deliver the same performance SLAs. Two months seemed short, so the risk was high.
  • Scale the current architecture by doubling its capacity: since this architecture was proven to be performant enough to serve the entire North America market, I could confidently say that I didn’t need to perform rigorous performance testing; I just needed to double the number of image shards. My only bottleneck was whether the managed data center provider could give me more servers fast enough that I could perform the migration before the two-month deadline.

Scaling the Current Architecture by Doubling its Capacity

Here is the step by step plan of execution for this strategy:

  • Define the hardware spec and order it through the managed data center provider in the US-West region
  • Set up the servers using the configuration management tool, which was Saltstack at the time
  • Implement a dual-write strategy in the application so new images are copied to both the existing and new shards
  • Run a migration script asynchronously to copy half of the images, based on splitting the token range, from the old machines to the new machines
  • Change the Varnish cache to use all 64 image shards
  • Clean up the old servers and remove half of their load, which was moved to the new servers

Normally when you order hardware, depending on which chassis and CPU architecture you order and the state of the supply chain, it can take anywhere from 2-6 weeks for the servers to arrive at the colocation. I just hoped the timeline would be closer to 2 weeks than 6. Luckily the image servers were simple 1U blades with Xeon E3 processors and Hitachi hard drives. This was the bare-bones server architecture used for all general purpose application workloads; you can think of them as being like the M class in AWS EC2. The data center folks, since they ran their own cloud business too, had a lot of these servers in stock and ready to repurpose. My only additional requirement was the RAID controller and extra Hitachi drives to give more storage capacity to each box. So I asked them to set up 32 more boxes, with 16 on rack A and another 16 on rack B where the other image boxes lived. Each rack had its own switch and separate power source, so at least we’d get power and network redundancy within the single colocation. It took two weeks for the machines to be ready for provisioning.

Saltstack was used to automate provisioning of the machines and configure each box with the necessary services. The next step was to assign new token ranges to each box and implement a dual-write strategy for new uploads as part of phase 1 of this migration.

Phase I – Splitting Token Ranges and Dual Writing Strategy

In order to evenly distribute the load of the existing boxes, we had to split the token ranges they were responsible for. This is why we doubled the number of shards from 32 to 64, for a total of 128 servers.

image-service-scaling-dual-write-step
Figure 2 – Phase 1 : Dual Write Strategy

In the original setup, there were a total of 64 servers and 32 shards/buckets. The pairs were set up in different racks with different power and network sources for redundancy within the local colocation. The size of each server’s bucket was 0x051e. Varnish cache nodes were reading from all servers, and application servers were copying newly uploaded images to the server pair that owned the token range for those images. This is depicted on the left of Figure 2.

The right side of Figure 2 shows the new architecture. We added a new pair of servers for each existing pair, claiming half of the token range of the existing pair. This means we split the initial token range of 0x051e in half, which comes out to 0x028f. So taking the first pair of servers, if the token range used to be 0x0000-0x051e, their new token range becomes 0x0000-0x028e and the newly added server pair claims 0x028f-0x051e. However, the newly set up servers would not have all the images for their token range until the asynchronous process synced them all over. Therefore, the token range of the existing servers remains untouched in this phase. We just add a new token range configuration in the application server for the new servers, so they get a copy of images uploaded after this configuration is set up. This defines our dual-write strategy. Additionally, the Varnish cache servers still read only from the old server pairs until we confirm the synchronization process is done.

At this stage we also kick off a script that runs on one server in each pair, which looks up the images that belong to the new shard and copies them over. We also use I/O throttling flags in rsync to avoid slowing down the service upon cache misses on Varnish.

We need to wait for all syncs between old and new servers to complete, before we move to the next phase. This process for the images took 2 weeks to complete.

The metrics to monitor in this phase are:

  • Cache Miss Request Latency: this value is going to increase a bit because our copying process takes a lot of I/O bandwidth, but if it increases drastically to the point that service SLAs are impacted, then you’d want to think about strategies to isolate the slowness. In our scenario, we used rsync’s throttling mechanism to slow down our process.

Phase II – Reconfigure Token Map

At this point, the new servers have a full copy of the images they are responsible for serving based on their token map. The old servers, however, still have the extra half, which we will clean up once there are no issues with the traffic.

migrating-live-traffic-to-image-service
Figure 3 – Reconfigure Token Map

In this stage, we want to remove the dual-write mechanism from the app servers to both old and new image server pairs. We simply do this by reconfiguring the token map assigned to each server pair so that each owns only its equal share of tokens. At this stage we also reconfigure the Varnish cache nodes’ logic to handle the new token map and use all server pairs as backends.

Good metrics to monitor to ensure everything is good in this setup are:

  • Cache Hit Ratio: this metric should not be affected as we only changed the origin backend.
  • Cache Miss Request Latency: this is the latency for loading images from the origin server pairs. Since we doubled the capacity, the latency should decrease because I/O is less stressed. This metric is a proxy for I/O performance, but if we wanted to drill down, we could also look at disk I/O latency, which should have gone down since each disk’s load was cut in half.
  • Cache Miss Request Status: this is the HTTP status code of requests sent to the origin server pairs. If everything was copied correctly and the configuration maps are right, there should not be any increase in 404 Not Found status codes unless the requested file name is bogus. If this increases, it could mean that some files are missing or the token map configuration is sending requests to the wrong server pair.

Phase III – Cleanup

For the cleanup, we executed a script that deleted the files that had been moved to the new servers and no longer belonged to the old servers’ token range. In this phase, we also used strategies to throttle I/O. Since the Linux rm command doesn’t come with its own throttling flag, we used ionice to throttle the disk I/O.

Conclusion and Next Steps

Time and service continuity were the biggest influences on implementing this scaling strategy. The solution gave us enough headroom to operate for a few more months while we strategized around moving this to the cloud for good. The performance of this architecture was good enough to meet the maximum 150ms object load time, but that performance goal is based solely on serving the North America region and not the entire world. We learned how difficult it is to scale this cluster and why it won’t work in the long run. All the other problems remained unsolved until we moved this to the cloud as the next step.

The post Scaling a Sharded Image Service appeared first on AryaNet.

]]>
https://aryanet.com/blog/scaling-a-sharded-image-service/feed 0
Shrinking the Cassandra cluster to fewer nodes https://aryanet.com/blog/shrinking-the-cassandra-cluster-to-fewer-nodes https://aryanet.com/blog/shrinking-the-cassandra-cluster-to-fewer-nodes#respond Thu, 26 Sep 2013 20:40:48 +0000 http://aryanet.com/?p=692 5 minutes readI have recently shrank the size of a Cassandra cluster from 24 m1.xlarge nodes to 6 hi1.4xlarge nodes in EC2 using Priam. The 6 nodes are significantly beefier than the nodes I started with and are handling much more work than 4 nodes combined. In this article I describe the process I went through to […]

The post Shrinking the Cassandra cluster to fewer nodes appeared first on AryaNet.

]]>
5 minutes read

I have recently shrunk the size of a Cassandra cluster from 24 m1.xlarge nodes to 6 hi1.4xlarge nodes in EC2 using Priam. The 6 nodes are significantly beefier than the nodes I started with, and each is handling more work than 4 of the old nodes combined. In this article I describe the process I went through to shrink the size of our cluster and replace the nodes with beefier ones without downtime: not by swapping each node one at a time, but by creating an additional virtual region with beefier nodes and switching traffic to it.

Background:

I use Netflix’s Priam for building a multi-region Cassandra cluster on AWS. Initially my cluster consisted of 24 m1.xlarge nodes in a multi-zone and multi-region setup with the following settings:

  • Hosted in 2 regions;
  • Hosted in 3 zones in each region;
  • Hosted 4 nodes in each zone;
  • Used NetworkTopologyStrategy having a replication factor of 3 in each region;

cassandra_large_multi_region_ring
The above configuration makes Priam pick the tokens such that each availability zone has a full copy of the dataset. This means each node is responsible for 25% of the dataset. As traffic and data size significantly increased, I needed to keep up with strict SLAs. Part of keeping performance smooth was eliminating all possible variables that could contribute to sporadic latency spikes. Since the data models in this use case need strong consistency, I had set read and write requests to use LOCAL_QUORUM. In addition, I automatically run anti-entropy repairs every other night. The performance of the m1.xlarge machines deteriorated over time, and I needed to rapidly make changes to eliminate those performance bottlenecks. As the majority of the Cassandra community knows, running repair is a CPU and I/O intensive task, and I sometimes observed CPU utilization going beyond 80%. I also had over 40Gb of data per node, which is more than twice the total memory capacity, so a significant number of queries were hitting the disk, resulting in inconsistent performance. I had two options to choose from:

  1. Double the size of the ring; costs more money;
  2. Use fewer but beefier machines; cost remains relatively the same;

I picked hi1.4xlarge machines because they came with SSDs as their ephemeral storage and they were available in both regions I was using. I loaded the full dataset onto 3 nodes in one region and performed some benchmarks. I was happy with the results:

  • 1ms average read latency on random read/write workload compared to 4ms before;
  • 7ms average read latency for multi-get reads of 50 keys compared to 70ms;
  • 95th percentile was < 200ms compared to the ones up to 700ms; (there were lots of work done here to improve the 95th percentile, I will write about it separately)

Prerequisites:

The prerequisites for this exercise are the following:

  • Cassandra 1.2.1+ for the new nodes;
  • Priam with my patch available on Github for the new nodes;
  • Hosted on AWS EC2;
  • Using Ec2Snitch or EC2MultiRegionSnitch;
  • Love and Guts;

The NetworkTopology of the cluster was like the figure above. You may have a different replication factor or may not be multi-region; it does not matter.

Implementation:

cassandra_small_multi_region_ring
Now that I know I want to shrink the cluster to fewer machines, I can do it in two different ways:

  • Replace one node at a time;
  • Create another datacenter in the form of virtual region in the same AWS region and switch traffic to it;

Since Priam doesn’t have cluster shrinking built into it, I decided to go with the second option. This provided me with a clean-slate cluster acting as another datacenter (in Cassandra terminology) in the same AWS regions as the existing cluster. This way, I could evaluate the health of the new 3 nodes before having them serve traffic, and the cleanup process for the old nodes would be fairly simple. In this method, I utilize the patch in CASSANDRA-5155 to add a suffix to the datacenter name in the new cluster.

  1. Create a set of Priam configurations with a new name. For example, if my existing app name for Priam was cass1, I created a new set of configurations with app name cass1_ssd;
  2. I added two new properties to the PriamProperties domain which let you override the datacenter name and optionally add a suffix to it; if you don’t specify the dc.name property, dc.name will default to the AWS region like the original behavior; dc.suffix will get appended to dc.name;
  3. If you are not using the security group Priam property, you will have to use it as you want these new nodes with the new app name to map to your existing cluster, so configure acl.groupname property to be the same security group as your existing cluster;
  4. Configure zones.available property and add the same suffix you are using everywhere else to the zone names; for example, us-east-1b would be us-east-1b_ssd;
  5. Create the configuration file /etc/cassandra-rackdc.properties with the following content on the new Cassandra nodes:

    dc_suffix=_ssd

  6. Start the new nodes; they should show up in your existing ring adjacent to the already existing data centers (AWS regions), so you will have something like us-east-1 and us-east-1_ssd as your datacenter names;
  7. By now you will have noticed that the nodes are seen, but because you are using NetworkTopologyStrategy, Cassandra is not sending them data for your existing keyspaces; update the existing keyspaces accordingly to have replication send data to these new nodes; I had 3 replicas in each region, so in my case I would execute this in cassandra-cli:
    use keyspace_production;
    update keyspace keyspace_production with strategy_options = {us-east-1:3;us-east-1_ssd:3};
  8. Now that the writes are being forwarded to these new nodes, we need to make sure they claim the entire dataset from the older nodes; this can be done by running nodetool repair -pr;
  9. Once the repair of each node is complete, you should be able to switch your application servers to use these new nodes for read and write operations;
  10. I would run another round of repair on the new nodes just in case to make sure no write was missed;
  11. At this time you are ready to kill the old nodes. This can be done like in step 7 by removing the old datacenter from the replication strategy options;
  12. Kill the old nodes and go to bed;

Caveats:

You probably already know a lot of these best practices. But basically, when you are doing such a massive cluster change, you need to worry about the throughput of your disks and network. Make sure you do the following:

  1. Make sure you keep your Priam backups in case you screw up and have to roll back;
  2. Run repair on one node at a time;
  3. Use nodetool setstreamthroughput to change the streaming throughput to what your old nodes can actually handle without affecting your production traffic;

Conclusion:

In this article, I described how I migrated a large cluster to fewer nodes without downtime using a patch I have added to Priam. My approach was focused on minimizing impact on live traffic while we were doing this. You can get the updated Priam code from GitHub.

 

The post Shrinking the Cassandra cluster to fewer nodes appeared first on AryaNet.

]]>
https://aryanet.com/blog/shrinking-the-cassandra-cluster-to-fewer-nodes/feed 0
About Arya Goudarzi https://aryanet.com/blog/about-arya-goudarzi Wed, 21 Aug 2013 23:22:22 +0000 http://aryanet.com/?p=677 < 1 minute readArya Goudarzi is a Holistic Software Architect and Entrepreneur living in Silicon Valley. He is a pragmatist problem solver who transforms vision and design into tangible solutions. His past work shines on Yahoo! Search, Yahoo! Homepage, Gaia Online, and a Facebook social game hit Monster Galaxy. Currently he is responsible for day to day Infrastructure […]

The post About Arya Goudarzi appeared first on AryaNet.

]]>
< 1 minute read
Arya Goudarzi is a Holistic Software Architect and Entrepreneur living in Silicon Valley. He is a pragmatist problem solver who transforms vision and design into tangible solutions. His past work shines on Yahoo! Search, Yahoo! Homepage, Gaia Online, and a Facebook social game hit Monster Galaxy. Currently he is responsible for day to day Infrastructure Operations of CardSpring where he used his extensive knowledge of systems and development to build an automated, cost-effective, highly available, and robust infrastructure that runs CardSpring’s API serving its publishers, financial partners, and millions of API user calls per day under strict SLAs. In his free time, he enjoys cooking, reading non-fiction books, painting, riding his road bike, and improvising at ComedySportz San Jose.

The post About Arya Goudarzi appeared first on AryaNet.

]]>
Understanding Cassandra’s Thrift API in PHP https://aryanet.com/blog/understanding-cassandras-thrift-api-in-php https://aryanet.com/blog/understanding-cassandras-thrift-api-in-php#respond Tue, 15 Jun 2010 00:19:13 +0000 http://www.aryanet.com/?p=145 5 minutes readLearn how to form a Cassandra batch_mutate call using Thrift PHP APIs with code snippets.

The post Understanding Cassandra’s Thrift API in PHP appeared first on AryaNet.

]]>
5 minutes read
For the past few months I’ve been focused on developing a simple and fast abstraction layer on top of Cassandra’s Thrift API. One of my priorities was ease of use for the other developers at my company, which required hiding Thrift from them. Thus, I had to understand how the Thrift API calls are made for each request. As I watched Cassandra’s IRC channel, I noticed lots of newbies have issues understanding the API calls. This post shows you how to read Thrift’s auto-generated code and understand how to formulate your calls and parameters for each API call to Cassandra, so that you get up to speed faster. I am using PHP in my examples and I have been developing on Cassandra 0.7 trunk.

Update: I have updated this post with the changes in Cassandra 0.7 beta2 and verified that the Thrift API version works.

If you have used Thrift yourself to generate the interface code, you’ll notice a folder called gen-php gets created where you run the thrift -gen command. Inside that folder, there are 3 files:

Cassandra.php is the main point of entry to study Cassandra Thrift API calls. You’d see it begin with this interface:

interface CassandraIf {...}

followed by a class implementing that interface:

class CassandraClient implements CassandraIf {...}

In your program, you will instantiate CassandraClient probably this way:

$socket = new TSocket(array('127.0.0.1'),array('9160'),TRUE);
$client = new CassandraClient(new TBinaryProtocolAccelerated(new TFramedTransport($socket)));

I said “probably” because depending on your Cassandra configuration you may want to instantiate TBufferedTransport instead of TFramedTransport, or depending on your PHP setup, you may want to use TBinaryProtocol instead of TBinaryProtocolAccelerated. But anyway, that is not the point of this article, so shift your focus back to those three files we’ve auto-generated with Thrift.
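
For example, assuming the standard Thrift PHP classes are already on your include path (as in the snippet above), the buffered, non-accelerated variant would be instantiated like this:

// Same $socket as before, but with TBufferedTransport and plain TBinaryProtocol.
$client = new CassandraClient(new TBinaryProtocol(new TBufferedTransport($socket)));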

So, Cassandra.php will be our starting point of study. The next file is:

cassandra_types.php, which defines many classes, all prefixed with the package keyword cassandra_. These are the object types you will need to construct and pass to the API calls in CassandraClient.
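For example, cassandra_Column is one of those types, and like the rest of them its constructor accepts an associative array of field values; the name, value, and timestamp below are purely illustrative:

$column = new cassandra_Column(array(
    'name'      => 'email',
    'value'     => 'arya@example.com',
    'timestamp' => intval(microtime(true) * 1000000), //microseconds, which is what Cassandra expects
));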

And finally the last file is:

cassandra_constants.php, which I will give the least attention, since you will hardly interact with it at all except to check the API version, and that is the most important line in there. The API version tells you which version of the generated code you are using. This version has to match the Thrift server’s version so that the method definitions and their behavior are the same, the API calls you make to the server make sense to it, and the server’s responses make sense to your client.
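If you want to sanity-check the other side of that equation, the Cassandra 0.7 Thrift API exposes a describe_version() call; a one-liner like this sketch prints the version the server speaks, and you can compare it by eye with the VERSION value inside cassandra_constants.php:

echo 'Server Thrift API version: ' . $client->describe_version() . "\n";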

Now that we know where to look for the resources we need for this exercise, let’s start scratching our heads and formulate the most confusing API call of them all: batch_mutate.

Step 1: Let’s take a look at the definition of CassandraClient->batch_mutate() in Cassandra.php:

public function batch_mutate($mutation_map, $consistency_level)  {
    $this->send_batch_mutate($mutation_map, $consistency_level);
    $this->recv_batch_mutate();
}

Step 2: If you follow the code inside send_batch_mutate, you’ll see the arguments are mapped to a class named cassandra_Cassandra_batch_mutate_args. Looking further in the same file, you’ll find the definition of cassandra_Cassandra_batch_mutate_args. Let’s just focus on its constructor:

class cassandra_Cassandra_batch_mutate_args {
  static $_TSPEC;

  public $mutation_map = null;
  public $consistency_level =   1;

  public function __construct($vals=null) {
    if (!isset(self::$_TSPEC)) {
      self::$_TSPEC = array(
        1 => array(
          'var' => 'mutation_map',
          'type' => TType::MAP,
          'ktype' => TType::STRING,
          'vtype' => TType::MAP,
          'key' => array(
            'type' => TType::STRING,
          ),
          'val' => array(
            'type' => TType::MAP,
            'ktype' => TType::STRING,
            'vtype' => TType::LST,
            'key' => array(
              'type' => TType::STRING,
            ),
            'val' => array(
              'type' => TType::LST,
              'etype' => TType::STRUCT,
              'elem' => array(
                'type' => TType::STRUCT,
                'class' => 'cassandra_Mutation',
                ),
              ),
            ),
          ),
        2 => array(
          'var' => 'consistency_level',
          'type' => TType::I32,
          ),
        );
    }
    if (is_array($vals)) {
      if (isset($vals['mutation_map'])) {
        $this->mutation_map = $vals['mutation_map'];
      }
      if (isset($vals['consistency_level'])) {
        $this->consistency_level = $vals['consistency_level'];
      }
    }
  }
  // ... generated reader and writer methods omitted for brevity ...
}

Thrift translates the data structures defined for a given system into something called a _TSPEC. In this case, the system Thrift talks to is Cassandra, and the specific data structure here is the one that carries batch_mutate’s arguments.

Step 3: Here comes the difficult part, and that is understanding Thrift types. Thrift’s wiki has a decent explanation of the types, so I recommend a visit there before proceeding; alternatively, you can read my post about interpreting Thrift’s data types and the _TSPEC. Let’s now focus on the batch_mutate args structure. From reading the code above, you can see that mutation_map is a map of maps of lists of cassandra_Mutation. Confusing enough, but what does that mean? Maps in PHP are equivalent to hashed arrays, which are arrays with unique string keys, and lists are plain arrays with numeric indexes. But what should the keys of those outer arrays be? That is where I got very confused and made a trip to the Cassandra API wiki, which says

the outer map key is a row key, the inner map key is the column family name

So, I figured out what both keys are; code-wise it will look something like this:

$mutation_map =
    array('row_key1'=>array('ColumnFamily1'=>
                array($cassandra_Mutation1,$cassandra_Mutation2,...)),
          'row_key2'=>array('ColumnFamily2'=>
                array($cassandra_Mutation3,$cassandra_Mutation4,...)));

Step 4: Cassandra’s actual data types are all prefixed with the keyword cassandra_ and are defined inside the file I previously mentioned, cassandra_types.php. In this step we will look at the class cassandra_Mutation:

class cassandra_Mutation {
  static $_TSPEC;

  public $column_or_supercolumn = null;
  public $deletion = null;

  public function __construct($vals=null) {
    if (!isset(self::$_TSPEC)) {
      self::$_TSPEC = array(
        1 => array(
          'var' => 'column_or_supercolumn',
          'type' => TType::STRUCT,
          'class' => 'cassandra_ColumnOrSuperColumn',
          ),
        2 => array(
          'var' => 'deletion',
          'type' => TType::STRUCT,
          'class' => 'cassandra_Deletion',
          ),
        );
    }
    // ... assignment of $vals to fields and generated reader/writer methods omitted ...
  }
}

We trace this down to the other Cassandra types we need in the same file:

class cassandra_ColumnOrSuperColumn {
  static $_TSPEC;

  public $column = null;
  public $super_column = null;

  public function __construct($vals=null) {
    if (!isset(self::$_TSPEC)) {
      self::$_TSPEC = array(
        1 => array(
          'var' => 'column',
          'type' => TType::STRUCT,
          'class' => 'cassandra_Column',
          ),
        2 => array(
          'var' => 'super_column',
          'type' => TType::STRUCT,
          'class' => 'cassandra_SuperColumn',
          ),
        );
    }
    // ... assignment of $vals to fields and generated reader/writer methods omitted ...
  }
}

class cassandra_Deletion {
  static $_TSPEC;

  public $timestamp = null;
  public $super_column = null;
  public $predicate = null;

  public function __construct($vals=null) {
    if (!isset(self::$_TSPEC)) {
      self::$_TSPEC = array(
        1 => array(
          'var' => 'timestamp',
          'type' => TType::I64,
          ),
        2 => array(
          'var' => 'super_column',
          'type' => TType::STRING,
          ),
        3 => array(
          'var' => 'predicate',
          'type' => TType::STRUCT,
          'class' => 'cassandra_SlicePredicate',
          ),
        );
    }
    // ... assignment of $vals to fields and generated reader/writer methods omitted ...
  }
}

OK, do you get it now? You can dive even deeper into cassandra_SlicePredicate and the others (the deletion sketch near the end of this post puts it to use), but I think I have made my point and don’t need to copy and paste more code from Thrift.

Step 5: Now, for the last step, we will build these structures bottom-up and pass the final result to the batch_mutate method. So, let’s create some example columns and insert them into Cassandra:

//This function is very important in generating correct timestamps for Cassandra
//Read my other post about Cassandra timestamps and PHP
function cass_time() {
     return intval(microtime(true)*1000000);
}

//Let's produce some columns we want to insert
$columnA = new cassandra_Column(array('name'=>'column a','value'=>'column a value','timestamp'=> cass_time()));
$columnB = new cassandra_Column(array('name'=>'column b','value'=>'column b value','timestamp'=> cass_time()));

//In our design we will use one named super column which holds the columns
$columns = array($columnA,$columnB);
$sc = new cassandra_SuperColumn(array('name'=>'super column a','columns'=>$columns));

//We need to form this object, giving it our super column instance because it is what mutation object wants
$c_or_sc = new cassandra_ColumnOrSuperColumn(array('super_column'=>$sc));

//Now create a mutation and give it our ColumnOrSuperColumn object
$mutation = new cassandra_Mutation(array('column_or_supercolumn'=>$c_or_sc));

//Now we create the mutation map as shown in Step 3
$mutation_map = array();
$mutation_map['row_key']['Super1'][] = $mutation;

//Voilà! Let's create a client, open the transport, and call batch_mutate()
$socket = new TSocketPool(array('127.0.0.1'), array(9160));
$transport = new TFramedTransport($socket);
$client = new CassandraClient(new TBinaryProtocolAccelerated($transport));
$transport->open();

$client->set_keyspace('Keyspace1');
$client->batch_mutate($mutation_map,cassandra_ConsistencyLevel::ONE);

Here are the noteworthy facts about the code snippet above:

    I am using the default Keyspace1 keyspace and the Super1 column family that ship with the default Cassandra configuration .yaml in Cassandra 0.7.
    The Thrift API version I have is 19.2.0, which ships with Cassandra 0.7 beta2. You can check this inside the cassandra_constants.php file.
    Since I chose the Super1 super column family, I had to create a SuperColumn instance to pass to cassandra_ColumnOrSuperColumn. If you were using a non-super column family, you would instead create one cassandra_ColumnOrSuperColumn per column, set its column property, and map each one to its own mutation object (see the sketch after this list).
    The batch_mutate call is intended to be all-or-nothing: either all the mutations inside the mutation_map are saved into Cassandra, or a failure is reported.
    When constructing the client, I use TSocketPool() because I usually run a cluster of more than one node and it lets me hand the client a pool of nodes; for a single node, plain TSocket does the same job.
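For completeness, here is a rough sketch of those two variations: a mutation against a non-super column family, and a deletion driven by a cassandra_SlicePredicate. It reuses cass_time() and $client from the snippet above, and it assumes a standard column family named Standard1 exists in Keyspace1; adjust the names to match your own schema.

//Variation 1: a plain column going into a standard (non-super) column family
$column = new cassandra_Column(array('name'=>'column c','value'=>'column c value','timestamp'=>cass_time()));
$c_or_sc = new cassandra_ColumnOrSuperColumn(array('column'=>$column));
$insert_mutation = new cassandra_Mutation(array('column_or_supercolumn'=>$c_or_sc));

//Variation 2: a deletion of another column, driven by a SlicePredicate
$predicate = new cassandra_SlicePredicate(array('column_names'=>array('column d')));
$deletion = new cassandra_Deletion(array('timestamp'=>cass_time(),'predicate'=>$predicate));
$delete_mutation = new cassandra_Mutation(array('deletion'=>$deletion));

//Both mutations use the same mutation_map shape: row key => column family => list of mutations
$mutation_map = array();
$mutation_map['row_key']['Standard1'][] = $insert_mutation;
$mutation_map['row_key']['Standard1'][] = $delete_mutation;
$client->batch_mutate($mutation_map, cassandra_ConsistencyLevel::ONE);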

I hope this helped you navigate the Thrift-generated code more efficiently and build your client faster. Comments are welcome, and I will use your feedback to improve this post for everyone.

The post Understanding Cassandra’s Thrift API in PHP appeared first on AryaNet.

]]>
https://aryanet.com/blog/understanding-cassandras-thrift-api-in-php/feed 0
Protected: روان ناروان https://aryanet.com/blog/%d8%b1%d9%88%d8%a7%d9%86-%d9%86%d8%a7%d8%b1%d9%88%d8%a7%d9%86 https://aryanet.com/blog/%d8%b1%d9%88%d8%a7%d9%86-%d9%86%d8%a7%d8%b1%d9%88%d8%a7%d9%86#respond Tue, 10 Nov 2009 10:07:22 +0000 http://www.aryanet.com/?p=301 There is no excerpt because this is a protected post.

The post Protected: روان ناروان appeared first on AryaNet.

]]>
< 1 minute read

This content is password protected. To view it please enter your password below:

The post Protected: روان ناروان appeared first on AryaNet.

]]>
https://aryanet.com/blog/%d8%b1%d9%88%d8%a7%d9%86-%d9%86%d8%a7%d8%b1%d9%88%d8%a7%d9%86/feed 0
Engineering Art https://aryanet.com/blog/engineering-art https://aryanet.com/blog/engineering-art#respond Tue, 05 Jun 2007 12:55:12 +0000 /?p=38 < 1 minute readI believe engineers are good artists. If you don’t believe me, look at the picture above. I drew this for my Digital Circuits class project and it is simply a 16 bit adder. Believe it or not, it took me more than 2 weeks to complete the drawing. And it is an engineering art. Its […]

The post Engineering Art appeared first on AryaNet.

]]>
< 1 minute read

Full Adder Design

I believe engineers are good artists. If you don’t believe me, look at the picture above. I drew this for my Digital Circuits class project, and it is simply a 16-bit adder. Believe it or not, it took me more than 2 weeks to complete the drawing. And it is engineering art: its size is only 4 nm squared. Now, go to the arts department and ask a professor to draw such a small painting. I bet they will laugh at you. So who is the better artist?

The post Engineering Art appeared first on AryaNet.

]]>
https://aryanet.com/blog/engineering-art/feed 0