Web Scale Image Service

In this article I'll describe the journey to architect and build a web scale image service. This could be the way very large services like Pinterest and Instagram host and serve their images.
Recall the previous article, where we discussed scaling a sharded image service in a data center. We walked away with a low-risk solution because time was the limiting factor in our decision, but it didn't really solve the core architectural problems; the solution was not web scale. Let's review those problems:
Amazon S3 is a good choice for storing images in the cloud. It has a durability of 99.999999999% and availability of 99.99%. However, we need to prefix our object keys so that we can achieve maximum read throughput from S3. Read the AWS documentation to learn how to organize objects using prefixes.
Since the image file names are hashes, we take the first 6 characters to form object prefixes in S3.
```
On local disk:
/home/images/3972b16287e595a88a50624587bccc66ece0f4c7.jpg

On S3:
s3://my-image-bucket/3972/b1/3972b16287e595a88a50624587bccc66ece0f4c7.jpg
```
This has a couple of advantages. Because S3 shards data by prefix, prefixing objects distributes them across different S3 clusters based on the first 6 characters of the key. When we have a high cache miss ratio due to cold caches, the load does not hit a single S3 cluster; this solves the thundering herd problem with S3 requests. Additionally, if we want to scan the images in the bucket within a particular range, we don't have to scan the entire bucket; scanning by prefix is a much faster option.
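As a concrete illustration, here is a minimal PHP sketch of how the prefixed key can be derived from the hash; the helper name is hypothetical and not part of our production code:

```php
<?php
// Build the prefixed S3 key from the image's file name hash.
// The first 4 characters plus the next 2 become the key prefix,
// e.g. "3972b16287..." -> "3972/b1/3972b16287....jpg".
function prefixedImageKey($fileNameHash)
{
    $prefix = substr($fileNameHash, 0, 4) . '/' . substr($fileNameHash, 4, 2);
    return $prefix . '/' . $fileNameHash . '.jpg';
}

echo prefixedImageKey('3972b16287e595a88a50624587bccc66ece0f4c7');
// 3972/b1/3972b16287e595a88a50624587bccc66ece0f4c7.jpg
```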
Please note that for any prefix scheme you design, you should file a support request with AWS describing your scheme and desired throughput. The AWS team can then do some magic in the backend to ensure their service handles your requests with millisecond latency at that throughput.
Application code was set up to write to both the image shards and S3. We also set up a nightly syncing process between the image servers and S3, and created a bucket in another region for multi-region image storage. Hence, we would have even higher durability and better performance in case of a cache miss.
We used the ImageMagick library with its PHP bindings (Imagick) to write a simple API script that reads images from S3, performs resizing on the fly, and returns the output to the cache servers. There are many ways to host this API.
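A stripped-down sketch of such a script is shown below, using the AWS SDK for PHP and the Imagick extension; the bucket name, request parameters, and the prefixedImageKey() helper are illustrative rather than our exact production code:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

// hash and width come from the request routed by the cache layer
$hash  = $_GET['hash'];
$width = (int) $_GET['width'];

$s3 = S3Client::factory(array('region' => 'us-west-2'));
$object = $s3->getObject(array(
    'Bucket' => 'my-image-bucket',
    'Key'    => prefixedImageKey($hash),
));

// resize on the fly; a height of 0 preserves the aspect ratio
$image = new Imagick();
$image->readImageBlob((string) $object['Body']);
$image->thumbnailImage($width, 0);

header('Content-Type: image/jpeg');
echo $image->getImageBlob();
```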
We were exploring CDN options to replace our Varnish cache. Akamai at the time had launched a solution within its Image Manager product suite that did image resizing at the edge. We performed several load tests, but couldn't achieve the desired performance in the cache miss scenario. This could be due to several factors involving the dynamic routing of Akamai's SureRoute and where our us-west-2 (Oregon) origin was relative to Akamai's POPs. We worked closely with the Akamai team, but the root cause of the performance issues was never fully remedied, so we moved on from testing this solution further.
One solution is to host EC2 instances with Apache and PHP FastCGI. This is the classic approach of running the API code on auto-scaled machines. Additionally, our existing configuration management already had packages to set up this machine role, so we only had to decouple storage to S3 and modify our PHP script a tiny bit to read objects from S3.
Load testing this solution showed that the desired performance could be achieved in both cold start and warm cache operation.
When we worked on the above solution in the summer of 2014, AWS had neither a managed Kubernetes service nor Lambda. Today, even more cost-effective solutions exist. I added these sections to this article as a reference for the future evolution of this service.
At the time, Kubernetes had just started to gain momentum in the software community. The team didn't have any prior experience running workloads on K8s, so evaluating and settling on this option would have taken longer. We simply skipped it due to time constraints.
AWS Lambda was released in November 2014, just two months after we launched the EC2-based solution. The simplest way to host this API in a serverless fashion is to use API Gateway with a Lambda function, which is also the most cost-effective way. However, when load testing this solution, performance can be spotty, which is attributable to how Lambda functions are provisioned and persisted in AWS. We had request load metrics from our last cold start and warm operation of the image cache, so we could follow AWS Lambda performance optimization tips and tune the Lambda configuration to keep concurrency provisioned for the cold start scenario before launch, then later tune it down.
This would be the ideal cost-effective solution once its performance is proven. It is serverless and has a very low maintenance cost.
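For reference, the knob that exists today for this is provisioned concurrency, which did not exist back then; something along these lines could be used to keep instances warm ahead of a cold-start-heavy launch, with the function name, qualifier, and count being placeholders:

```bash
# keep N warm instances of the resize function before launch,
# then lower or remove the setting once the CDN cache is warm
aws lambda put-provisioned-concurrency-config \
    --function-name image-resize \
    --qualifier live \
    --provisioned-concurrent-executions 100
```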
We were running several proofs of concept with different CDN providers, starting with Akamai. Since we couldn't beat our baseline performance with Akamai, we knew we had to set up our dynamic resizing solution ourselves, as explained above. This opened up the opportunity to explore a wide variety of CDN providers that didn't have image resizing in their offering. We chose Fastly because our load testing showed the highest performance of the CDN providers we tested, and it significantly beat our baseline too. Another plus for Fastly was its ease of setup: since its backbone runs on Varnish and we were very familiar with Varnish, it significantly simplified our deployment, and we could move several of our Varnish VCLs and security configurations over to Fastly with ease.
We performed load testing with Apache JMeter. We designed our test plan from a list of image URLs extracted from requests in the server logs, using one week of data to capture a good range. We then used the JMeter UI to formulate a test plan, which it saves in XML format. We took that XML as our base test plan, and since the CDN domain was parameterized, we could simply switch it in place via a script. A CloudFormation stack would spin up EC2 instances in various AWS regions to simulate live traffic from multiple locations. In each EC2 instance's startup user-data script we made the changes we needed to the XML, fired up JMeter to run the test, and then uploaded the resulting JTL files to S3. Later we loaded the JTL files back into JMeter and looked at the plots to get averages and standard deviations for request latencies. At the same time, we monitored cache warmup by looking at metrics like hit ratio.
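The user-data script on each load generator looked roughly like the following sketch; the file names, domain placeholder, and results bucket are illustrative:

```bash
#!/bin/bash
# swap the parametrized CDN domain into the base test plan
sed -i 's/__CDN_DOMAIN__/cdn.mywebsite.com/g' /opt/loadtest/image-test-plan.jmx

# run JMeter headless and record results as a JTL file
jmeter -n -t /opt/loadtest/image-test-plan.jmx -l /opt/loadtest/results.jtl

# tag results with the instance id and upload them for later analysis
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws s3 cp /opt/loadtest/results.jtl "s3://my-loadtest-results/${INSTANCE_ID}.jtl"
```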
The cutover plan to the new image service was straightforward. To pre-warm the cache, we set up a worker that listened to the existing infrastructure's requests and fired off a matching request to the CDN URL. In other words, we mirrored live traffic against the CDN in the background. This helps warm up the LRU cache in Fastly's Varnish, since repeated requests for the same image make it persist in cache memory. Once our cache hit ratio was above 80%, we decided it was a good time to cut over, and within a week after cutover we achieved a 92+% cache hit ratio from Fastly, which was excellent. Our cache misses had a latency of at most 150ms, and cache hits were all served under 20ms.
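A minimal version of that mirroring worker can be as simple as tailing the origin access logs and replaying the request paths against the CDN hostname; the log path, field position, and hostname below are assumptions:

```bash
#!/bin/bash
# replay image request paths from the live access log against the CDN
# (field 7 is the request path in a common-log-format access log)
tail -F /var/log/varnish/access.log | awk '{print $7}' | while read -r path; do
    curl -s -o /dev/null "https://cdn.mywebsite.com${path}"
done
```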
The user experience improved as part of this transition: we noticed a 15% uplift in DAUs and active session length. This supports the long-held belief that performance really does impact user experience.
In conclusion, the new solution resolved all of the outstanding pain points in this service architecture. We achieved durability by using S3 and set up multi-region S3 syncing in case we want multi-region availability later on. Our infrastructure cost for the image service dropped to a fraction of the bills we used to pay for renting servers and bandwidth in various data centers, and the service scales seamlessly without manual intervention. Overall, we built a web scale image service that is fault tolerant, performant, and built to last.
Scaling a Sharded Image Service
I once helped a company whose core product relied on high-resolution images; each page load on its website and mobile apps easily requested 60-100 images. Think of Pinterest or Instagram: as users browse the service, the volume of these requests is staggering, and a performant infrastructure for serving images directly impacts user experience and revenue. In this article I will walk through my first journey scaling the existing architecture, the lessons learned, and the path toward a future design that auto-scales for good.
Please note that, as a security-conscious leader, I have left out details pertaining to the security controls in this infrastructure, and nothing I share here reveals any secret sauce relating to them.
Since the bread and butter of this business is the images its users upload, durability of storage is a must-have criterion. Second to that are performance and availability, because even a single thumbnail loading slowly, or not loading at all, negatively impacts the user experience. Upon taking over this project, I quickly realized that at the current upload rate, the existing image cluster would run out of space in less than 2 months. It was 2014, and nobody should have to worry about storage space anymore given public cloud technologies like Amazon S3, but unfortunately this infrastructure was not in the cloud; it was hosted by a managed data center service provider, which came with its own scaling challenges.
Before we dive into problems and solutions, let’s understand how the existing infrastructure architecture works and what problems we are facing.
There are two managed data center colocations in this architecture, as depicted in Figure 1. The one on the left is the primary data center, which hosts everything from application servers to backend databases, caching servers, search indexes, and image servers. To keep the focus on the image service, the diagram shows only the components involved in storing and serving images; all other components have been left out. The data center on the right hosts only a caching layer that uses the primary data center as its origin backend. You can think of it as a CDN point of presence, or POP. The reason for this multi-region architecture is performance, as the company served all of North America at the time.
Since browsers can handle a limited number of simultaneous requests to a domain, the domain that served the web page was the primary domain, like www.mywebsite.com, and a separate static domain, like st.mywebsite.com, was used to serve the images. The static domain is also a cookie-less domain, meaning requests and responses don't carry any cookie payload. This is a micro-optimization that is best practice for static services: it keeps request payloads low, since cookies can be as large as 4KB and usually trigger server-side processing.
The client browser first sends the page load request to the primary domain, which is hosted only in the US-East colocation, and receives the HTML back. The browser then parses the HTML, detects images coming from the separate static domain, and starts to fetch them asynchronously, 10 at a time. The static domain is configured via a Geo DNS service like Dyn, which resolves the domain to an IP address pointing at either the US-West or US-East colocation based on the user's proximity to each location, for faster connection establishment and data transfer.
Inside each colocation there is an active-passive pair of Nginx+HAProxy load balancers that terminates the connection and routes the request to the appropriate cache backend. At the time HAProxy didn't have TLS support, so a common load balancing stack used Nginx for TLS termination and HAProxy for its core competency, load balancing. Varnish was used for caching images and assets like JS/CSS; there were two clusters of two high-memory machines in each colocation (four boxes total), with one cluster responsible for static-sized images and the other for dynamically sized images, and failover set up between the two. Unfortunately, US-West was the only colocation with the source images residing on separate image servers, so US-East didn't have its own image origin and instead used US-West as its origin backend. This is actually how most CDNs work when the source origin is in one place; however, that one place is a huge risk if the colocation is hit by a disaster.
The cache servers used, as their backend origin, a cluster of machines with large-volume disks in RAID1 and RAID0 configurations. This is where the actual image files were stored. Since there were tens of terabytes of images, vertically scaling the image servers was not sustainable, so a sharding strategy was implemented to distribute the images horizontally among multiple active-active pairs of boxes. I will discuss how this logic worked in the next section. The Varnish configuration was programmed to serve images from its cache upon a hit and to determine which image shard to use as its backend origin in case of a miss.
Images are uploaded from the client browser through the load balancer to the application server. The application uses a consistent hashing strategy to determine which image shard the image falls into, and also creates multiple resized thumbnails of the image.
Here is an example:
An image file is uploaded. To avoid collisions when hashing based on the image file name alone, the owner's username and the upload timestamp are appended to the file name, and then a SHA1 checksum of the resulting string is computed (we are not going to talk about SHA1 collision probability here as it is too low; you can nerd out about it in this Google blog post). To determine which shard the image falls into, the first 4 characters of the resulting hash are taken and used as the shard token for that image.
We call this a consistent hashing strategy because the resulting shard token is derived from the file name and is consistently reproducible by the same algorithm.
```
shard_count = 32
file_name_hash = sha1(original_file_name + username + time_microsecond)
shard_token = file_name_hash.substr(0, 4)
```
The file_name_hash was stored as image metadata in the application database, and that is what was rendered in the image URL (along with some metadata appended for SEO), while the shard_token was used to map the uploaded file to a specific pair of image machines. Before we get to the shard token mapping, let's see how the token range is defined and assigned to each machine. Since SHA1 produces a string of hexadecimal characters, taking the first 4 characters of the string as the token gives us the following total token range:
```
total_token_range = [0000, ffff]
bucket_size = ffff / 32 = 51e
```
Now assume we have 32 pairs of image machines, which translates to 32 image shards. To evenly distribute the image load, we divide this range into 32 equal buckets and assign each bucket to a pair of machines. The upper bound of each bucket's range was stored in the application server configuration as the hash key, so that when an image is uploaded, the application server can take the shard token and iterate over the token ranges, performing an arithmetic comparison against each upper bound to determine which image servers to copy it to:
```
image_cluster_token_map = {
    "051e" => Array("images-m-01.internal_dns", "images-s-01.internal_dns"),
    "0a3c" => Array("images-m-02.internal_dns", "images-s-02.internal_dns"),
    "0f5a" => Array("images-m-03.internal_dns", "images-s-03.internal_dns"),
    ...
    ...
    "fae1" => Array("images-m-31.internal_dns", "images-s-31.internal_dns"),
    "ffff" => Array("images-m-32.internal_dns", "images-s-32.internal_dns")
}

// keys are iterated in ascending order; the first upper bound that is
// greater than or equal to the shard token identifies the owning pair
foreach (image_cluster_token_map.keys() as token) {
    if (shard_token <= token) {
        return image_cluster_token_map[token];
    }
}
```
The same consistent hashing sharding strategy can easily be implemented on the Varnish cache servers: the input is the image URL, which contains the file_name_hash, so on a cache miss the origin image shard can be identified by performing the same arithmetic logic in Varnish VCL.
Now that you have digested the above architecture, you can easily spot multiple challenges with it:
Time was a huge factor since the cluster would run out of space in 2 months. So, I had to evaluate multiple options with their pros and cons:
Here is the step-by-step execution plan for this strategy:
Normally when you order hardware, depending on the architecture (chassis and CPU) you choose and the state of the supply chain, it can take anywhere from 2-6 weeks for servers to arrive at the colocation. I just hoped the timeline would be closer to 2 weeks than 6. Luckily the image servers were simple 1U blades with Xeon E3 processors and Hitachi hard drives; this was the bare-bones server configuration used for all general-purpose application workloads. You can think of them as similar to the M class in AWS EC2. The data center folks, since they also run their own cloud business, had a lot of these servers in stock and ready to repurpose. My only additional requirements were a RAID controller and extra Hitachi drives to give each box more storage capacity. So I asked them to set up 32 more boxes, 16 on rack A and another 16 on rack B, where the other image boxes lived. Each rack had its own switch and separate power source, so at least we'd get power and network redundancy within the single colocation. It took two weeks for the machines to be ready for provisioning.
SaltStack was used to automate provisioning of the machines and configure each box with the necessary services. The next step was to assign new token ranges to each box and implement a dual-write strategy for new uploads as part of phase 1 of this migration.
In order to evenly distribute the load of the existing boxes, we had to split the token ranges they were responsible for. This is why we doubled the number of shards from 32 to 64, for a total of 128 servers.
In the original setup, there were a total of 64 servers and 32 shards/buckets. The pairs were set up in different racks with separate power and network sources for redundancy within the local colocation. The size of each server's bucket was 0x051e. Varnish cache nodes read from all servers, and application servers copied newly uploaded images to the server pair that owned the token range for those images. This is depicted on the left side of Figure 2.
The right side of Figure 2 shows the new architecture. We added a new pair of servers for each existing pair, claiming half of that pair's token range. This means we split the initial bucket size of 0x051e in half, which comes out to 0x028f. Taking the first pair of servers: if its token range used to be 0x0000-0x051e, its new token range becomes 0x0000-0x028e, and the newly added server pair claims 0x028f-0x051e. However, the newly set up servers would not have all the images for their token range until an asynchronous process syncs them over. Therefore, the token range of the existing servers remains untouched in this phase. We just add a new token range configuration in the application server for the new servers, so they receive a copy of images uploaded after this configuration was set up; this defines our dual-write strategy. Additionally, the Varnish cache servers still read only from the old server pairs until we confirm the synchronization process is done.
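To make the dual write concrete, a sketch of the extra phase 1 configuration might look like the following; the hostnames follow the earlier naming convention but are hypothetical:

```
// Phase 1 (dual write): old pairs keep their full token ranges for both
// reads and writes; each new pair is registered only as an extra write
// target for the upper half of its parent's range (e.g. 028f-051e for
// pair 01) until the backfill sync completes.
dual_write_token_map = {
    "051e" => Array("images-m-33.internal_dns", "images-s-33.internal_dns"),
    ...
    "ffff" => Array("images-m-64.internal_dns", "images-s-64.internal_dns")
}
```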
At this stage we also kick off a script, running on one server of each existing pair, that looks up the images belonging to the new shard and copies them over. We also use throttling flags in rsync to avoid slowing down the service when Varnish has a cache miss.
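A sketch of that backfill script, run on one server of each old pair, is shown below; the token range, paths, bandwidth cap, and destination host are illustrative:

```bash
#!/bin/bash
# copy images whose shard token falls in the new pair's range (028f-051e)
# to the new server, capping rsync bandwidth so live reads aren't starved
NEW_SERVER="images-m-33.internal_dns"

for i in $(seq $((16#028f)) $((16#051e))); do
    prefix=$(printf '%04x' "$i")
    rsync -a --bwlimit=20000 /home/images/${prefix}*.jpg \
        "${NEW_SERVER}:/home/images/" 2>/dev/null
done
```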
We need to wait for all syncs between old and new servers to complete before moving to the next phase. For our images, this process took 2 weeks to complete.
The metrics to monitor in this phase are:
At this point, the new servers have a full copy of the images they are responsible for serving based on their token map. The old servers, however, still hold the extra half, which we will clean up once traffic shows no issues.
In this stage, we remove the dual-write mechanism so app servers no longer write to both old and new image server pairs. We simply reconfigure the token map so each server pair owns only its equal share of tokens. We also reconfigure the Varnish cache nodes' logic to use the new token map and all server pairs as backends.
Good metrics to monitor to ensure everything is healthy in this setup are:
For the cleanup, we executed a script that deleted the files that had been moved to the new servers and no longer belonged to the old server's token range. In this phase we also throttled I/O; since the Linux rm command doesn't come with its own throttling flag, we used ionice to throttle the disk I/O.
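The cleanup pass looked something like this sketch (token range and paths illustrative); running the deletes at idle I/O priority keeps them from competing with cache-miss reads:

```bash
#!/bin/bash
# delete images that now belong to the new pair's range (028f-051e),
# running the deletes at idle I/O priority so live reads win
for i in $(seq $((16#028f)) $((16#051e))); do
    prefix=$(printf '%04x' "$i")
    ionice -c3 find /home/images -maxdepth 1 -name "${prefix}*.jpg" -delete
done
```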
Time and service continuity were the biggest influences on this scaling strategy. The solution gave us enough headroom to operate for a few more months while we strategized about moving to the cloud for good. The performance of this architecture was good enough to meet the maximum 150ms object load time, but that performance goal is based solely on serving the North America region, not the entire world. We learned how difficult it is to scale this cluster and why it won't work in the long run. All other problems remained unsolved until we moved to the cloud as the next step.
Cassandra Garbage Collector Tuning, Find and Fix Long GC Pauses
Cassandra garbage collector tuning has been a hot topic on the Cassandra user mailing list from time to time. Users and administrators need to gain some skill in tuning the garbage collector on their Cassandra nodes so they can avoid long stop-the-world pauses that can bring a cluster to a halt. Although the out-of-the-box garbage collector settings work for most use cases, as data volume and request traffic per node increase, so does the need for tuning. In this article I'll explain how you can analyze your Java Virtual Machine and optimize its heap size and settings for your use case. Some of the techniques described here apply to tuning any application running on the JVM, but the majority apply specifically to the way Cassandra works.
After AWS launched the beefy hi1.4xlarge instances with SSDs and Netflix published a great benchmark of Cassandra performance on them, they became very popular in the Cassandra community, and I decided to shrink my Cassandra cluster to fewer nodes utilizing the power of these instances. The initial 24-node cluster was pretty much running with the vanilla GC settings that ship in cassandra-env.sh. Because the 6 nodes had to handle 4 times as much work, garbage collector tuning became important to me, and I had to learn and change some settings so the nodes could utilize their beefed-up hardware. These changes included several JVM settings, and in the end I had to increase the JVM heap from 8Gb to a mighty 24Gb to maintain low SLAs and 95th percentiles below 200ms.
Your clients time out for a period of time. If you are lucky, this period is no more than a few hundred milliseconds; however, by the time people usually complain, it has grown beyond seconds, which is really bad behavior for the database layer. The only assumption I am making here is that these timeouts are due to garbage collector stop-the-world pauses and not other issues such as CPU or I/O saturation. In my case I was able to observe a handful of multi-second stop-the-world GCs during a day of load testing. Hence Cassandra garbage collector tuning became a new topic of interest to me.
Garbage collectors like Concurrent Mark and Sweep (CMS) and Parallel New (ParNew) will always have stop-the-world pauses. Usually the pause is small enough that the application doesn't notice it, but it can be long enough to make the application unresponsive. As a first step, we need adequate logging and monitoring, because when issues occur you want to be able to reference the history of events; tools that only show the current state of the application may not be sufficient. Java provides a nice facility that can log very detailed information about heap statistics and garbage collector performance. Out of the box, Cassandra ships with a set of GC logging parameters that are disabled by default in cassandra-env.sh. I highly recommend enabling those on production clusters. In addition to the default logging options, I recommend enabling these flags as well:
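The exact flags are a matter of taste, but a set along these lines (added to JVM_OPTS in cassandra-env.sh) gives you the class histograms, tenuring distribution, and promotion failures referenced later in this article:

```bash
# assumed additions to cassandra-env.sh, alongside the default GC logging options
JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogramBeforeFullGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogramAfterFullGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
```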
The class histogram in particular is very helpful: when you have long pauses, you want to see which objects are consuming most of the heap, and the histogram gives you that answer.
There are other tools you can use to monitor the heap, such as jstat, and I recommend using them. If you are fancy, you can parse, collect, and plot the data from these tools and logs in your monitoring application. In my case, I plot both the stop-the-world GC times from the logs, which is the most accurate stop-the-world metric, and how the generations inside the heap are utilized, from jstat:
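For the jstat side, sampling the per-generation utilization every few seconds is enough to build these plots; the way of locating the Cassandra PID below is just one option:

```bash
# S0/S1 = survivors, E = eden, O = old generation, P = permanent generation (percent used)
jstat -gcutil $(pgrep -f CassandraDaemon) 10s
```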
In the top graph, a little before 16:00 there is a sharp cliff in the heap space used. This means the CMS collector kicked in. You can see it kicked in around 6Gb of heap usage, which is 75% of the total 8Gb. You also see another, deeper cliff before 21:00. The 75% comes from the GC flag set inside cassandra-env.sh, CMSInitiatingOccupancyFraction=75, which tells the JVM to force CMS to run a full sweep once the heap is at least 75% full. This process doesn't always take long. As you see in the bottom graph, only the first cliff caused a stop-the-world GC of about 15 seconds, which is terrible; the second one didn't, despite freeing more space. Long pauses are usually caused by full GCs, and full GCs happen when CMS cannot collect enough objects concurrently and falls back to the parallel algorithm, which pauses and scans all objects in the heap. I'll talk more about this later in the heap crash course.
If you don't have fancy graphing tools, or forgot to turn on the GC logging options in cassandra-env.sh, Cassandra's system.log will record every stop-the-world GC that took more than 250ms, as long as you did not reduce the logging severity below INFO in log4j-server.xml. You will see log lines like this:
In the GC log you will find something like this:
Congratulations. You have just found out that you have a problem with super long GC pauses. Hence the rest of this article becomes relevant to you.
I highly recommend reading Understanding Java Garbage Collection, which is really well written in my opinion; I am not going to duplicate that great article here. It will help you understand how the JVM heap is set up and how the different GC algorithms work. Then head back here and I will walk you through scenarios that can cause Cassandra to have long GC pauses.
The basic takeaway from Understanding Java Garbage Collection is that the garbage collection process works best under two assumptions: most objects become unreachable soon, and only a small number of objects are referenced between the old and new generations. If these assumptions are violated, we end up with slow GC performance. So for Cassandra we think about how objects end up being long-lived in the heap, leading to this scenario. It is more efficient for objects to expire in the young generation than to get promoted to the old generation and cleaned up there by a full GC cycle. There are usually two scenarios in which the GC algorithms fail and fall back to a full parallel stop-the-world GC, which is the cause of super long pauses:
Premature promotion is when objects don't stay in the young generation long enough to expire there; instead, ParNew cycles promote them to the old generation. The more objects get promoted this way, the more fragmented the heap becomes. Eventually there isn't enough room in the old generation for incoming young objects, and ParNew fails with a promotion failure.
The following graph shows the percentage utilization of the different generations inside the heap. The blue saw-tooth line is the Eden space; on average 51% of Eden is utilized, which is normal. But if you look at the survivor space utilizations, you notice they are only about 5% utilized on average. There is a cliff in the old space utilization before 16:00 which correlates with our slow GC. This means that although the survivor spaces have plenty of room, the old generation keeps filling up before the survivors do, so there is an opportunity to tune the young generation, as we have premature tenuring. Heavily loaded servers, especially those serving thousands of read requests, can easily fall into this scenario.
This can be observed from the GC logs as well:
There are several reasons for CMS to report a concurrent mode failure. Read Understanding CMS GC Logs to familiarize yourself with all of them. The most common one is when the application allocates large objects and CMS gets interrupted in the middle of its work because the tenured generation fills up faster than it is cleaned. A few bad designs and uses of Cassandra can lead to this. If you have a concurrent mode failure, it will be logged in your GC log:
The large block specifically is a sign that there is a large object in the heap.
There are several situations that create heap pressure, and I have personally experienced all of them. In this section I describe each scenario, ordered by how frequently Cassandra users report them. It is advised that you troubleshoot to the root cause before implementing solutions in a trial-and-error fashion. The following assumptions are based on the default heap and GC settings in cassandra-env.sh:
By default, Cassandra runs one compaction thread per available CPU core. During compaction, Cassandra reads each row from all the SSTables it belongs to, cleans up tombstones that have expired beyond gc_grace_seconds, sorts the columns, and writes the row back into a new SSTable. It does all of this in memory, up to the value of in_memory_compaction_limit_in_mb in cassandra.yaml, which is 64Mb by default. Now, if your new generation is 2Gb on a 16-core machine, you can have up to 64Mb x 16 = 1Gb of heap filled with columns from compaction threads. Depending on how compaction is throttled, these objects can live longer in the heap and be quickly promoted to the old space, filling it up. If you are running a high-CPU machine with the default heap settings, this can easily hit you. It is hard to definitively pin the problem on compaction, but with instrumentation in the Cassandra code you can reach that conclusion.
Let’s say a slow stop the world GC has happened and you have a histogram of objects inside the heap from GC logs:
Histogram Before GC:
Histogram after GC:
The histograms are much longer; I am pasting the top part, which is usually the most useful. You can see that before the GC ran, the heap had 113Mb of Column objects. That seems a large number for an entity that should rapidly go in and out of the heap. Columns are the building blocks of data storage in the Cassandra code, so there could be many sources:
First you want to rule out wide rows, which are usually an application issue. Then you can figure out whether it is compaction. If you have coding skills, you can add instrumentation to sample Column object instantiations and see which process is creating so many Column objects:
In this case we saw tons of these, which showed that compactions were creating lots of columns.
Once you have found compaction to be the issue, try throttling it down, and on a high-CPU machine even reduce concurrent_compactors from the default to something like half the number of cores.
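As a sketch, throttling can be applied live with nodetool and made permanent in cassandra.yaml; the numbers below are examples for a 16-core box, not recommendations:

```bash
# takes effect immediately on a running node
nodetool setcompactionthroughput 16

# persisted in cassandra.yaml (Cassandra 1.x option names):
#   concurrent_compactors: 8
#   compaction_throughput_mb_per_sec: 16
#   in_memory_compaction_limit_in_mb: 64
```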
Some people criticize wide rows because they introduce a hotspot into the system, which is true if the key is very active. For implementing global secondary indexes for pagination, however, they can be a simple design, as long as they are handled with care. If your application reads hundreds of columns from a single wide row, it will fill up the new generation fast, causing promotion failures and sometimes concurrent mode failures, leading to super long GC pauses. The first step in figuring out whether you have this problem is to look at your column family statistics with the 'nodetool cfstats' command and check whether there is a column family with a maximum compacted row size above 10Mb:
In this case, the example CF could be the source of a long row: the compacted row maximum size is 223Mb. If a client attempts to read this row in its entirety, there will definitely be a GC hiccup. The last two lines provide useful information about how your clients read from this column family. Average live cells per slice tells us that queries usually read 51 columns at a time, which is good. If this number is above a hundred, depending on the size of the data stored in your columns, you may have this problem and need to tune your application to read fewer columns at a time and paginate through.
When you have rows with dynamic columns and your application deletes those columns, Cassandra places tombstones on them until the next compaction runs and cleans up tombstones that have expired past gc_grace_seconds. When your client does a slice query to read a few columns, even if the majority of columns were previously deleted, Cassandra has to read and scan all the tombstones in order to find as many non-deleted columns as you requested in your slice query. This also causes major GC pauses, especially when the number of tombstones is in the hundreds. You can troubleshoot this by looking at 'nodetool cfstats'; taking the example in the previous section, average tombstones per slice gives you that information. If it is a large value, then that is most likely one source of your problem. The following snippet from the GC logs shows about 66Mb of DeletedColumn objects, which could also be a clue:
To clean up tombstones, you can lower gc_grace_seconds and force user-defined compactions on the SSTables in question. [TODO: write an article for this]
Key caches are very useful, so people sometimes think that increasing the key cache's share of the heap will help their application. That is both true and false. You can increase the key cache size to the point where it occupies a significant portion of the heap; since key cache entries can be long-lived if you have hot rows, they get promoted to the old generation fast. This also limits the space available for other operations like reads, repairs, compactions, and memtables.
If you are running Cassandra 1.1 or earlier, whatever key cache size you have set is a lie. In this case the cluster had a 1Gb key cache size. The problem exposes itself quickly if you have a use case like a map-reduce job that reads a whole bunch of keys at once. Let's look at the following histogram:
There are 773Mb of long arrays, over 1Gb of byte arrays, and over 15 million key cache objects, which is in line with the number of keys the cluster had. If you look at the Cassandra code, there is an assumption that keys are 48 bytes (AVERAGE_KEY_CACHE_ROW_SIZE), and this assumption is used to compute the key cache's heap consumption. You can quickly guess that if your developers have created keys longer than 48 bytes, your key cache can easily use more memory than it should without you knowing it.
The best solution is simply not to have such a large key cache. The default of 100Mb can hold a lot of keys if you aren't crazy enough to use keys longer than 48 bytes; I usually advise my developers to keep keys under 32 bytes. Alternatively, you can disable the key cache entirely for the column families with large keys, or upgrade to Cassandra 1.2, where this issue (CASSANDRA-4315) is fixed.
If you are using the row cache, especially with the old ConcurrentLinkedHashCacheProvider, row cache entries are stored on the heap. For the very same reasons as in the previous section, you may run into heap pressure and super long GCs. Therefore, use SerializingCacheProvider or turn off the row cache completely.
If you have a write-heavy workload, it is usually a good idea to increase the memtable portion of the heap, because it absorbs more writes and leads to more efficient compactions when flushed to disk. However, on a workload with more reads than writes, a large memtable space takes heap away from other purposes like serving reads. Moreover, if memtables aren't flushed because the flush threshold is large, they have a higher chance of being promoted to the tenured space. Usually when long GCs happen and Cassandra is under heap pressure, MemoryMeter.java outputs useful logs to system.log with details about how many cells were in memtables and from which column family. From that output you can decide whether memtables are wasting your heap.
In this scenario, you can reduce memtable_total_space_in_mb in cassandra.yaml. If you have a high write and read workload and you need to optimize for both, you may have to increase the heap size.
If you have a mix of reads and writes on the order of a few thousand of each per node, the default heap settings usually will not work for you. Due to the high load that is the nature of such a cluster (in my case, shrinking the load of 24 nodes onto 6), ParNew cycles happen more frequently. This means that objects in the young generation, which could be anything (columns being read, columns being compacted, key cache entries, memtables, etc.), quickly get promoted to the old generation after one GC cycle. The old generation then fills up faster, and you will potentially have more full GCs. Looking at this example:
Notice all objects have an age of 1. This means they have survived only one ParNew GC round.
In this case, try increasing the value of MaxTenuringThreshold. In my case I had to increase it to 20. This increases the amount of work ParNew has to do, but it prevents objects from being promoted too quickly. Then in the GC logs you can see objects living longer in the young generation:
ParNew auto-tunes itself: after each cycle it recomputes the tenuring threshold, up to MaxTenuringThreshold. If the survivor spaces are filling up with long-lived objects, it lowers the threshold; if objects are expiring quickly, it raises it back toward the maximum.
If you are unlucky like me, you have to go to the final resort, which some people in the Cassandra community hate. You will need more room for your young generation, but it is not advised to increase the young generation beyond 1/4 of the total heap size; in fact, Java HotSpot won't start if you push the young generation beyond 1/3 of the heap. You need a larger old generation for healthy promotions. So gradually increase your heap size and young generation size, up to the point where you don't go beyond 1/2 of your available physical memory. I recommend increasing the heap size by 2Gb at a time and adjusting the young generation size accordingly. Keep in mind to retain your high value of MaxTenuringThreshold.
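In cassandra-env.sh this ends up looking roughly like the following sketch; the heap size matches the 24Gb mentioned earlier, while the new generation value is illustrative, set at 1/4 of the heap:

```bash
# cassandra-env.sh
MAX_HEAP_SIZE="24G"          # grown gradually, 2G at a time
HEAP_NEWSIZE="6G"            # young generation, illustrative value at ~1/4 of the heap
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20"
```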
The drawback of running Cassandra with a larger heap is that your ParNew cycles may take longer, but if you have more CPU it will not be a problem. The benefit is that the majority of your expired objects are cleaned up in the young generation, and at the same time you will not see super long stop-the-world pauses.
In this article on Cassandra garbage collector tuning, we learned that finding the sources of long stop-the-world garbage collector pauses is a tedious task. Every setup is different, and caution is advised when troubleshooting on a production system. We reviewed the following topics:
Now let’s look at the final outcome and compare how I did after the tuning exercises:
The improvement can be observed quickly. From the heap utilization graph you can see that the maximum heap size was increased to 24Gb and that about 10Gb is utilized on average. You can see a cliff in the graph, which is a sign of a CMS GC execution, but in the stop-the-world GC times graph before 20:00 there is no significant GC pause. However, the average GC pause has gone up a bit, since ParNew is cleaning more space. The heap space utilization graph shows that the survivors are utilized 28% on average, compared to 5% before.
If you'd like to read more about garbage collector tuning, there are hundreds of articles on the web; I have cherry-picked a few of my favorites here:
Shrinking the Cassandra Cluster to Fewer Nodes
I recently shrank a Cassandra cluster from 24 m1.xlarge nodes to 6 hi1.4xlarge nodes in EC2 using Priam. The 6 nodes are significantly beefier than the nodes I started with, each handling more work than four of the original nodes combined. In this article I describe the process I went through to shrink the cluster and replace the nodes with beefier ones without downtime: not by swapping each node one at a time, but by creating an additional virtual region with beefier nodes and switching traffic to it.
I use Netflix’s Priam for building a multi-region Cassandra cluster on AWS. Initially my cluster consisted of 24 m1.xlarge nodes in a multi-zone and multi-region setup with the following settings:
The above configuration makes Priam pick tokens such that each availability zone has a full copy of the dataset, which means each node is responsible for 25% of the dataset. As traffic and data size increased significantly, I needed to keep up with strict SLAs. Part of keeping performance smooth was eliminating every variable that could contribute to sporadic latency spikes. Since the data models in this use case need strong consistency, I had set read and write requests to use LOCAL_QUORUM, and I automatically ran anti-entropy repairs every other night. The performance of the m1.xlarge machines deteriorated over time, and I needed to make changes rapidly to eliminate those bottlenecks. As most of the Cassandra community knows, running repair is a CPU- and I/O-intensive task, and I sometimes observed CPU utilization going beyond 80%. I also had over 40Gb of data per node, which is more than twice the total memory capacity, so a significant number of queries were hitting the disk, resulting in inconsistent performance. I had two options to choose from:
I picked hi1.4xlarge machines because they came with SSDs as ephemeral storage and were available in both regions I was using. I loaded the full dataset onto 3 nodes in one region and performed some benchmarks. I was happy with the results:
The prerequisites for this exercise are the following:
The NetworkTopology of the cluster was as shown in the figure above. You may have a different replication factor or may not be multi-region; it does not matter.
Now that I knew I wanted to shrink the cluster to fewer machines, I could do it in two different ways: replace the existing nodes with beefier ones one at a time within the same ring, or stand up the new nodes as an additional virtual datacenter in the same AWS regions and switch traffic over to it.
Since Priam doesn't have cluster shrinking built in, I went with the second option. This gave me a clean-slate cluster acting as another datacenter (in Cassandra terminology) in the same AWS regions as the existing cluster. This way, I could evaluate the health of the 3 new nodes before having them serve traffic, and cleaning up the old nodes would be fairly simple. In this method, I utilized the patch in CASSANDRA-5155 to add a suffix to the datacenter name in the new cluster.
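With that patch in place, the suffix is driven by the snitch configuration; a sketch of what this can look like with the EC2 multi-region snitch (the suffix value is just an example):

```bash
# cassandra-rackdc.properties on the new hi1.4xlarge nodes
# the Ec2MultiRegionSnitch reports e.g. "us-east" as the datacenter;
# dc_suffix makes the new nodes show up as a separate DC, "us-east_ssd"
dc_suffix=_ssd
```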
You probably already know a lot of these best practices, but when you are doing such a massive cluster change, you need to worry about the throughput of your disks and network. Make sure you do the following:
In this article, I described how I migrated a large cluster to fewer nodes without downtime, using a patch I added to Priam. My approach focused on minimizing the impact on live traffic during the migration. You can get the updated Priam code from GitHub.