The Developing Story

Done is More Than You Think

Have you ever had a conversation like this with a developer on your team?

Mara: So…is the IRC meme generator done?

Beck: Yup! All done.

Mara: Ok cool, I’ll take a look at the pull request.

Beck: Oh, wait, I haven’t committed it yet.

If you have, your team probably doesn’t have a consistent definition of the word “done”. At Wattpad, we recently realized that we had this problem. We were going to just write up a checklist that all our developers could consult before declaring “Yup! All done,” but we quickly realized just how important the definition of done is to a company.

To be considered done, anything we do (whether it’s a new feature or a bug fix) should be in the hands of our users (live) and it should make them happy. Yes, this means that a feature could be in production but still not be done. We might, for example, launch a feature in stealth mode at first in order to collect data on performance. Collecting that data is still part of the work for that feature so it’s not done yet. But that’s not all. Another developer, new to the code for a particular feature or bug fix, should also be able to change it easily and be happy in the process. So we’re done if our users are happy and our developers are happy. Makes sense, right?

We hope to achieve this by practicing the following.

Code Quality Standards. You’ve followed the team guidelines for code quality. Things like always leaving the campground cleaner than the way you found it, making sure there are no new compiler warnings, not repeating code, etc.

Manual Testing. You’ve done some thorough manual testing of your own work (on a real device if it’s mobile). You put on your QA hat and really tried hard to break what you just wrote. With millions of users, if you can think of an edge case, it will happen.

Automated Testing. You’ve written unit and integration tests to help future maintainers of your code understand it more easily. You’re also ensuring that regressions on that code will get caught by your tests later and you’re building a safety net to make refactoring less risky in the future.

Performance Testing. Don’t assume your feature will just work when millions of people are using it at the same time. Test it first. Prove that it will work.

Documentation. Yeah, you know how it works. But will another developer six months from now? Document it.

Code Review. Two brains are better than one. Get a teammate to look at your code; they’ll almost always find something you missed.

Rollout Plan. This doesn’t always apply but if, for example, a schema change is required, you’re not done. Plan how to make that schema change and how to roll back if code hits the fan.

Ship It! You’ve crossed your I’s and dotted your T’s so you’re probably feeling pretty confident you can actually deliver your work into the hands of your users. Go for it.

Following Up. As mentioned earlier, it’s not done until users are happy. Now is the time to collect feedback and data to determine if they really are happy. You probably also want to monitor error rates and performance to make sure you didn’t break something else. If all checks out, you’re done!

A company’s definition of done is directly related to the quality of its product. If, for example, what you consider done does not include testing, your product will be buggier than that of a company whose definition does include testing. Thinking you’re done when you’re not is also a great way to incur technical debt. Technical debt won’t necessarily affect the quality of your product but it certainly will affect your team’s velocity (how much stuff they can get done in a given period of time).

More importantly, defining done helps us decide what to do and when. Without a definition, you really can’t estimate how long something will take. One developer may give an estimate based on coding the feature, testing it and writing documentation for it. But another may estimate based solely on coding the feature. Without a consistent way to estimate, it’s impossible to pick the right number of features to fit into your next sprint. If everyone is on the same page with what it means to be done, you can reliably determine feature A won’t fit into this next sprint, but feature B will. Otherwise you’re just guessing.

It’s also important to have a consistent definition of done across teams. Without that consistency, it’s really difficult to gauge how the teams are performing. If team A unit tests but team B doesn’t, on the surface, team B will look like they’re working faster. But in reality, team B’s code is probably buggier and less maintainable.

We’re really looking forward to seeing how this improves the quality of our product and how much more consistent our schedules will get.

@richpoirier

Android Essentials - Continuous Deployment on Google Play

You’re moving fast, real fast, squeezing in your last bunch of changes, doing some quick tests and then you upload the APK to Google Play. You’re about to hit publish; weeks of hard work come down to this. Did I miss anything? How will users react?

As your app grows, this problem increases because you have more users (that’s the whole point, isn’t it?) and you are probably adding more enhancements in each iteration (hooray!). But adding users and features increases your risk of a bad app update, and the impact it has on users.

At Wattpad we’ve been looking for ways to improve our app deployment process: balancing the frequency of updates while maintaining app quality. Here’s what we’ve found works for us, along with a shout-out to the Google Play team for making this possible and some thoughts on how it can be improved.

Our 4-week app release cycle

The first week is for implementing new features and engineering work such as app performance or stability. By the end of the first week, some changes are ready while others are still in progress. To get more mileage and real life feedback, we want everyone internally to use the app. We used to email out the APK or put it up on a web page, but that process was clunky. We saw this new feature on Google Play and thought it would be great for internal testing purposes.

We’ll continue to push out new builds internally until we have something ready for a wider audience. This is typically by the end of the second week. For this we once again use Google Play. We create a special distribution called Wattpad Beta and make it publicly available by uploading it to the Play Store. Anyone can download the beta, but we make sure to tell them that it is a beta version and that we are looking for their feedback. Here’s a blog post on this.

Beta app usage after a few days (Google Analytics)

This is an extremely effective way to get people to use and test your app for free. Currently we have about 25k active beta users. We use Google Analytics to track application usage and crashes. We can also reply to users directly on Google Play in response to problems reported.

By the third week, we have a stable beta build. We use Google Analytics to track and understand how people are using the new features, to tweak features and identify bugs.

Week four is mostly a polishing step where we fix low-impact bugs and make minor UI tweaks. The idea is that by the end of the fourth week we have a very stable release that has gone through several iterations of testing by thousands of users. The approach here is test early, and test often. By taking this approach, we rarely see unexpected issues in our production app.

Since we are creating three separate builds (internal, external beta and production), it can be a bit tedious to keep track of the different version names and codes. There is also quite a bit of manual work required to swap out the icon assets and rename the application package. To minimize this work, we wrote a Python script to take care of it for us.

Go ahead, take our beta app for a spin.

Orrie

Beginner (and not so beginner) tips for Memcached - Part 2

Last time, we talked about getting the most out of the memory and network bandwidth limits and what to do to scale beyond that. Once you start sharding data across multiple memcached servers, be sure to use consistent hashing for your keys. This way as you add or remove memcached servers, it will minimize the impact of key redistribution. Most clients will provide some sort of consistent hashing, so no need to write your own.
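To make the idea concrete, here is a minimal sketch of a consistent-hash ring in Python. It is not the algorithm of any particular client: the HashRing class, the server names and the choice of MD5 with 100 virtual nodes per server are all assumptions made for illustration. In practice you would rely on the consistent hashing built into your client.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> server name
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Place several virtual nodes per server to spread keys evenly.
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def server_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{n}" for n in range(10000)]
ring = HashRing(["cache1", "cache2", "cache3", "cache4"])
before = {k: ring.server_for(k) for k in keys}

ring.add("cache5")  # hypothetical fifth server joins the pool
moved = [k for k in keys if ring.server_for(k) != before[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved after adding a server")
```

With plain modulo hashing, going from four to five servers would remap most keys. On the ring, only the keys that now belong to the new server change owner, so the rest of the cache stays warm.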

As you already know, memcached is pretty light on features. One capability we really wanted was the ability to serve up expired data in some cases. For example, we need to fetch the latest leaderboard to display on a page. This is a time consuming query but not time critical (it’s ok if we serve up an old version). In this case, we’d like to fetch and display the expired, cached data while issuing a command to update the data. One way to enable this is to implement your own expiry field in all the data stored. When a value is stored in memcached, it is always stored with the maximum expiry (30 days). But when it is fetched, the custom expiry field tells us if the data needs to be refreshed. This gives us the option to use the data even if it is expired.
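The custom expiry field described above could look something like the following sketch. The FakeCache class is a stand-in for a real memcached client, and the helper names set_soft and get_soft are made up for this example. The only detail taken from the post is that values are always stored with the 30-day maximum while a separate field records when they should be refreshed.

```python
import time

MAX_EXPIRY = 30 * 24 * 3600  # 30 days, the maximum expiry mentioned above


class FakeCache:
    """Stand-in for a memcached client; assumed to offer similar get/set calls."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, expire=0):
        self._data[key] = value  # a real server would enforce `expire`


def set_soft(cache, key, value, ttl, now=None):
    """Store value with our own freshness deadline and the maximum real expiry."""
    now = time.time() if now is None else now
    cache.set(key, {"value": value, "fresh_until": now + ttl}, expire=MAX_EXPIRY)


def get_soft(cache, key, now=None):
    """Return (value, is_stale); value is None on a true cache miss."""
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is None:
        return None, True
    return entry["value"], now >= entry["fresh_until"]


cache = FakeCache()
set_soft(cache, "leaderboard", ["ann", "bob"], ttl=60, now=1000)
fresh_read = get_soft(cache, "leaderboard", now=1030)
stale_read = get_soft(cache, "leaderboard", now=1100)
```

When is_stale comes back true, the caller can still render the old value and kick off a refresh in the background, which is the behaviour we wanted for the leaderboard.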

Here are a few tools to monitor and profile your memcached servers.

memcached-top - gives you a top-like interface to see the activity on all your memcached servers, such as hit rate, connections, eviction rate and network read / write.

mk-query-digest - part of Maatkit (now Percona Toolkit), this little script was originally written to analyze MySQL slow query logs. But it can also analyze a capture of your TCP traffic and help you understand your memcached traffic.

twemproxy - a memcached proxy from the folks at Twitter to help manage and pool connections. This is handy when working with a large number of memcached servers and web servers.

If you have suggestions or want to know more, be sure to reach out on twitter @ivanyuen so we can exchange ideas.

Beginner (and not so beginner) tips for Memcached - Part 1

We love Redis and use it a lot here at Wattpad. We’ve even written about it here and here. Redis is powerful, easy to use and good for a lot of things.

When it comes to in-memory data stores, the most popular is still Memcached. It’s rock-solid, predictable and dead simple to set up. We use Memcached heavily as well (65GB across 5 servers) for caching all sorts of data. If you are using Memcached or you’re considering it, here are some tips to make it work better and keep it running as you scale up. We’ve sneaked in a few tips even the pros can use.

As your system demands grow, the first bottleneck you’ll likely encounter is excessive eviction of cached data. Data that is cached is being prematurely removed before its expiry because Memcached has run out of memory. This reduces the cache efficiency, i.e. the hit rate. The simplest solution is to allocate more memory (add RAM). Another option is to shard the data across servers. Many clients have built-in key hashing so values are distributed across multiple Memcached instances. This is workable to a certain scale but it’s not infinitely scalable - more on this later. You’ll also notice that evictions are happening even though your memory utilization has not hit 100%. This has to do with the internal workings of Memcached and the way it determines slab sizes: it stores each value in a slab chosen by the size of the data, and when that slab is full it evicts data only from that slab.

When storing large values, say a DB result set in the form of an array, be careful not to exceed the value size limit (the default limit is 1MB). To store more, you can increase the maximum size config option or use compression on the stored value. Memcached itself does not provide compression but many popular clients do. It’s generally a good idea to enable compression all the time, since this makes efficient use of the in-memory storage and reduces the data sent over the network.
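As a rough illustration of what client-side compression buys you, the sketch below serializes a made-up result set to JSON, compresses it with zlib and checks it against the 1MB default. The pack and unpack helpers and the sample rows are assumptions for this example. Most clients do the equivalent for you once compression is switched on.

```python
import json
import zlib

VALUE_LIMIT = 1024 * 1024  # the default 1MB value size limit


def pack(value):
    """Serialize to JSON and compress; fail loudly if it still will not fit."""
    raw = json.dumps(value).encode("utf-8")
    packed = zlib.compress(raw)
    if len(packed) > VALUE_LIMIT:
        raise ValueError(f"value is {len(packed)} bytes even after compression")
    return packed


def unpack(blob):
    """Reverse of pack()."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical DB result set that is too large to store uncompressed.
rows = [{"id": n, "title": "story title", "status": "published"} for n in range(30000)]
raw_size = len(json.dumps(rows).encode("utf-8"))
packed = pack(rows)
print(f"raw: {raw_size} bytes, compressed: {len(packed)} bytes")
```

Repetitive data such as result sets tends to compress very well, so a value that would be rejected raw can fit comfortably once compressed.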

The second bottleneck you’ll see is the network throughput. Fetching data from Memcached is fast - sometimes so fast you forget that it’s actually doing something. Check your code to make sure you are not pulling the same data from Memcached repeatedly; don’t treat it like it’s local memory. The more you use it, the more data you need to pull in over the network. Monitor the network transfer rate closely, because it’s not always obvious when the limit is reached. Once that happens, your options are to increase the network bandwidth, or shard the data across multiple servers so the demands on each server are reduced. It’s also possible to use UDP instead of the default TCP, but you’ll need to pull some tricks.
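One simple way to avoid pulling the same key repeatedly is to remember what you have already fetched for the duration of a request. The sketch below is only an illustration: CountingCache stands in for a real client so that round trips can be counted, and RequestLocalCache is a hypothetical wrapper, not part of any library.

```python
class CountingCache:
    """Stand-in for a memcached client that counts network round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self._data.get(key)


class RequestLocalCache:
    """Remembers values already fetched during one request."""

    _MISSING = object()

    def __init__(self, client):
        self._client = client
        self._seen = {}

    def get(self, key):
        value = self._seen.get(key, self._MISSING)
        if value is self._MISSING:
            # First time we see this key in the request: go to the network.
            value = self._client.get(key)
            self._seen[key] = value
        return value


remote = CountingCache({"user:42": {"name": "mara"}})
local = RequestLocalCache(remote)
for _ in range(5):
    profile = local.get("user:42")
print(remote.round_trips)  # prints 1
```

The wrapper should be thrown away at the end of each request so that stale values do not outlive it.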

Multiget operations can come in handy, batching up requests so you can fetch them in one shot. This reduces the overhead of making multiple requests sequentially. But this can cause a problem as you scale up and shard data across servers. As the data gets distributed across different servers, it starts to negate the benefits of multiget because it can no longer get the results from a single server with a single request. This problem (dubbed the multiget hole by the folks at Facebook) puts a practical limit to how data can be sharded. To offset this, you can shard the data smarter, perhaps based on proximity of use (anyone know any hashing algorithms or tools to automate this?)
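A small experiment shows why multiget loses its edge as servers are added. The sketch below groups the keys of one multiget by owning server and counts the resulting requests. The modulo sharding, the key names and the server names are assumptions chosen to keep the example short.

```python
import hashlib
from collections import defaultdict


def server_for(key, servers):
    """Simple modulo sharding, used here only to keep the example short."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return servers[digest % len(servers)]


def plan_multiget(keys, servers):
    """Group keys by owning server; each group costs one request."""
    batches = defaultdict(list)
    for key in keys:
        batches[server_for(key, servers)].append(key)
    return dict(batches)


keys = [f"story:{n}" for n in range(100)]
plans = {}
for count in (1, 4, 16):
    servers = [f"cache{i}" for i in range(count)]
    plans[count] = plan_multiget(keys, servers)
    print(count, "servers ->", len(plans[count]), "requests")
```

With one server, the 100 keys come back in a single request. As the pool grows, the same multiget fans out to more and more servers, and each request carries fewer keys.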

We’ll cover more tips on Memcached in the next installment: using expired data, adding/removing servers, useful tools and other “gotchas” we encountered.

@ivanyuen

Leveraging Your Users as Testers - Using Beta Versions on Android Play Store

At Wattpad we are fortunate to have a very large user base, but unfortunately don’t have as much time or people as we would like to do the testing for the many different Android devices they use. The solution? Let the users be a part of the testing process by adding a beta version of our app to the Play Store.

Other blog posts and Stack Overflow answers didn’t seem to give a lot of insight on how well it worked or dismissed it as not being best practice. However, several other major apps such as Firefox and Angry Birds were using a beta version of their app on the Play Store, so we thought we might as well try it too.

It is simple enough to do. First, change the package name of the application by right-clicking on the project in Eclipse -> Android Tools -> Rename Application Package so that there can be two different APKs on the Play Store, and then sign it using the same key you use for your actual app. Second, whenever we upload a new version, we swap out our actual Wattpad launcher icon for the one at the top of this post, so that when both versions are installed on the user’s device, it is clear which one is the beta app. Finally, after uploading the APK to the Play Store, we make it clear in the description that this is a beta version and may contain bugs.

Like any other beta, it is a symbiotic relationship. The users get to try out all the new features before they are in production, and we receive valuable feedback and find bugs before a version is released to everyone. We went from ~1 tester (who also has to split their time testing for iOS and web) to more than 2000 testers in a few days. The best part is that these testers are free and love testing things out. However, we are careful not to abuse these users’ loyalty: we still go through an in-house testing process consisting of automated and manual testing on several devices with varying screen sizes, OS versions, manufacturer types and resolutions. The beta testers are more for testing things that require a large volume of use, localization problems, wider range of device coverage and for finding bugs that slip through our cracks.

Another helpful feature we added to the app was a logging tool that would log the circumstances around the crash and the stack trace, and prompt the user to email it to us on their next startup. This also gives us the advantage of getting to communicate directly with the user who experienced the crash to find out more information. Since most of the users of the beta are avid Wattpaders, they are often more than eager to help out, including sending screenshots and trying out custom builds to narrow down the problems. This logging feature is only enabled when the application package matches the beta name we use, so we don’t have to worry about remembering to turn it on and off when making different builds. 

Again we are fortunate to have such a large user base, but I don’t see why this technique can’t be employed by more companies. By clearly marking it as a beta version it should not tarnish your company’s image and is extremely helpful in finding bugs before it goes out to everyone. 

Orrie

Nov 6

True North PHP Conference

I just gave my first public technical talk over the weekend at True North PHP Conference in Mississauga. There hasn’t really been a PHP conference in Toronto in six years so we decided to sponsor the event and speak at it.

My talk was called Evolution of a Web Architecture with Amazon Web Services. I went through all the scaling milestones we hit over the last three years, starting on one server hosted at 1&1 all the way to where we are now (a few dozen instances on Amazon Web Services).

Here are the slides:

Learning from Web Engineering Blogs

Whenever I meet engineering students or web developers, a common question that comes up is - how do I improve my development skills? Of course there are many ways but one that I like to recommend is to find out what those in the industry are actually doing. Many web development teams have a blog and you’ll get a sense of the real problems and solutions by some really smart people. You’ll see who’s doing neat stuff and what tools they are using. It’s the next best thing to an internship at these fine establishments.

Etsy

Instagram

Airbnb

Foursquare

Which are your favorites?

Ivan

Zero-downtime changes in MySQL - the easy way

When I watched Baron’s webinar on zero-downtime schema change last month, I was pretty psyched. If you’ve ever tried to alter a table in MySQL with hundreds of millions of rows in a production environment, you know the challenge in dealing with table locking limitations with InnoDB. Up until now, we’ve managed our schema changes at an application level in order to avoid downtime. This was both time consuming and error prone. This tool in the Percona Toolkit promises to make this much easier so we took it for a spin.

The premise is simple. Instead of altering the table directly, the script creates a new instance of the table and applies the desired changes. Then it migrates the data over in batches to avoid locking up any particular row for an extended period. But before it does this, it also sets up a few triggers to mirror any new changes to the new table. Once all of the above is completed, then it does a table rename and drops the old table.

After using it for a month and running more than a dozen database migrations, I’m happy to report that the tool lives up to its promise (well, almost!) but not without some caveats. I’m not going to go into all the detail here, since there’s plenty of detail in the documentation, but I will highlight a few things you should definitely know:

  1. It can be a much slower process than a straight ALTER TABLE, especially if you are running on a table with a lot of rows. The overhead of the triggers in propagating the changes also adds to this load. The estimated time remaining is handy but not all that accurate.
  2. The table you’re altering must have a primary key or a unique index. This is needed so the triggers can propagate changes from the old table to the new one.
  3. If the script is interrupted, you may need to do manual cleanup before reattempting. In most cases, it means having to drop the temporary table and triggers.

Overall, the tool is well thought out and covers a lot of use cases. It’s also very defensive in the way it makes changes, which most people will appreciate especially when things go wrong. As with any changes on a production environment - which is what this tool is all about - it’s definitely a good idea to test thoroughly on a staging environment and also on smaller datasets in production before running with it. That’s what we’ve done here at Wattpad and now it’s an invaluable tool in our tool chest.

Like to work on large-scale web problems? We’re hiring!

Ivan

Store more stuff - memory optimization in Redis

If you read our last post, you’d know we’re fans of Redis - and we’re doing more with it each day. It’s really easy to get started with Redis. But unlike other popular data stores, memory optimization is much more important in Redis for a few reasons. With memcached, it’s easy to add to the cluster. You also don’t need to worry about consistent hashing, or running out of memory because data is transient anyway and you’ll just get more evictions. With MySQL, storage is cheap, it’s relatively easy to add a TB or two.

With Redis, it’s got to all fit into memory. So sooner or later, you will want to take a closer look at your data and how Redis is storing it.

What do you do when you see your Redis data size creeping up and you figure you probably have about 2 weeks before you hit the system limit? Time to upgrade, right? Not before you read this on memory optimization.

Our data was reaching the 32GB limit fast. The main data we’re storing are user news feeds, each of which is an array of ids associated with events in a person’s feed. Each user’s feed can grow to thousands of items long. For the most memory efficient storage, we want to take advantage of ziplist encoding, which Redis only uses for lists up to a configured number of entries. The default value is 512:

list-max-ziplist-entries 512

We limit the recent news feed to at most 2000 items, so we changed this setting to 2048. You can check the encoding type on the command line interface with OBJECT ENCODING <key>; you want it to say ziplist. There are other settings related to different types, so it’s probably a good idea to make sure your most common data are using this encoding. The tradeoff here is CPU: for lists that are very large, you want to keep an eye on your CPU load.

We also use hashes to store lists of IP addresses. But only hashes of integers can use ziplist encoding. So before we store an IP, we convert it to an int.
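The conversion itself is small. Here is a sketch in Python using the standard ipaddress module. The helper names are made up for this example, and it only handles IPv4 addresses.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer form."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(number):
    """Convert the integer back to a dotted-quad string for display."""
    return str(ipaddress.IPv4Address(number))


stored = ip_to_int("192.168.1.10")
print(stored)             # 3232235786
print(int_to_ip(stored))  # 192.168.1.10
```

The integer is what gets written to Redis; the string form is only rebuilt when the address needs to be shown to someone.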

So what does this all get you? We reduced memory usage by nearly 75% with just the two changes listed above.

Here’s a nice tool to help you look at your entire dataset and how the memory is used.

What’s next? Now that we’re making better use of memory, we can put more items into it. Data sharding is still necessary and predis has good key hashing to handle this. Also check out this tool from the folks at Instagram.

Ivan

Using Redis pipeline to write news feed

We’ve been running Redis at Wattpad for a while now. We started to seriously look at it when we added the user news feed. Redis seems like a practical solution for this because feeds are normally write intensive and it would also be a great way for us to take Redis out for a test drive. If there were problems, no one would miss one or two items in their news feed, right?

Redis setup is straight forward, so let’s skip the boring stuff and go right into some of the challenges we’ve had. For backups, we use the built-in snapshots to disk and then upload that to S3. For this, default settings are fine if your dataset is small.

Snapshots occur quickly and do not block read/write commands. But watch out as your data grows. Because the background save (BGSAVE) is performed by a child process created through a fork, if there isn’t enough free memory you will see this in your Redis log:

Can’t save in background: fork: Cannot allocate memory

Here’s the full explanation from the Redis FAQ:

Redis background saving schema relies on the copy-on-write semantic of fork […] A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can’t tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages […]

To fix, change the overcommit memory setting:

sysctl vm.overcommit_memory=1

As the data grew, the fork took longer and during this Redis would not respond to any incoming requests. For 20GB of data, it took several seconds to initiate the BGSAVE operation. We are using predis to talk to Redis and to handle this unresponsiveness we increased the socket timeout from the default of 5s. Obviously, waiting for a few seconds is not ideal, so we reduced the frequency at which the data is dumped to disk so it is done only once an hour. To improve this, we can use the AOF method of backup, or we can replicate the data to a slave dedicated for backups.

A great feature in Redis is pipelining, something that comes in handy when you need to write out a lot of data. This is exactly what happens when an event occurs and we need to push it out to many users’ news feeds. After pushing the event, we also check and prune old events to prevent the feed from growing without bound (i.e. get rid of old news). With this approach, each new event will make at most two round trips to Redis, no matter how many users subscribe to the event.

$pipe = $redis->pipeline();     // start a new pipeline
foreach ($user_ids as $id) {
    // push to user's news feed
    $pipe->lpush(Feed::key($id), $event_id);
}

// result from each lpush is the number of items after the operation
$push_result = $pipe->execute();

$pipe = $redis->pipeline();     // start another pipeline
foreach ($user_ids as $i => $id) {
    if ($push_result[$i] > 10000) {
        $pipe->ltrim(Feed::key($id), 0, 9000); // trim 1000 items
    }
}
$pipe->execute();

We’ll be using Redis more but we also need to work on improving the backup process as well as adding data sharding. We haven’t found a great solution for sharding yet, although the folks at Craigslist are doing this and it seems simple enough. If anyone has other suggestions, we’d love to hear them.

Ivan Y