overview

This post chronicles my journey of building a cryptocurrency trading bot on AWS. The core code for this project is not and will not be published (though I do share lots of code on my GitHub account). Instead, this post details my approach, the roadblocks I hit, and the things I learned.

table of contents

This is a long post. Here are its sections:

  • ruthless prioritization
  • keep it lean
  • how to bleed money
  • profiling saves money
  • code analysis
  • unit tests
  • “good enough” has its limits
  • go interfaces
  • insert breakpoint
  • conclusion
  • appendix

ruthless prioritization

Over the years, I’ve taken on more projects than I can count. And I’ve finished maybe 2 of them. So for this project, I set a well-defined goal along with a deadline.

Goal: Have a running bot that is automatically buying and selling by next Friday at 7:00pm.

I set a hard deadline for myself. This way, I forced myself into making decisions that were Good Enough. I had a bunch of stuff going on after Friday, so I needed to finish it by then or else development velocity would take a huge hit. It forced me to follow the corollary to the Good Enough principle: the Pareto principle (aka: the 80/20 rule). I needed to spend 20% of my time/energy gaining 80% of the results.

I did not hit my goal, but I came really close. Here is the general process. Note that the first few days came before I set my goal: I was still trying to get some time off of work so I could dive into this in a focused way, so I only spent a few hours on the project during each of the first few days.

  • Day 1: Design. PlantUML diagrams. Define each component by its purpose.
  • Day 2: Refine design. Investigate deployment strategies.
  • Day 3: Socialize the idea with potential customers. Start massaging a previous project (one that notified me of potential opportunities to buy) so it could be used to consume data for this project.
  • Day 4 (Saturday): Spin up Redis & RDS. Adjust code to use these.
  • Day 5: Finish up “receiver”, “persister”, and “cleaner” (three components of the design). Let the bot run overnight.
  • Day 6: Wake up to everything having crashed due to poor resource management. Realize that resource management is hard. Really hard. Debug code. Rearchitect implementation. Let it run overnight.
  • Day 7: More bugs. Debug, fix more bugs. Start work on implementing 3 trading indicators (Simple Moving Average, Exponential Moving Average, and Bollinger Bands).
  • Day 8: More work on trading indicators. Minor tweaks to resource usage.
  • Day 9: Finish trading indicators. Look into how to use them to assist in deciding whether or not to make a trade. Realize my personal trading strategy has nothing to do with these indicators. Come up with a much simpler trading strategy that uses no indicators.
  • Day 10: Remove indicator-related code (ie: throw out a ton from days 7-9; this was a massive hit). Implement simple trading strategy & let it run overnight.

At this point, the goal had not been reached. I had my trading strategy implemented, but now I was facing some serious resource-consumption issues. I had to bump one server up from a t2.medium to an m4.large since my code was not efficient at all. Even then, it was taking way too long to run.

The other problem was that I had run out of runway for my 10-16 hour days spent on this project. I needed to go back to my full-time job, so velocity took a huge hit.

  • Week 2 highlights:
    • Profile a long-running component and reduce execution time from 3 minutes to 30 seconds.
    • Add in wrappers to work with the Coinigy REST API.
    • Add in logic to create buy & sell limit orders.
    • Make some hardcoded values configurable via environment variables (see https://12factor.net/config).
  • Week 3 highlights:
    • Test automatic SELL orders after manually making BUY orders via the Coinigy UI.
    • Let the bot run without making real trades. Verify it would be making trades that I want it to make. In other words, validate the correctness of the implementation of the algorithm.
    • [day 18] Enabled real BUY orders. This was the first day it went fully live with both BUY and SELL orders being automatically created. It earned me 0.00009 BTC in its first waking hour. That’s a whole 37 cents!
    • Lots of debugging related to the Coinigy/Bittrex/Poloniex interfaces: 502/503 errors, IP-safelist errors, etc. My code was buggy enough that I wouldn’t leave it running unattended for more than a few minutes. Eventually, I just turned off the trading component altogether.
    • Tweak the logic around when to buy/sell/cancel stale buy orders, etc.
    • Run code analysis tools against the code. Find out it is pretty bad.
    • U. N. I. T. T. E. S. T. S.
    • Shut down all AWS resources while writing unit tests.
  • Week 4
    • More unit tests. And then a few more. Added some unit tests. After that, I wrote some tests for small units of code, along with some unit tests.
    • Refactored some socketcluster code to communicate via channels. This rearchitecting effort improved the design and therefore removed a bunch of bugs caused by race conditions.
  • Everything else (months of work)
    • Re-wrote the whole thing using a much more modular design.
    • Wrote far simpler tests for core portions of the code.
    • Reduced the complexity of the code and improved its extensibility, maintainability, and modularity.
    • Reduced the overall bill.

keep it lean

Given that this was a new project and I wanted to learn a technology or two along the way, I thought it might be interesting to investigate some of the hip new tech available today. I time-boxed an investigation into deployment strategies. I investigated:

  • Docker
  • EC2 Container Service & EC2 Container Registry (ECS/ECR)
  • Kubernetes

After looking at all the steps involved, weighing the pros and cons of each, watching some YouTube videos (including this great one by Kelsey Hightower), and reading some papers and write-ups on best practices, I decided to go forward with this solution:

  • rsync

I realized that Go program distribution is about as simple as it gets: you compile a binary and copy it. rsync works great for that. Sure, it takes a few seconds to transfer a few megabytes of data, but I was willing to accept that in the short term while trying to get a solution out the door within a week. Here is a snippet from my development deploy script:

# Get the server IP.
analyzer_ip=$(aws ec2 describe-instances \
                  --filters Name=tag:Name,Values=analyzer-dev \
                            Name=instance-state-name,Values=running \
                  | jq -r ".Reservations[].Instances[0].PublicIpAddress")

# Select which utilities to upload.
analyzer_utilities="trade"

# Build the utilities and upload them.
for util in $analyzer_utilities; do
    pushd cmd/$util
    GOOS=linux go build
    rsync -avzhe ssh . ubuntu@$analyzer_ip:/home/ubuntu/$util
    popd
done

For my other projects, I’m using Concourse CI. It has a really slick set of features that I absolutely love. It’s also totally unnecessary when developing and rolling out updates constantly (meaning every 1 or 2 minutes). For the initial burst of energy in setting up this trading bot, it was overkill.

With all that said, I then traveled for a work event, and the comfortably fast Internet connection I’d been using for rsync disappeared. The fastest dev process became using AWS-to-AWS resources as much as possible, which meant finally building CI into my process with a Concourse pipeline. The setup was pretty straightforward.

All I had to do from that point to test a new code change was:

  1. Commit and push.
  2. Wait for a build to complete (usually a matter of seconds).
  3. SSH onto the server and run a script that pulled down the latest release and started it.

The last step could probably be automated, but it was Good Enough.

how to bleed money

Running in the cloud costs money. I used AWS for this project. To get an estimate of the money I would end up paying, I used this fantastic site: ec2instances.info

For my original design, this is the cost breakdown:

  • RDS: $12.410 monthly (db.t2.micro on demand)
  • EC2 instance #1: $73.000 monthly (m4.large on demand)
  • EC2 instance #2: $16.790 monthly (t2.small on demand)
  • ElastiCache: $24.888 monthly (cache.t2.small)
  • Total Expenses: $127.088 monthly

This bill was not pleasant. Later in this post, I describe a bit of the refactoring that took place. As of publishing this post, I now have the following cost estimate:

  • EC2 instance #2: $16.790 monthly (t2.small on demand)
  • Total Expenses: $16.79 monthly

profiling saves money

One component of the trading bot runs every minute. It looks at the available data and tries to answer the question: “Should I make a trade?” So it’s kind of important.

The problem I encountered was that this component, which runs every minute… took about 3 minutes to run. Not ideal.

I tried to change my code a bit in certain areas that I thought might be taking a long time, such as:

  • disk writes
  • reads/writes to/from Redis (perhaps network latency was an issue)
  • reads from MySQL

Nothing really improved. I checked CPU usage on both ElastiCache and RDS, but it seemed pretty low. I was running out of ideas.

Then I remembered a thing that hallowed neckbeards from the mount of High Hrothgar often talk about: CPU profiling. Google showed me this post from the Go blog: Profiling Go Programs
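My original profiling hookup isn’t shown in this post, so here is a minimal sketch of one way to capture a CPU profile with runtime/pprof (for a long-running service, the net/http/pprof handler is another option):

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Write a CPU profile to disk for later inspection with `go tool pprof`.
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    run() // the program's real work goes here
}

func run() {}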

I ran go tool pprof and here are the results:

(pprof) top5
120.63s of 186.43s total (64.71%)
Dropped 291 nodes (cum <= 0.93s)
Showing top 5 nodes out of 64 (cum >= 7.08s)
      flat  flat%   sum%        cum   cum%
    63.30s 33.95% 33.95%    138.86s 74.48%  strings.Map
    29.77s 15.97% 49.92%     36.68s 19.67%  runtime.mallocgc
    10.47s  5.62% 55.54%     10.47s  5.62%  unicode/utf8.EncodeRune
    10.01s  5.37% 60.91%    154.64s 82.95%  main.DoEtpCombo.func1
     7.08s  3.80% 64.71%      7.08s  3.80%  unicode.ToLower

Translation: this anonymous function (main.DoEtpCombo.func1) took around 83% of the execution time:

func(t *coinigy.Trade) bool {
    return strings.ToLower(t.Exchange) == etp.Exchange &&
        strings.ToLower(t.Label) == etp.TradePair
}

I can 100% confirm that I would never have considered that this function was the culprit of about 83% of the execution time of my program. After Googling a bit for a functional equivalent (string comparison while ignoring case), I came up with this replacement code:

func(t *coinigy.Trade) bool {
    return strings.EqualFold(t.Exchange, etp.Exchange) &&
        strings.EqualFold(t.Label, etp.TradePair)
}

And here are the updated results:

(pprof) top2
29.74s of 54.62s total (54.45%)
Dropped 251 nodes (cum <= 0.27s)
Showing top 2 nodes out of 102 (cum >= 30.29s)
      flat  flat%   sum%        cum   cum%
    25.67s 47.00% 47.00%     26.22s 48.00%  strings.EqualFold
     4.07s  7.45% 54.45%     30.29s 55.46%  main.DoEtpCombo.func1

func1 originally took 154.64s, but after the update, it took 30.29s. In other words, I saw an 80% drop in time to complete this one function. That’s decent, but when I think about paying AWS $36.50/month to do string processing, I don’t feel so great. I tried to make it even better.

At this point, I reflected on how I should have paid more attention to my data structures and algorithms professor. Then it occurred to me that I never needed that value to be uppercase in the first place, so I simply converted it to lowercase as soon as I saved it.
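In other words, something like this at save time (a sketch; the surrounding persistence code is omitted):

// Normalize once, at write time, so the hot path never case-folds.
t.Exchange = strings.ToLower(t.Exchange)
t.Label = strings.ToLower(t.Label)

That lets the hot filter drop the case-insensitive comparison entirely:

func(t *coinigy.Trade) bool {
    return t.Exchange == etp.Exchange && t.Label == etp.TradePair
}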

Hm. Still not the best results:

(pprof) top2
22.29s of 50.27s total (44.34%)
Dropped 228 nodes (cum <= 0.25s)
Showing top 2 nodes out of 107 (cum >= 22.69s)
      flat  flat%   sum%        cum   cum%
    18.30s 36.40% 36.40%     18.30s 36.40%  runtime.memeqbody
     3.99s  7.94% 44.34%     22.69s 45.14%  main.DoEtpCombo.func1

I shaved off 8 seconds, but I still wasn’t happy. At this point I was desperate, so I added in a simple hash value of the combined strings. In my mind, string comparisons should take longer than plain number comparisons, since a string match can require one comparison for every matching character. However, I’d rather test that theory than just guess.
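My hashing code isn’t shown in this post, so here is a sketch of the idea using hash/fnv (the comboKey helper and ComboKey fields are hypothetical stand-ins):

// comboKey reduces an exchange/trade-pair combination to a single uint64.
// Computed once at write time, it lets the hot path compare integers.
func comboKey(exchange, tradePair string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(exchange))
    h.Write([]byte{0}) // separator, so "ab"+"c" can't collide with "a"+"bc"
    h.Write([]byte(tradePair))
    return h.Sum64()
}

func(t *coinigy.Trade) bool {
    return t.ComboKey == etp.ComboKey
}

Here are the results: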

(pprof) top1
9.05s of 37.13s total (24.37%)
Dropped 188 nodes (cum <= 0.19s)
Showing top 1 nodes out of 146 (cum >= 9.05s)
      flat  flat%   sum%        cum   cum%
     9.05s 24.37% 24.37%      9.05s 24.37%  main.DoEtpCombo.func1

9.05s down from 154.64s. That translates to about a 94.15% drop. Not bad dot jay peg.


At this point, the program completes in just over 30 seconds. For something that needs to execute just once a minute, that’s Good Enough for me.

I’d also like to point out a couple of other features of the Go profiling tool that made later investigations much easier. You can see “hot spots”, where the CPU spends a long time on particular lines of code, using the list and weblist commands:

(pprof) list FillCandleGaps
Total: 22.19mins
<snip>
     670ms  15.53mins (flat, cum) 70.00% of Total
         .          .     64:}
         .          .     65:
         .          .     66:// InsertCandle inserts a candle at the specified index.
         .          .     67:// https://github.com/golang/go/wiki/SliceTricks#insert
         .          .     68:func (m *Market) InsertCandle(x Candle, i int) {
     170ms  15.35mins     69:   m.Candles = append(m.Candles[:i], append([]Candle{x}, m.Candles[i:]...)...)
         .          .     70:}

The weblist FillCandleGaps command renders the same annotated source in the browser. In the weblist view, you can even click on individual lines of code and have it show you the specific assembly instructions that are executed for that line. This can be super useful.

Lastly, here’s a tip that could have saved me some time sifting through all the output I showed above: you can sort the output by cumulative time spent using the -cum flag. For example:

(pprof) top10 -cum
Showing nodes accounting for 2865.99s, 19.64% of 14592.51s total
Dropped 1097 nodes (cum <= 72.96s)
Showing top 10 nodes out of 148
      flat    flat%     sum%        cum   cum%
         0       0%       0%  13130.43s 89.98%  main.makeBuyLoop /<snip>/main.go
     0.07s 0.00048% 0.00048%  12974.68s 88.91%  <snip>.Buy /<snip>/buyer.go
     0.05s 0.00034% 0.00082%  11678.52s 80.03%  <snip>.buyMarket /<snip>/buyer.go
   372.63s    2.55%    2.55%  11677.85s 80.03%  <snip>.FillCandleGaps /<snip>/market.go
     0.06s 0.00041%    2.55%   6317.35s 43.29%  sort.Slice /<snip>/sort.go
    14.62s     0.1%    2.65%   6314.70s 43.27%  sort.quickSort_func /<snip>/zfuncversion.go
   618.14s    4.24%    6.89%   6270.08s 42.97%  sort.doPivot_func /<snip>/zfuncversion.go
    73.71s    0.51%    7.40%   3510.88s 24.06%  runtime.systemstack /<snip>/asm_amd64.s
  1713.08s   11.74%   19.14%   2966.48s 20.33%  <snip>.FillCandleGaps.func1 /<snip>/market.go
    73.63s     0.5%   19.64%   2799.63s 19.19%  runtime.mallocgc /<snip>/malloc.go

In this way, you can easily spot your biggest bottlenecks.

code analysis

My code was so bad. Like, real bad. I forgot to capture before/after output to share, but let’s just say there were lots of issues. To find them, I ran a few tools.

go tool vet .

go vet was able to find a race condition that I can honestly say I never would have found just by looking at the code.

gofmt -l .

gofmt found nothing because I have an emacs plugin that auto-formats on every save. That said, I will keep it in my build scripts just in case the plugin misses something.

go get github.com/fzipp/gocyclo; gocyclo -over 10 .

gocyclo surfaced a bunch of useful findings, which forced me to tackle some functions that had become way too unwieldy. I even found a copy/paste bug that had been making me scratch my head for days.

A side effect of having less complex functions is that they are much easier to test using unit tests. More on that later.

go get github.com/golang/lint/golint; golint ./...

golint found a bunch of great issues for code quality, but honestly, it was way too noisy for me to tackle all at once. I came back to this during a huge refactor and it became much more manageable.

go get github.com/gordonklaus/ineffassign; ineffassign .

ineffassign found some valid issues, which were fairly easy to fix.

gometalinter

And now we come to my favorite checker that I only found after writing more unit tests than I ever thought I’d write: gometalinter

go get github.com/alecthomas/gometalinter
gometalinter --install
gometalinter ./...

It found one of my favorite bugs courtesy of megacheck:

// Gather all the channels that are sent in through the channel channel. Yo dawg.
done := false
for !done {
	select {
	case c := <-channelCh:
		if c.IsLoaded() {
			channels = append(channels, c)
		} else {
			log.Println("Channel is not loaded: " + c.String())
		}
	case err := <-errCh:
		return err
	case <-doneCh:
		done = true
		break // warning triggered here
	default:
	}
}

This produced the following warning:

coinigy/messenger.go:133:4:warning: ineffective break statement. Did you mean to break out of the outer loop? (SA4011) (megacheck)

In case the warning isn’t clear, it is saying that the break statement was breaking out of the select statement, but not breaking out of the for loop. I wanted to break out of the for loop, so this code was incorrect.

I had no idea this was even an issue. I thought my code was working this whole time. I have absolutely no idea how it worked with this bug present.

It turns out there is a much more elegant way to break out of this type of loop. Stack Overflow is your friend.

// Gather all the channels that are sent in through the channel channel. Yo dawg.
LoadChannelLoop:
for {
	select {
	case c := <-channelCh:
		if c.IsLoaded() {
			channels = append(channels, c)
		} else {
			log.Println("Channel is not loaded: " + c.String())
		}
	case err := <-errCh:
		return err
	case <-doneCh:
		break LoadChannelLoop
	}
}

If I were doing a code review, there is no way I would have spotted that.

Another feature that I absolutely loved is brought to you by the errcheck linter. It will detect when error return values are not checked. Let’s just say that I did not check for a bunch of errors. Again, I’ll say I have no idea how my code was able to run at all.
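For a taste of what it catches, errcheck flags any call whose error result is silently dropped. A contrived sketch (not from the bot):

import "encoding/json"

// loadConfig ignores the error returned by json.Unmarshal, so on bad
// input it silently returns a nil map.
func loadConfig(data []byte) map[string]string {
    var cfg map[string]string
    json.Unmarshal(data, &cfg) // error discarded: this is what errcheck flags
    return cfg
}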

unit tests

Around the end of week three, I was nearly done with the Good Enough version of the bot (mind you, still two weeks past my self-assigned due date). However, I was still encountering very strange bugs. Sometimes the selling price was automatically set too high; sometimes my available balance didn’t quite refresh properly; sometimes timestamps weren’t being parsed properly because objects hadn’t been initialized yet. Lots of weird bugs.

So I finally gave in and wrote some unit tests. At first, I was going to use testify because it looked decent, but then after a bit of Googling, I came across this excellent blog post by Dan Mullineux. He provides a great argument for using the standard testing library that comes with Go. So that’s what I did.

However, when writing my first unit test (something for my socketcluster implementation), I thought, “How would I mock out the Coinigy websocket server?” Google brought me to this slide.

Go eschews mocks and fakes in favor of writing code that takes broad interfaces. For example, if you’re writing a file format parser, don’t write a function like this:

func Parse(f *os.File) error

instead, write functions that take the interface you need:

func Parse(r io.Reader) error

(An *os.File implements io.Reader, as does bytes.Buffer or strings.Reader.)
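The payoff is that tests need no mocks at all; any in-memory io.Reader will do. A quick sketch against the Parse function from the slide above (the input string is made up):

import (
    "strings"
    "testing"
)

func TestParse(t *testing.T) {
    // No temp files, no mock server: strings.Reader satisfies io.Reader.
    r := strings.NewReader("hypothetical input")
    if err := Parse(r); err != nil {
        t.Fatal(err)
    }
}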

After looking up the word “eschew”, I realized that since I had written zero interfaces, I was going to have a bad time with this.


I’d known that writing testable code leads to better implementations and highly cohesive, loosely coupled code, but I ignored it until I started encountering really hard-to-debug issues. It was finally time to face the music.

I had to admit that my current idea of Good Enough was not quite… good enough. I had to step up my definition. So I instituted an immediate feature freeze and started a massive refactoring. This involved:

  • adding thorough unit tests that validate return values, the state of objects, and the specific error messages that are raised (a sketch of this style follows this list)
  • hopefully, reaching 100% code coverage of the most important portions of the project
  • using golint to slowly, but surely, document all my code for my future self
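To give a feel for that testing style, here is a small table-driven test using only the standard library. ParsePair and its error text are hypothetical stand-ins, not code from the bot:

package trading

import (
    "errors"
    "strings"
    "testing"
)

// ParsePair is a hypothetical stand-in that normalizes a trade pair.
func ParsePair(s string) (string, error) {
    if s == "" {
        return "", errors.New("empty trade pair")
    }
    return strings.ToLower(s), nil
}

func TestParsePair(t *testing.T) {
    tests := []struct {
        in      string
        want    string
        wantErr string
    }{
        {in: "BTC/LTC", want: "btc/ltc"},
        {in: "", wantErr: "empty trade pair"},
    }
    for _, tt := range tests {
        got, err := ParsePair(tt.in)
        if tt.wantErr != "" {
            if err == nil || err.Error() != tt.wantErr {
                t.Errorf("ParsePair(%q) error = %v, want %q", tt.in, err, tt.wantErr)
            }
            continue
        }
        if err != nil {
            t.Errorf("ParsePair(%q) returned unexpected error: %v", tt.in, err)
            continue
        }
        if got != tt.want {
            t.Errorf("ParsePair(%q) = %q, want %q", tt.in, got, tt.want)
        }
    }
}

The table-driven shape makes it cheap to pile on a new case whenever a bug is found.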

“good enough” has its limits

I applied the Good Enough principle liberally. Far too liberally. When all of the components I wrote were written with this mindset, the end result was waaaay less than Good Enough. The subpar quality of each component compounds multiplicatively into a very low-quality project. Eventually, this led to many bug hunts, many panics, many refactoring commits, and some heavy rearchitecting of APIs. I think a lot of that was due to my lack of experience with proper Golang API design, but other portions of it were due to me using Good Enough as an excuse for low-quality work. This is something I’m still mulling over.

go interfaces

From http://openmymind.net/Things-I-Wish-Someone-Had-Told-Me-About-Go/

Passing a focused interface to a function helps ensure that you aren’t doing too much. It’s one of the best tools I’ve seen that reduces the amount of refactoring you have to do.

There is a concept in the software world called “SOLID” design. Dave Cheney gave an excellent talk called “SOLID Go Design”; you can read the blog post and watch his presentation. I wouldn’t do it justice if I tried to summarize it, but I will pick out this tidbit, itself a quote from someone else, which I have found incredibly useful for writing testable code:

A great rule of thumb for Go is accept interfaces, return structs.

– Jack Lindamood

I highly recommend doing your own investigation into this concept. While it is a very short portion of this blog post, it had the most impact on the quality of my code overall.
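To make the idea concrete, here is a minimal sketch (every name in it is hypothetical, not from my bot):

// TradeSource is the narrow interface that consumers accept.
type TradeSource interface {
    NextTrade() (Trade, error)
}

type Trade struct {
    Exchange, Pair string
    Price          float64
}

// Analyzer is the concrete struct that gets returned.
type Analyzer struct {
    src TradeSource
}

// NewAnalyzer accepts any TradeSource: a live websocket client in
// production, or a canned stub in tests.
func NewAnalyzer(src TradeSource) *Analyzer {
    return &Analyzer{src: src}
}

Because Analyzer depends only on a one-method interface, swapping in a test double requires no mocking framework at all.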

insert breakpoint

I would like to take a moment to mention that, as I write this sentence, it has been over 4 months since I last updated this blog post. I took to heart the bit about writing testable code and using interfaces (hint: those two things are two sides of the same coin). The Coinigy API proved to be horribly unstable, non-deterministic, and just a general pain to use, so I decided to use an exchange’s API directly. For this, I chose Bittrex. Their API is phenomenal. I re-wrote my entire bot and designed it after buying Design Patterns and reading the patterns that exactly matched issues I was facing. My code is now maintainable, testable, modular, and extensible. It is vastly improved from my original rush to get the code out the door.

I have used poorly designed and well-designed libraries. I never was able to identify why some libraries seemed better than others until I started learning about Design Patterns. I still have lots to learn and will continue making mistakes for years to come, but hopefully, these design patterns will make things slightly less awful.

conclusion

Thanks for reading! I hope this has been of use to you in one way or another. I am nowhere near done with my project, but it has been positively transformed due to the lessons I learned along the way, and I’m sure I’ll learn many more lessons.

Hopefully, something from this blog post helps you with your next adventure with Golang!

appendix

This section contains some random things that didn’t really fit well in the narrative above.

blocking in go

The following two blocks of code both stop a goroutine from ever proceeding.

select {}
for {}

However, the second block busy-loops and pins a CPU core at 100%. Throwing a time.Sleep(whatever) into the loop would help, but it’s not as elegant as a simple select {} statement.
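The place this comes up is keeping main alive while other goroutines do the work. A minimal sketch:

package main

import (
    "fmt"
    "time"
)

func main() {
    go func() {
        for {
            fmt.Println("tick")
            time.Sleep(time.Second)
        }
    }()

    // Park main forever without burning CPU. (If every other goroutine
    // also blocks, the runtime panics with a deadlock error instead.)
    select {}
}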

error handling

For a while, I had used my own error handling wrapper functions that printed an ongoing stack trace. They were extremely simple:

import (
    "log"
    "os"
    "runtime"
    "strings"
)

func traceEnabled() bool {
    v := os.Getenv("TRACE")
    return strings.ToLower(v) == "true"
}

// Error logs the calling function's name, the line number wherever Error()
// is called, and the supplied error message. This is most useful when
// creating a trail of errors across multiple goroutines.
func Error(err error) {
    if !traceEnabled() {
        return
    }

    pc := make([]uintptr, 15)
    n := runtime.Callers(2, pc)
    frames := runtime.CallersFrames(pc[:n])
    frame, _ := frames.Next()
    log.Printf("%s:%d: %s\n", frame.Function, frame.Line, err.Error())
}

This worked really well for a while. I would simply call this function within the typical if err != nil block like so:

if err != nil {
    Error(err)
    return err
}

That’s straightforward enough. I could have even spruced it up to return the error so I could simply do something like this:

if err != nil {
    return Error(err)
}

However, after seeing code in multiple unrelated projects that is very similar to the example above, I investigated further. Many of them used this project: https://github.com/pkg/errors

It turns out that Dave Cheney wrote this package. To use it, simply import his errors package and use this little snippet, providing whatever context you want:

if err != nil {
    return errors.Wrap(err, "simple error description")
}

If the root error is wrapped multiple times using this code and you print the stack trace (described below), you will get multiple stack traces: one for each point where the error was wrapped. At first, I didn’t like all the extra output, but then I found that it significantly improved the information available to me when diagnosing bugs.

To get the stack trace, you simply call this on the bubbled up error:

fmt.Printf("%+v", err)

It will print the stack trace for each error that was wrapped (so there may be multiple, given that err values tend to go through multiple if err != nil clauses).

For more information, check out Dave’s blog post and presentation.

signalr

After my massive refactor, I open sourced one component of my bot (https://github.com/carterjones/signalr). This is a Golang implementation of a portion of the client side of the SignalR specification (Bittrex uses SignalR for their WebSocket API).