
A Pilgrim’s Progress #4: Panda Series

This is the fourth in a series of posts charting the progress of a programmer starting out in data science. The first post is A Pilgrim’s Progress #1: Starting Data Science. The previous post is A Pilgrim’s Progress #3: NumPy.

I’m trying something new here. These posts are written in Jupyter, which is an extremely handy way to intermingle text and executable code. It comes with Anaconda, which is the best way to get everything going if you’re starting out. For the first couple of posts I cut and pasted the material over to WordPress. This time I downloaded the Jupyter file as HTML and pasted it in. That is 100x faster, but the result is painful to edit once pasted in, so it’s far from a perfect solution. Any ideas?

Pandas are insanely versatile and capable of far more than I’ve covered in this already excessively long set of notes. At best this is a way to get an idea of how they work and a quick tour of what they look like in use.


A Pilgrim’s Progress #3: NumPy

This is the third in a series of posts charting the progress of a programmer starting out in data science. The first post is A Pilgrim’s Progress #1: Starting Data Science. The previous post is A Pilgrim’s Progress #2: The Data Science Tool Kit.

What Is NumPy?

NumPy is a library of high-performance arrays for Python. After this I’m going to mostly call it numpy because that’s the name of the package you import. Whatever we call it, numpy supports creating and manipulating arrays of any number of dimensions and the ability to easily reshape them and slice them in complex ways on the fly.

The elements of any numpy array can be accessed in a variety of ways. You can access single elements, of course, but there is a powerful syntax for accessing all sorts of rectilinear slices in one or more dimensions. We’ll look at some of that below.
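To make the slicing syntax concrete, here is a minimal sketch. The array `grid` and its contents are my own illustration, not taken from the full post; the indexing forms are standard numpy.

```python
import numpy as np

# A 3x4 array holding 0..11, built by reshaping a flat range.
grid = np.arange(12).reshape(3, 4)

single = grid[1, 2]         # one element: row 1, column 2
row = grid[1, :]            # an entire row
column = grid[:, 2]         # an entire column
block = grid[0:2, 1:3]      # a 2x2 rectilinear slice (rows 0-1, columns 1-2)
every_other = grid[:, ::2]  # every second column, using a step of 2

print(single)
print(block)
```

The same colon notation extends to arrays of any number of dimensions, with one index or slice per axis.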

As the name implies, numpy is designed to support mathematical computing, and is thus packed with convenient features for operating on data as an array or matrix.

Every programmer is used to iterating over the elements of an array using a loop or an iterator, which is a concept that is easily extended to using nested loops to iterate over multi-dimensional structures. Numpy takes a higher-level approach, emphasizing applying operations to an entire array, rather than merely using an array as a repository for data that will be explicitly operated on by loops in your code. Functionally, the two approaches are of equal power–there’s still a loop going on within numpy, but in practice, applying functions to data structures results in simpler, cleaner code that’s easier to understand. The way I look at it is, code you don’t have to write has the fewest bugs, so the less code the better.
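A small side-by-side sketch of the two approaches, using a made-up temperature table (the data and variable names are assumptions for illustration only). Both versions compute the same result; the whole-array version simply leaves the looping to numpy.

```python
import numpy as np

# Hypothetical readings in Celsius: 2 days x 3 sensors.
temps_c = np.array([[12.0, 15.5, 9.0],
                    [20.0, 18.5, 22.0]])

# Loop style: visit every element explicitly with nested loops.
temps_f_loop = np.empty_like(temps_c)
for i in range(temps_c.shape[0]):
    for j in range(temps_c.shape[1]):
        temps_f_loop[i, j] = temps_c[i, j] * 9 / 5 + 32

# Array style: one expression applied to the entire array at once.
temps_f = temps_c * 9 / 5 + 32

# Reductions work the same way, e.g. the mean of each row.
row_means = temps_f.mean(axis=1)

print(np.allclose(temps_f, temps_f_loop))
```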


A Pilgrim’s Progress #2: The Data Science Tool Kit

This is the second post about becoming a computer scientist after a career in software engineering. The first part may be found here.

Only a student would think that software developers mostly write computer programs. Coding is a blast–it’s why you get into the field–but the great majority of professional programming time isn’t spent coding. It goes into the processes and tools that allow humans to work together, such as version control and Agile procedures; maintenance; testing and bug fixing; requirements gathering; and documentation. It’s hard to say where writing interesting code is on that list. Probably not third place. Fourth or fifth perhaps?

Image: Linear PCA vs. nonlinear principal manifolds (Андрей Зиновьев = Andrei Zinovyev)

Fred Brooks famously showed that the human time that goes into a line of code is inversely-quadratic in the size of the project (I’m paraphrasing outrageously). Coding gets the glory, but virtually all of the significant advances in software engineering since Brooks wrote in the mid-1970’s have actually been in the technology and management techniques for orchestrating the efforts of dozens or even hundreds of people to cooperatively write a mass of code that might have the size and complexity of War and Peace. That is, if War and Peace were first released as a few chapters, and had to continue to make sense as the remaining 361 chapters came out over a period of many months or even years. Actually, War and Peace runs about half a million words, or 50,000 lines, which would make it quite a modest piece of software. In comparison, the latest Fedora Linux release has 206 million lines of code. A typical modern car might have 150 million. MacOS has 85 million. In the 1970’s four million lines was an immense program.


A Pilgrim’s Progress #1: Starting Data Science

This is the first of what I hope will be a series of many posts documenting a pilgrim’s progress from programming to data science.

First of all, let’s talk about the name. It is almost a rule that anything called <something>-science isn’t a science and that will hold here. Science is defined as “an intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.”

Nothing about data science fits that definition. It’s in the same boat with disciplines like library science, political science, management science, rocket science, and computer science that use mathematics and/or science to do interesting things but aren’t science themselves.

While the sciences study the world itself, data science studies the techniques for understanding the world through data. Data science is applied to some concrete field, be it science, politics, or advertising, but you wouldn’t say it’s “advertising science.” Of course not–it is its own thing. Trying to fit it in under the heading of science is what philosophers call a category error, like considering the manufacturing of firearms to be a branch of wildlife management.


Not A Review: System76

I don’t usually do product reviews. Not that I’m against them, but this isn’t that kind of blog. I don’t buy or use a wide enough variety of computing equipment to have a valuable opinion. The truth is, as long as my personal computer is fast enough, I don’t have much reason to care about the nuances of processor tradeoffs, bus speeds, and the subtleties of graphics cards. Developing code actually isn’t very demanding in terms of hardware and when the code I write is deployed, it’s usually on swarms of anonymous generic servers managed by people I’ve never met.

What does matter at all levels is the operating system. The OS is the real computer. From inside a computer program you normally cannot see the hardware (unless you’re in a very esoteric field of programming.) All your code sees is the pretty face the OS puts on it. Still less can a user see the hardware. As long as there’s plenty of CPU and disk, the main thing you are aware of is the windowing system and the terminals. You occasionally have to do things that look like they involve hardware, like mounting disks, but even then, what you see is a layers-deep idealization provided by the operating system.


Go Go Go

I’m new to Go—just a few months in. I’ve spent a lot more time with Java, C++, even Python, but the Gopher is an interesting critter so far.  It’s not just a better version of <your favorite language here>.

Every language is a commitment to a particular way of looking at programming but rarely more so than with Go, which is often politely described as “opinionated.”

Some Informal History

Go’s ancestor, the C language, which was invented in 1972, spelled the end of the age of assembler for systems programming. In the words of its author, Dennis Ritchie, C is portable assembler. Until C came along, you had to hand-port operating systems and other systems code to each platform you wanted to run on. There weren’t any IDE’s—it was just you and your editor back then.


A vast superstructure of software development tooling has evolved since C was young but less of it than one might imagine is about telling machines what to do.

There used to be a thing called “the software crisis” back in the 80’s and 90’s. Younger programmers may never have heard of it, but back then the majority of large projects were said to fail as the size of the problems we faced began to outstrip our ability to write commensurately large software.

Yet today, only the occasional overblown or ill-conceived project fails and success is the expectation. What happened?

It’s not the languages that have changed—I was a CS student in the late eighties and I have yet to encounter a significant language feature that did not already exist when I was an undergraduate.

It is tools and management techniques to facilitate people working together that beat down the software crisis, not high-powered syntax. Widespread adoption of Object Orientation allowed data models of unprecedented size to be developed and managed by large teams as they evolved over time. Open-source created a universe of high-quality computational Lego that let individuals or small teams produce incredibly powerful systems with little more than glue code. The same goes for Agile and other management practices, powerful source control, documentation, tools like Maven, Jenkins, and Jira, continuous integration systems, and of course, Unix for everyone. These things are all about people, not machines. If you left it up to the machines, they’d write everything in C for portability and they’d skip the tabs and newlines.




Haxe is a most unusual language. So far, nobody I’ve enthused about it to has heard of it, which is a shame. I’m loving it. But before jumping into it, I want to give you some setup.

We had a problem. The company I’m with wants to flush data from hundreds of different kinds of IoT devices to the AWS Cloud. There are also Linux-powered gateways, a ton of code on the Cloud side, plus Web browser applications. Among them, they use Python, C/C++, Java, JS, and PHP, and run on Linux, Mongoose, Microsoft, OSX, Android and even bare metal (the embedded controller-based devices, e.g. Arduino and ESP32).

Despite all these exotica, our problem is humble.  The messages the components send, at some point, are almost all represented in JSON, so we need some way to define that JSON centrally to ensure that all participants conform to the same schema and to make it testable. The best way to do this is to provide developers with standard objects—beans, in the Java world—that emit and accept the JSON.   But we don’t want to write and maintain the bean code in five languages as things evolve. How do we get around that?
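To show what such a bean looks like in one of the languages listed above, here is a minimal Python sketch. The message type `SensorReading` and its fields are hypothetical, invented purely for illustration; they are not the company's actual schema, and this is not the approach the rest of the post settles on.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SensorReading:
    """Hypothetical message type; the field names are illustrative only."""
    device_id: str
    temperature_c: float
    timestamp: int

    def to_json(self) -> str:
        # Emit the message in its wire format.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SensorReading":
        # Accept the wire format. The constructor raises TypeError on
        # missing or unexpected keys; it does not check value types.
        return cls(**json.loads(text))


reading = SensorReading("gw-01", 21.5, 1700000000)
wire = reading.to_json()
print(wire)
print(SensorReading.from_json(wire) == reading)
```

Writing this once is trivial. The pain is keeping five hand-written versions of it, one per language, in step as the schema changes.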



Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and is full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.



No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet, we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications.


Big Jobs Little Jobs

You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.


This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years but the processing load it describes remains representative of the general state of things today, and back-of-the-envelope analysis of the data presented in the article yields some interesting insights.


Live Long and Process

One great thing about working for Hortonworks is that you get to try out new features before they are released, with real feature engineers as tour guides. One such feature is LLAP, which stands for Live Long and Process. LLAP is coming to Hortonworks with Hive 2.0 and (spoiler alert) it looks like it will be worth the wait.

Originally a pure batch processing platform, Hive has sped up enormously over the last couple of years with Tez, the cost-based optimizer (CBO), ORC files, and a host of other improvements. Queries that once took minutes now (1st quarter 2016) run at interactive speeds, and LLAP aims to push latencies into the sub-second range.

