Through the needle’s eye: Data science in production

Data science is often associated with quick-and-dirty data analysis and machine learning solutions that do not follow established software engineering practices. Many regard data science as a Swiss army knife that combines incompatible data and software components in impromptu ways.

A data science project usually starts with web scraping or reading data from data lakes. The data thus collected need to be formatted and pre-processed using multiple tools (command-line tools, shell scripts, Python scripts, data cleaners, etc.), and then loaded into an interactive environment (be it a notebook or an IDE) for exploration. After trying out several strategies for data analysis, a data scientist presents her insights in graph form or communicates the results to management on paper or with slideshows. A successful data model can then be handed over to backend engineers who re-implement it in production. The process is messy, but it does not have to be like that!

These days most software startups use an agile development strategy, i.e. they release early and release often. Therefore, there is little time to move through the standard process of designing data algorithms: the new approaches developed by a data scientist should be deployed directly into production. To some backend engineers this might sound like a recipe for disaster, but it's not necessarily so, as long as we, data scientists, adhere to a few simple practices.


Make code not words

The question of programming languages evokes extreme emotions in most developers. Data scientists tend to choose languages that offer short development cycles (no compilation, dynamic typing, etc.) and high-quality data munging libraries. Software engineers, on the other hand, prefer languages that offer reliability, performance and reproducible build tools, and that provide high-quality web frameworks.

These requirements are not necessarily exclusive, and you may find a few languages that fulfill the needs of data scientists and engineers alike. Personally, I would choose Python, but there are other popular choices (for example, Scala). Therefore, I would recommend that aspiring data scientists invest their time in learning one of these languages together with a versatile editor (like Atom), even if it means leaving the comfort zone of data-aware IDEs and interactive notebooks (unfortunately, executing Jupyter notebooks is still an open problem).

That said, a serverless approach such as AWS Lambda offers an alternative to a one-language-to-rule-them-all strategy. By providing encapsulated environments, lambdas stay language-agnostic, provide a coherent interface for data interchange and can easily be combined with functions implemented in other languages. Although very tempting, such microservice architectures remain challenging in terms of maintainability, and monolithic single-language architectures are often preferred.
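To make the idea concrete, here is a minimal sketch of an AWS Lambda handler written in Python; the payload shape and the predict helper are hypothetical placeholders for whatever model your function wraps.

    import json

    def predict(features):
        # Hypothetical stand-in for a trained model; replace with your own.
        return sum(features) / len(features)

    def handler(event, context):
        # AWS Lambda calls this entry point with the request payload in `event`;
        # here we assume a JSON body containing a "features" list.
        body = json.loads(event.get("body", "{}"))
        score = predict(body["features"])
        # Returning statusCode/body matches the API Gateway proxy response format.
        return {"statusCode": 200, "body": json.dumps({"score": score})}

Because the contract is just JSON in, JSON out, the same endpoint could later be re-implemented in another language without the callers noticing.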


In the world of coding styles

No matter which language you choose, you absolutely must adhere to common coding conventions. These conventions specify how classes, functions and variables should be named (lowercase, UPPERCASE or camelCase, with or without underscores, etc.), how to indent code blocks, and where and in which order to include headers and import external libraries. If you don't know these conventions, ask your colleagues; they will most likely be able to point you to the coding style guidelines (such as PEP8 for Python devs) or even give you configuration files for automatic checkers called linters.

Linters (such as pylint, pep8, etc.) are usually integrated into your editor (or the relevant plugins can be installed separately), and they go through your code as you type (or when you save a file), highlighting any departures from the coding conventions.
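As a quick illustration, here is a tiny Python snippet written the way PEP8 expects: snake_case for functions and variables, CapWords for classes, four-space indentation and imports at the top of the file (the names are made up for the example).

    import statistics


    class FeatureScaler:
        """Scale a single feature to zero mean and unit variance."""

        def __init__(self, values):
            self.mean = statistics.mean(values)
            self.stdev = statistics.stdev(values)

        def transform(self, value):
            return (value - self.mean) / self.stdev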

Adherence to coding conventions improves the readability and maintainability of the code. Your future self will also thank your present self for keeping the code clean and consistent.

Code peer review – on the shoulders of giants

While working on data projects, we often neglect tracking changes in the code. I don't mean the old-fashioned, yet widely practiced, copying of files to folders with version names, but rather using a dedicated version control system. Failure to use such a system can cause severe problems with the reproducibility of your analysis workflow (for example, you might not be able to recover the parameters that gave the best prediction result). A good version control system such as Git is the way to circumvent such problems. Although there are several alternatives, Git is the industry standard, and it also powers the popular social coding platform GitHub.

GitHub is a platform for collaborative code editing (and it is not limited to computer code: it can just as easily be used for documents, 3D models, graphics, etc.). One of its killer features is the so-called pull request (PR) model, which allows any developer to propose changes to a project she does not own. These changes can later be reviewed by other contributors, who either suggest corrections or accept the PR and merge it into the main (master) branch 🎉🎉🎉. This is a common model for the development of many open-source libraries (Git itself was originally created for Linux kernel development) and it is also widely adopted in the tech industry.

The essential step in this process is the code peer review, i.e., an independent check of the code by another developer. Such reviews have many advantages. Most importantly, they allow bugs to be caught early. Secondly, they enforce the coding conventions. Lastly, they are a form of peer learning that makes developers learn from each other.


Therefore, next time make sure to ask a developer who did not contribute to your patch (like a backend engineer) to review your code. Also, when you are asked to do the same, never refuse (even when you think the code is outside your competences) and be diligent about your review!

Test-driven development is not an oxymoron

The “test-driven development”, or TDD, might ring a bell. You might have heard that some people write tests even before implementing the functionality. This is not a legend.

Although you don’t have to go all-TDD, testing, and in particular unit testing, drastically improves code quality even if the tests are written after the function. You might think it a waste of time, but believe me, it will save you more time than any autocompletion feature or new array computing library. The “unit” part of unit testing means that you only test a single feature of your code, abstracted from all the other functions: you test the outputs of the function for sample arguments. These outputs might be a return value, data stored in a database, or a call to another function. What comes in handy are various mocking libraries, which allow you to replace some functions with “mock-ups” whose calls can easily be inspected and compared with the expected ones. Writing good test cases might be an art in itself, but I think you already practice it when you test your data algorithms or machine learning models with minimal, artificial inputs. But don’t forget the corner cases!
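As a minimal sketch (the function and test names are made up for the example), a unit test using Python's built-in unittest and unittest.mock could look like this:

    import unittest
    from unittest import mock


    def score_user(user_id, fetch_features):
        # Toy function under test: fetch the user's features and return their sum.
        features = fetch_features(user_id)
        return sum(features)


    class ScoreUserTest(unittest.TestCase):
        def test_score_is_sum_of_features(self):
            # Replace the real feature store with a mock returning fixed data.
            fake_fetch = mock.Mock(return_value=[1.0, 2.0, 3.0])
            self.assertEqual(score_user(42, fake_fetch), 6.0)
            # The mock records its calls, so we can verify the expected arguments.
            fake_fetch.assert_called_once_with(42)


    if __name__ == "__main__":
        unittest.main()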


Continuous integration is better than free lunch

Continuous integration is a system that runs the whole test suite on each commit, usually on a dedicated server in a clean environment. It therefore allows you to detect any bugs introduced by your change (called regressions) before they are pushed to production. Continuous integration is not a way to catch you red-handed, but rather to release you from worries about shipping untested code. If your code coverage is close to 100% (it is an asymptote, so don't ever hope to reach it, and you should not trust code coverage blindly), you can be almost sure that your change will not break the application.


Personally, I treat unit tests as a piece of documentation – they document the protocol under which the software components should interact and under what conditions they may be run. Continuous integration is a way to enforce the protocol in new commits before they are merged.

If some bugs still manage to slip through the narrow filter of unit testing and continuous integration, make sure to reproduce them in a well-tailored test case before fixing them, so you can catch them early next time.
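For example (a hypothetical sketch), if an empty input once crashed a function in production with a division by zero, you might first pin the bug down in a pytest-style regression test and only then ship the fix:

    def average(values):
        # Fixed version: guard against the empty-input case that crashed in production.
        if not values:
            return 0.0
        return sum(values) / len(values)


    def test_average_handles_empty_input():
        # Regression test reproducing the production bug (division by zero on []).
        assert average([]) == 0.0


    def test_average_of_known_values():
        assert average([2.0, 4.0]) == 3.0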

Continuous integration systems are tightly coupled with build systems, which compile and package your code for distribution and sometimes even deploy it to production. Build systems create reproducible environments by obtaining the code and its dependencies in well-defined versions from reliable sources. Any problem in production can then be traced back to a particular version of the code or of its dependencies. These days this is usually achieved by distributing the code as Docker containers.

Quality comes first

Yes, it’s still about testing. Quality is of paramount importance for all systems running in production. If you deploy a machine learning model to production, you must make sure that it will handle its peak traffic, that it will not introduce security vulnerabilities or data leaks, and that it will run under all conditions and with all kinds of possible inputs (especially if its inputs are exposed to the external world through a public API). Even the best programmer, the best data scientist and the best continuous integration system will not ensure the total reliability of a data algorithm or any other code. Therefore, testing and monitoring production systems are essential. A few important approaches can be listed:

  • dogfooding – using your own product and checking all possible corner cases is vital. Organize frequent dogfooding sessions and invite people from outside your team.
  • error handling – make sure that you have a system in place that will catch any unhandled exceptions raised in production and report them back to you (Sentry is a popular and proven solution; see the sketch after this list).
  • stress testing – try to test your application under realistic traffic (this could be simulated traffic or a replay of historical data).
  • performance monitoring – monitor all API calls, compute resources and database operations, and identify bottlenecks quickly.
  • system monitoring – this is mostly for DevOps, but you should be aware of any potential points of failure in the system (database crashes, compute node overload, memory leaks, hardware failures).
  • security monitoring – although this comes last, it should probably be the first thing to think about. If you handle sensitive data (even something as trivial as usernames), you must know that there are ways for bad actors to retrieve it without your permission and use it to target your users. Therefore, make sure to run live protection over your application. There is a myriad of solutions, but my favourite is obviously Sqreen.
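To illustrate the error-handling point, a minimal Sentry setup in a Python service could look roughly like the sketch below; the DSN and the risky_prediction function are placeholders for the example.

    import sentry_sdk

    # The DSN is a placeholder; in practice it comes from your Sentry project settings.
    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")


    def risky_prediction(payload):
        # Hypothetical model call that may raise on malformed input.
        return payload["features"][0] * 2


    try:
        risky_prediction({"features": []})
    except Exception:
        # Many frameworks report unhandled exceptions automatically once the SDK
        # is initialised; here we capture the exception explicitly.
        sentry_sdk.capture_exception()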


Communication is the new oil

No matter whether you are already familiar with all the points above and have run thousands of systems in production, you should always communicate with your colleagues and clients. Make sure that they know about the new data algorithm that you released today or are planning to release (and try to explain in simple terms how it works). Be the first to catch a potential vulnerability in the code and communicate it to the security owners (if your company does not have a dedicated security team, talk to your CTO or senior engineers; don't discuss it in public). If you realise that your recent release may contain a loophole or introduce a memory leak, don't wait to report it to your colleagues. Make sure to explain the magnitude and possible consequences of a data leak in your application to the team, clients and users. And most importantly, remember that everyone needs to learn and everybody makes mistakes; admitting this to yourself and others leads you towards mastery.


Acknowledgment and disclaimer

This advice comes from my experience as a software instructor at Carpentries bootcamps and Advanced Python Programming schools, from my contributions to open source software (such as NumPy, matplotlib or, recently, pandas), but also from my recent adventure as a senior data scientist / backend engineer at Sqreen! Nonetheless, the content of this post reflects solely my own views and was not endorsed or edited by other members of these communities.

If you have any feedback on the blog post or find an error, please contact me ASAP!