In my previous blog post, Applying data science to policy, I talked about our first attempt at developing a data science software application to automate some painful parts of the policy consultation process. I alluded to the challenges we faced and the fact that we could learn from them to deliver a better product and become a better data science team. This post gives a little more detail about our experiences. It’s basic stuff, but it’s easy to lose sight of that when you’re neck-deep in code.
Keep it simple
One of the reasons the first iteration took so long is that we fell into the ‘shiny thing’ trap. As we developed, we encountered issues and were often presented with two choices. We could do something quick and simple that resolved 80% of the issue, or we could invest the time and effort into a more complex solution that attempted to solve 95% or more of it. Often, we took the second path without trying the first.
One good example is the removal of ‘junk text’ of no value to the analysis (for example, signatures and virus-scanning notices). We spent a lot of time building a machine learning algorithm to solve this. It didn’t work as we’d hoped, and when compared to a simpler pattern-based system (regular expressions, to the coders among you), the simpler system performed a lot better. The lesson? Simplicity often beats complexity. Start simple, iterate, and beware the allure of shiny new things.
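To make that concrete, here’s a rough sketch of what the pattern-based approach looks like in Python. The patterns and function name are hypothetical rather than our production code, but the idea is the same: a small set of regular expressions that strip the obvious junk before analysis.

```python
import re

# Hypothetical patterns for common junk text; in practice these are built
# up by inspecting real responses and adding a pattern per nuisance.
JUNK_PATTERNS = [
    re.compile(r"^--\s*$.*", re.MULTILINE | re.DOTALL),  # signature delimiter and everything after it
    re.compile(r"sent from my \w+", re.IGNORECASE),      # mobile email footers
    re.compile(r"this email has been (scanned|checked) for viruses.*",
               re.IGNORECASE | re.DOTALL),               # virus-scanning notices
]

def strip_junk(text: str) -> str:
    """Remove boilerplate that adds no value to the analysis."""
    for pattern in JUNK_PATTERNS:
        text = pattern.sub("", text)
    return text.strip()

print(strip_junk("I support option B.\nSent from my iPhone"))  # -> "I support option B."
```

The appeal is that each pattern is readable and cheap to add, which is exactly what starting simple and iterating looks like in practice.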
Processes
We also unwittingly adopted the ‘LGTM’ mentality. LGTM stands for ‘Looks Good To Me’ and is often used in software development circles to indicate a ‘light’ code review. Some of our code didn’t work as intended when we deployed it, which meant a lot of time spent identifying and resolving bugs and rewriting code, particularly for edge cases.
So, we’ve now adopted stricter code review processes and coding standards with a ‘little and often’ mentality. We supplement this with regular ‘show the thing’ sessions to review and learn from each other’s code, and stricter testing principles to make sure that our new code doesn’t break the old code. There is still more to learn, of course, and it’s easy to think of this time as unnecessary. After all, reviewing someone else’s code isn’t fun. But it sure beats fixing someone else’s code. Or worse, fixing your own code from 6 months ago.
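As a flavour of what ‘new code doesn’t break the old code’ means in practice, here is a minimal regression test written with pytest. The module path and cases are hypothetical, but the principle is that every edge case we fix gets a test, so it stays fixed.

```python
import pytest

from cleaning import strip_junk  # hypothetical module containing the junk-text remover

# Each case captures behaviour we rely on; if a future change breaks it,
# the test suite fails before the code is deployed.
@pytest.mark.parametrize("raw, expected", [
    ("I support option B.\nSent from my iPhone", "I support option B."),
    ("", ""),                            # empty responses shouldn't crash anything
    ("No junk here.", "No junk here."),
])
def test_strip_junk_regression(raw, expected):
    assert strip_junk(raw) == expected
```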
Deployment & reusability
One of the things we got right was deployment. Adopting a Cloud First policy, we developed the application to be hosted in the cloud. This gave us a lot of freedom, but it also presented new challenges. How could we quickly and easily deploy and manage an application consisting of numerous databases, components and interconnected parts? Doing so manually would have taken us around 3 hours of downloading, deploying and configuring each time.
The solution was a technology called Docker. This was more work up front, as we had to learn Docker and configure it for our application. But once it was up and running, the benefits became clear:
- deploying the application was as simple as a few lines of code which pretty much anyone could run (there’s a short sketch of this below)
- it made our application more shareable and reusable
- it saved us literally days of time during the development, testing and deployment process
- having learned it, we’ve applied it to other projects
Docker (https://www.docker.com/) is also free, open source and widely used, with a wealth of documentation and examples.
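For a sense of what ‘a few lines of code’ means here, below is a minimal sketch using Docker’s Python SDK. The image name, port and container name are hypothetical, and the same thing can equally be done from the docker command line or a Compose file; the point is that starting the application becomes a short, repeatable script rather than hours of manual setup.

```python
import docker  # pip install docker

client = docker.from_env()  # talks to the local Docker daemon

# Pull and start a (hypothetical) pre-built image of the application,
# mapping the container's port 8000 to port 80 on the host.
container = client.containers.run(
    "our-registry/consultation-tool:latest",
    name="consultation-tool",
    ports={"8000/tcp": 80},
    detach=True,
)
print(container.name, container.status)
```

For a multi-container setup like ours, with databases and other components, the whole stack can be described once in a Compose file and brought up with a single command, which is where most of the time saving came from.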
Always be data sciencin’
The last takeaway was about the role of the data scientist. At the time of building this, our digital department was undergoing a transformation to become more focused on innovation. They supported us with hardware and a cloud platform, which was game-changing in terms of the possibilities it presented. But they didn’t yet have people in place to contribute, so we brought in a contractor and learned things ourselves.
It was fun to learn how to build web applications, and we could apply these skills to our other projects. But this shouldn’t come at the expense of core data science skills, including data processing, analysis and machine learning. One big lesson for me personally is that it’s very easy for your skillset to develop around projects. You should step back and take a more strategic view; otherwise, before you know it, you’ve turned into a software developer.
So, for this year I’ve come up with a data-science-oriented development plan and I’m dedicating time in my diary to it. I’ll no doubt learn wider skills on the job too, and I’m fine with that as long as they’re ‘as well as’ and not ‘instead of’ core data science skills.