
Lessons learned from (almost) failing to deploy a simple Machine Learning model...

source link: https://mc.ai/lessons-learned-from-almost-failing-to-deploy-a-simple-machine-learning-model-in-the-cloud/

Once we validated that this API worked locally, our next objective was to migrate it to a “free to use” and “infinitely scalable” AWS service. With that in mind, we took the path of AWS API Gateway and AWS Lambda functions. Because we had had positive previous experiences with these tools, we did not even bother checking their limitations; we “knew” how to do it.

Our first issue was launching a Python Lambda function with OpenCV and other dependencies. We learned that this could be done with Lambda layers and by packaging our Lambda function together with its dependencies. Below are the lessons learned after a lot of troubleshooting:

  • To download and package Lambda function dependencies correctly, you must download them from an Amazon Linux-compatible system (running pip install -t . opencv-python from macOS, for instance, will download incompatible binaries);
  • Lambda functions running on Python 3.8 could not import and run OpenCV, because system libraries such as libgthread-2.0 are not included in that runtime environment; the Python 3.7 runtime had to be used instead.
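As a rough sketch of what this packaging workflow looks like, the dependencies can be built inside an Amazon Linux container so the compiled wheels match the Lambda runtime. The image tag, target paths, and file names below are illustrative assumptions, not the exact commands we used:

```shell
# Build Lambda-compatible dependencies inside an Amazon Linux-based
# SAM build image (hypothetical layout; adjust names to your project).
docker run --rm -v "$PWD":/var/task \
  public.ecr.aws/sam/build-python3.7 \
  pip install opencv-python -t /var/task/package

# Zip the dependencies, then add the handler on top for upload.
cd package && zip -r ../function.zip . && cd ..
zip -g function.zip lambda_function.py
```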

After learning how to download and package our script’s dependencies correctly, we uploaded the script to the Lambda environment and noted that the remaining step was to import and use our trained ML models.

We were confronted with a harsh reality once again: Lambda functions have restrictive storage limits. Reading Lambda’s official documentation, we noted that each function’s deployment package has a maximum size of 250 MB, while our two ML models were 500 MB and 600 MB.
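A simple pre-flight check along these lines would have caught the problem before upload; the helper below is a hypothetical sketch, not part of the original project:

```python
import os

# Lambda's documented limit for an unzipped deployment package, in bytes.
LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024

def fits_in_lambda(paths):
    """Return True if the combined size of the given files stays
    under the Lambda deployment-package limit."""
    total = sum(os.path.getsize(p) for p in paths)
    return total <= LAMBDA_UNZIPPED_LIMIT
```

With a 500 MB model in the list, the check fails immediately, which is exactly the wall we ran into after the fact.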

Please note that this might not be accurate anymore, since Lambda now supports EFS.

Once again, we tripped over the same stone: We did not take the time to read and learn about the limitations of the products we wanted to use. We were, once more, blindly guided by the hype and great use cases delivered by this Function-as-a-Service product. In this case, we overestimated our knowledge of the technology based on great results we had from previous experiences.

What should have been done before trying to port our code? Do some basic research and understand if the product is adapted to our needs. It appears that selecting the most popular product can sometimes mean making a very poor technological choice. See for example — what we should have read before even starting — “Why we don’t use Lambda for serverless machine learning” .

IV. Underestimate the difficulty and overestimate your knowledge

This project was driven by the belief that we knew how to make things “properly” even though we had very little experience in the fields we were addressing (i.e. AI and Cloud). The fourth and last capital mistake is in fact the cause of the other mistakes: we underestimated the difficulty of delivering production-ready products while being overly confident that we had sufficient knowledge of how things should be done, based on what we read and hear about the technologies.

We started our project by having a look at large datasets available online. We mainly searched through Kaggle and found a dataset with 10,000 faces labeled with their ages. “That’s great!” — we thought — but when we started diving a bit deeper into the data, we realized that the road to creating a good predictor was not going to be as peaceful as we thought.

First mistake: underestimating how difficult it would be to obtain a proper dataset for free.

Amount of data: Although we never considered ourselves experts in the field of AI, we knew that having only 10,000 faces from 100 different targets would not be sufficient to train an accurate model. In this context, we realized that the amount of data was insufficient. In addition, when we analyzed the number of faces per age, we realized that the dataset was completely unbalanced. Although we tried to balance it by grouping the faces by age range, the result remained very unbalanced: 90% of the available pictures were of individuals aged 30 to 50.
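The kind of balance check described above can be sketched in a few lines of Python; the bin size and example ages below are made up for illustration:

```python
from collections import Counter

def age_distribution(ages, bin_size=10):
    """Group ages into bins (e.g. 30-39) and return the share of
    samples in each bin, so imbalance is easy to spot."""
    bins = Counter((age // bin_size) * bin_size for age in ages)
    total = len(ages)
    return {f"{b}-{b + bin_size - 1}": count / total
            for b, count in sorted(bins.items())}

# Made-up ages: most samples fall in the 30-49 range, mirroring
# the imbalance we found in the real dataset.
sample_ages = [8, 25, 31, 33, 35, 38, 41, 44, 45, 47, 52, 71]
print(age_distribution(sample_ages))
```

Running a check like this on day one would have shown us immediately that some age bins were nearly empty.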

Quality of data: In addition to the previous issue, we discovered that the data was biased. Most of the pictures were of celebrities. We knew that we could not use pictures of Keanu Reeves and hope that the average-looking person looks as young as he does. On top of that, some images were cropped and showed only an incomplete part of the face.

Lesson learned: Before jumping into training the neural network with a dataset, take the time to ensure that the data you have downloaded has sufficient quality and quantity of samples.

Second mistake: overestimating our knowledge of AI, the Cloud, and the target solution.

As very curious people, we have always tried to learn as much as possible about everything in the shortest amount of time. The concepts of Cloud and Artificial Intelligence were two topics on which we spent a significant amount of time reading about the great things companies and startups build, but very little time understanding and diving deep into the technology being used.

In essence, our knowledge of the technologies we wanted to work with was very low, although we felt like we knew a lot. This can be illustrated by the graph below, which we are sure is familiar to many of you:

Dunning-Kruger Effect (source)

Without spending too much time analyzing this graph, we can confirm that at the beginning of this project we were somewhere around the “Peak of Mount Stupid” and that, over time, we descended into the Valley of Despair. This project helped us realize how little we knew about Machine Learning.

Truth is, selecting the right piece of technology and the right data to begin with is not always simple. There is a significant gap between creating a working Proof of Concept and deploying a user-friendly, “final” product.

The result? Google Cloud, Virtual Machines and Docker!

The picture above contrasts our expectations with reality. Our multiple mistakes and changes led us to the following “final” outcome:

  • Instead of training a model with our own dataset, we ended up using a trained model with a better dataset;
  • Instead of porting our back-end to SageMaker, and then to Lambda, we used Docker containers to serve a Flask API;
  • Instead of hosting our back-end on AWS, we ended up using Google Cloud and the 300€ provided as part of an account’s free tier.
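A minimal version of the containerized Flask API described above might look like the sketch below; the /predict route name and the dummy predict_age stub are assumptions, with the real project plugging its trained model into the stub:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_age(image_bytes):
    """Placeholder for the real model inference; the actual project
    would run the pre-trained network on the uploaded image here."""
    return 30  # dummy value for the sketch

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a multipart upload with an "image" field.
    if "image" not in request.files:
        return jsonify(error="no image uploaded"), 400
    age = predict_age(request.files["image"].read())
    return jsonify(age=age)

if __name__ == "__main__":
    # In the containerized setup this server would sit behind NGINX.
    app.run(host="0.0.0.0", port=5000)
```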

By doing a self-assessment against our main objectives, we reached the following conclusions:

  • Follow “state of the art” practices to train our model and deploy our product: Although our solution is not the one we expected, we tried to use the most robust and relevant technology. In that sense, we decided to use a popular and robust trained model rather than our own. Regarding the front-end, it is stored in an S3 bucket served by a CloudFront CDN. This means that tens of thousands of users could access the front-end without us having to scale anything on our side. Looking at the back-end, we decided to containerize our Flask API and serve it with an NGINX container in order to make our app replicable and portable. We consider ourselves sufficiently satisfied with the result!
  • Have a running cost of 0€ per month: We actually ended up cheating on this point, since our back-end runs in a Virtual Machine in Google Cloud. However, thanks to the free tier account, we will be able to run it for free for at least a year. We’ll say that we partially validated this point.
  • Have a product which could scale “infinitely” thanks to serverless services: We pretty much failed at this point. But we are confident that serverless solutions could still meet our needs in the future. Let’s hope not many people read this article!
  • Build a user-friendly “final product” allowing people to have fun with it without any explanation: We will let you decide for yourselves! Feel free to visit https://www.face2age.app and tell us if the UX seems appropriate to you!
  • Enjoy the building process: Thankfully, this was the point we covered the most! We strongly recommend jumping in a small project with a friend or colleague of yours to learn new things!
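The Flask-behind-NGINX layout described in the first point above could be wired up with a docker-compose file along these lines (service names, ports, and image tags are illustrative assumptions, not the project's actual configuration):

```yaml
# Illustrative docker-compose layout for the containerized back-end.
version: "3"
services:
  api:
    build: .              # Dockerfile that runs the Flask API
    expose:
      - "5000"            # reachable only from the nginx service
  nginx:
    image: nginx:stable
    ports:
      - "80:80"           # public entry point
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - api
```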

Last words and what’s next

During the building process of this project, we noted that the mistakes we kept making became smaller and less time-consuming as our ability to find pragmatic solutions grew. Below are the lessons learned from this project:

  • Before jumping straight into coding, perform multiple Google searches and compare other solutions to what you want to build;
  • After drafting or architecting a solution, take some time to think about possible limitations and problems that might occur. Usually, these arise whenever you make an assumption without a solid previous experience;
  • If you have never done something before, avoid feeling confident about your plan and multiply your time expectations by 3.

So what’s next?

Although we are quite happy with our solution and the results achieved with this project, we are aware that it is far from being “production-ready”, since we are currently hosting our model on a single (and not very performant) Virtual Machine. This means that if our project suddenly becomes popular, it will simply crash. Multiple solutions could be used to make the project scalable at a very limited cost. Below are some of our thoughts:

  • Replace the back-end with an already existing solution. We could, for example, use AWS Rekognition and its 5,000 free monthly predictions.
  • Migrate the back-end to a PaaS product such as AWS ECS or Fargate which would allow us to scale according to demand. However, this solution would translate into higher costs and feels a bit “overkill”.
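To sketch the Rekognition option: its DetectFaces API returns an AgeRange per detected face, whose midpoint could serve as the prediction. The helper names and region below are hypothetical, and the live call requires AWS credentials:

```python
def age_from_face_details(face_details):
    """Return the midpoint of Rekognition's predicted age range
    for the first detected face, or None if no face was found."""
    if not face_details:
        return None
    age_range = face_details[0]["AgeRange"]
    return (age_range["Low"] + age_range["High"]) / 2

def estimate_age(image_bytes, region="eu-west-1"):
    """Call Rekognition's DetectFaces API; region is an illustrative
    choice and valid AWS credentials are required."""
    import boto3  # deferred so the pure helper works without the AWS SDK
    client = boto3.client("rekognition", region_name=region)
    response = client.detect_faces(Image={"Bytes": image_bytes},
                                   Attributes=["ALL"])
    return age_from_face_details(response["FaceDetails"])
```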

About us

I am Jonathan Bernales , a Cyber Security consultant working at Deloitte Luxembourg, I mainly work for the Financial Services Industry and I am passionate about how technology and security can transform industries and create business opportunities over time.

I am Manuel Cubillo , a Software Engineer who loves technology and the way that it can impact on our lives. I am passionate about artificial intelligence, cloud and software development. I am always looking to learn something new. On my free time, I enjoy sports and reading.

Feel free to get in touch with us should you have any questions or would like a tutorial about this project!

