<h3 id="setting-up-an-orchestration-engine-for-machine-learning-operations-with-kubernetes-and-kubeflow">Setting up an Orchestration Engine for Machine learning operations with Kubernetes and Kubeflow.</h3>
<p>Hello!</p>
<p>This is Part 2 of the Machine Learning Operations at Scale series. Part 1 is an introduction and can be found <a href="https://adekunleba.github.io/Machine-learning-Operations-at-scale-(Part-1)/">here</a>.</p>
<p>In this article, I will walk through setting up the orchestration environment, which I believe is one of the most important environments for actually doing MLOps.</p>
<h3 id="importance-of-orchestration">Importance of Orchestration</h3>
<p>Orchestration allows you to build what is called a <strong>Pipeline</strong> for machine learning models. This is a fancy word for chaining together the various stages of a machine learning workflow: first the data science stage, comprising data extraction, feature engineering, model building and hyperparameter tuning; then the production stage, which includes data versioning, monitoring, model deployment and model versioning.</p>
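<p>As a loose illustration (plain Python, unrelated to any particular orchestration tool, and with hypothetical stage names), a pipeline is simply an ordered chain of stage functions where each stage consumes the previous stage’s output:</p>

```python
# Illustration only: a "pipeline" as an ordered chain of stages,
# where each stage consumes the previous stage's output.
def extract_data():
    return [1.0, 2.0, 3.0]          # stand-in for data extraction

def engineer_features(rows):
    return [x * 2 for x in rows]    # stand-in for feature engineering

def train_model(features):
    # stand-in for model building: return the mean as the "model"
    return sum(features) / len(features)

def run_pipeline(stages):
    result = None
    for stage in stages:
        result = stage() if result is None else stage(result)
    return result

print(run_pipeline([extract_data, engineer_features, train_model]))  # 4.0
```

<p>What orchestration engines add over a sketch like this is everything around the chain: scheduling, retries, containerized execution and artifact tracking between stages.</p>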
<p>There are several tools people use for orchestration. Some allow you to do more robust orchestration; an example is Kubeflow, which builds on Kubernetes and therefore gives you Kubernetes’ orchestration capability for free. Others try to keep things simple compared to managing a Kubernetes cluster; I am thinking of Airflow, Luigi and MLflow in this category.</p>
<p>Hello Kubeflow!</p>
<figure>
<img src="/images/kubeflow.png" />
<figcaption><a href="https://cloud.google.com/blog/products/ai-machine-learning/getting-started-kubeflow-pipelines">Source</a></figcaption>
</figure>
<p>With Airflow, you can easily create an end-to-end pipeline for whatever project you want, which may even be run <code class="language-plaintext highlighter-rouge">via Docker</code> (not always necessary, but useful when you need it), and once you deploy your pipeline you can trigger it with maximum flexibility, depending on your engineering prowess. Airflow orchestration will, however, require some solid software development skills; then again, who doesn’t need those anyway?</p>
<p>In general, an orchestration mechanism is how you create a pipeline and chain the machine learning process for each model you want to deploy and maintain in production.</p>
<h3 id="setting-up-local-kubeflow">Setting up local Kubeflow.</h3>
<p>To set up our environment, the first thing we need to get out of the way is the backbone of Kubeflow: Kubernetes. What is Kubernetes? Kubernetes is an engine for managing container environments that helps you provision <strong>app</strong> deployments at scale using something as basic as a YAML configuration file.</p>
<p><em>Side note: the fact that Kubernetes is provisioned with YAML files doesn’t make it easy to manage. Managing Kubernetes can be tedious, which is why you should use it only when you are convinced it is the way to go.</em></p>
<p>If you are using Docker Desktop on Windows or macOS, that is a very good place to start, as you can easily initiate a single-node Kubernetes cluster from Docker Desktop. The process is quite simple and is officially documented on <a href="https://docs.docker.com/desktop/kubernetes/">docker</a>.</p>
<p>For Linux users, an alternative is to use one of the following tools:</p>
<ul>
<li>Kind</li>
<li>Minikube</li>
<li>k3s</li>
</ul>
<p>Setup for minikube is also quite straightforward.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-LO</span> https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
<span class="nb">sudo install </span>minikube-linux-amd64 /usr/local/bin/minikube
</code></pre></div></div>
<p>And we can create a cluster with</p>
<p><code class="language-plaintext highlighter-rouge">minikube start</code></p>
<p>If you encounter any issues installing a local Kubernetes cluster with minikube, <a href="https://minikube.sigs.k8s.io/docs/start/">check here</a> for further details.</p>
<p>Similar to minikube, the setup for kind is also very straightforward:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-Lo</span> ./kind https://kind.sigs.k8s.io/dl/v0.11.1/kind-linux-amd64
<span class="nb">chmod</span> +x ./kind
<span class="nb">mv</span> ./kind /some-dir-in-your-PATH/kind
</code></pre></div></div>
<p>And you can start a new cluster with
<code class="language-plaintext highlighter-rouge">kind create cluster --name your-cluster-name</code></p>
<p>Refer to the <a href="https://kind.sigs.k8s.io/docs/user/quick-start">kind documentation</a> in case of any issues.</p>
<p>The setups above need Docker as their backend to create the Kubernetes deployment. They are somewhat resource intensive and take time when creating the cluster, so make sure you have enough RAM and storage space.</p>
<p>Wow, that was a lot to take in! k3s does boast a somewhat lightweight Kubernetes deployment, but we will not delve into that here.</p>
<p>Once we are done with the Kubernetes deployment, the next thing is to deploy Kubeflow on our Kubernetes cluster.</p>
<p>Important concepts to know in the world of Kubernetes are the <code class="language-plaintext highlighter-rouge">namespace</code> concept and the <code class="language-plaintext highlighter-rouge">kind</code> of application, which could be <code class="language-plaintext highlighter-rouge">Deployment</code>, <code class="language-plaintext highlighter-rouge">Service</code> and several others. These are important when creating the configuration for your application. Interestingly, some tools ship with this configuration already created, which leaves us with just applying it to our Kubernetes cluster.</p>
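<p>As a concrete, hypothetical example (the names here are placeholders, not part of the Kubeflow install), a minimal Deployment manifest shows where the <code class="language-plaintext highlighter-rouge">kind</code> and <code class="language-plaintext highlighter-rouge">namespace</code> fields sit:</p>

```yaml
apiVersion: apps/v1
kind: Deployment            # the kind of object this manifest creates
metadata:
  name: model-server        # placeholder name
  namespace: kubeflow       # the namespace the object lives in
spec:
  replicas: 1
  selector:
    matchLabels: {app: model-server}
  template:
    metadata:
      labels: {app: model-server}
    spec:
      containers:
      - name: model-server
        image: python:3.8   # placeholder image
```

<p>Applying a file like this with <code class="language-plaintext highlighter-rouge">kubectl apply -f</code> is exactly the mechanism the Kubeflow manifests use, just at a much larger scale.</p>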
<h3 id="setting-up-kubeflow-for-kubernetes">Setting up Kubeflow for kubernetes.</h3>
<p>You will need to download the manifests for Kubeflow. A good approach is to clone the <a href="https://github.com/kubeflow/pipelines">kubeflow pipeline repository</a>; this will give you access to the manifests needed to run the commands below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply -k kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns
</code></pre></div></div>
<p>The commands above often take a little time, but if all goes well, which in most cases it will provided you have adequate resources, you should have a Kubeflow environment running.</p>
<p>Remember, the essence of this process is generally to have our orchestration engine that will help manage several machine learning model pipelines.</p>
<p>You can verify that the kubeflow environment is working perfectly by port-forwarding the running kubeflow on kubernetes to the local system:</p>
<p><code class="language-plaintext highlighter-rouge">kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80</code></p>
<p>Once this is done, we can play around with creating simple pipeline configurations from Python or uploading a YAML file to our Kubeflow UI.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"""
End to end simple pipeline
"""
import logging

import kfp
from kfp import components


def merge_csv(file_path: components.InputPath('Tarball'),
              output_csv: components.OutputPath('CSV')):
    import glob
    import tarfile

    import pandas as pd

    tarfile.open(name=file_path, mode="r|gz").extractall('data')
    df = pd.concat(
        [pd.read_csv(csv_file, header=None)
         for csv_file in glob.glob('data/*.csv')])
    df.to_csv(output_csv, index=False, header=False)


# Web downloader component - can be reused anywhere
web_downloader_op = kfp.components.load_component_from_url(
    url='https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component.yaml'
)

create_step_merge_csv = kfp.components.create_component_from_func(
    func=merge_csv,
    output_component_file='component.yaml',
    base_image='python:3.8',
    packages_to_install=['pandas==1.1.4']
)


def example_pipeline(url):
    web_downloader_task = web_downloader_op(url=url)
    # `file_path: InputPath(...)` is exposed as the `file` argument
    create_step_merge_csv(file=web_downloader_task.outputs['data'])


if __name__ == "__main__":
    client = kfp.Client(
        host="http://localhost:8080"
    )
    client.create_run_from_pipeline_func(
        pipeline_func=example_pipeline,
        arguments={
            'url': 'https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz'
        }
    )
    logging.info("Done creating pipeline...")
    # You can auto create a yaml configuration of your pipeline with this.
    kfp.compiler.Compiler().compile(
        pipeline_func=example_pipeline,
        package_path='pipeline_starter/pipeline.yaml')
</code></pre></div></div>
<p>And the final pipeline YAML will eventually look like this:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">argoproj.io/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Workflow</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">generateName</span><span class="pi">:</span> <span class="s">example-pipeline-</span>
<span class="na">annotations</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">pipelines.kubeflow.org/kfp_sdk_version</span><span class="pi">:</span> <span class="nv">1.6.4</span><span class="pi">,</span> <span class="nv">pipelines.kubeflow.org/pipeline_compilation_time</span><span class="pi">:</span> <span class="s1">'</span><span class="s">2021-07-04T15:25:19.247514'</span><span class="pi">,</span>
<span class="nv">pipelines.kubeflow.org/pipeline_spec</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{"inputs":</span><span class="nv"> </span><span class="s">[{"name":</span><span class="nv"> </span><span class="s">"url"}],</span><span class="nv"> </span><span class="s">"name":</span><span class="nv"> </span><span class="s">"Example</span>
<span class="s">pipeline"}'</span><span class="pi">}</span>
<span class="na">labels</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">pipelines.kubeflow.org/kfp_sdk_version</span><span class="pi">:</span> <span class="nv">1.6.4</span><span class="pi">}</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">entrypoint</span><span class="pi">:</span> <span class="s">example-pipeline</span>
<span class="na">templates</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download-data</span>
<span class="na">container</span><span class="pi">:</span>
<span class="na">args</span><span class="pi">:</span> <span class="pi">[]</span>
<span class="na">command</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">sh</span>
<span class="pi">-</span> <span class="s">-exc</span>
<span class="pi">-</span> <span class="pi">|</span>
<span class="s">url="$0"</span>
<span class="s">output_path="$1"</span>
<span class="s">curl_options="$2"</span>
<span class="s">mkdir -p "$(dirname "$output_path")"</span>
<span class="s">curl --get "$url" --output "$output_path" $curl_options</span>
<span class="pi">-</span> <span class="s1">'</span><span class="s">'</span>
<span class="pi">-</span> <span class="s">/tmp/outputs/Data/data</span>
<span class="pi">-</span> <span class="s">--location</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">byrnedo/alpine-curl@sha256:548379d0a4a0c08b9e55d9d87a592b7d35d9ab3037f4936f5ccd09d0b625a342</span>
<span class="na">inputs</span><span class="pi">:</span>
<span class="na">parameters</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">url</span><span class="pi">}</span>
<span class="na">outputs</span><span class="pi">:</span>
<span class="na">artifacts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">download-data-Data</span><span class="pi">,</span> <span class="nv">path</span><span class="pi">:</span> <span class="nv">/tmp/outputs/Data/data</span><span class="pi">}</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">annotations</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">author</span><span class="pi">:</span> <span class="nv">Alexey Volkov <alexey.volkov@ark-kun.com></span><span class="pi">,</span> <span class="nv">pipelines.kubeflow.org/component_spec</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{"implementation":</span>
<span class="s">{"container":</span><span class="nv"> </span><span class="s">{"command":</span><span class="nv"> </span><span class="s">["sh",</span><span class="nv"> </span><span class="s">"-exc",</span><span class="nv"> </span><span class="s">"url=\"$0\"\noutput_path=\"$1\"\ncurl_options=\"$2\"\n\nmkdir</span>
<span class="s">-p</span><span class="nv"> </span><span class="s">\"$(dirname</span><span class="nv"> </span><span class="s">\"$output_path\")\"\ncurl</span><span class="nv"> </span><span class="s">--get</span><span class="nv"> </span><span class="s">\"$url\"</span><span class="nv"> </span><span class="s">--output</span><span class="nv"> </span><span class="s">\"$output_path\"</span>
<span class="s">$curl_options\n",</span><span class="nv"> </span><span class="s">{"inputValue":</span><span class="nv"> </span><span class="s">"Url"},</span><span class="nv"> </span><span class="s">{"outputPath":</span><span class="nv"> </span><span class="s">"Data"},</span><span class="nv"> </span><span class="s">{"inputValue":</span>
<span class="s">"curl</span><span class="nv"> </span><span class="s">options"}],</span><span class="nv"> </span><span class="s">"image":</span><span class="nv"> </span><span class="s">"byrnedo/alpine-curl@sha256:548379d0a4a0c08b9e55d9d87a592b7d35d9ab3037f4936f5ccd09d0b625a342"}},</span>
<span class="s">"inputs":</span><span class="nv"> </span><span class="s">[{"name":</span><span class="nv"> </span><span class="s">"Url",</span><span class="nv"> </span><span class="s">"type":</span><span class="nv"> </span><span class="s">"URI"},</span><span class="nv"> </span><span class="s">{"default":</span><span class="nv"> </span><span class="s">"--location",</span><span class="nv"> </span><span class="s">"description":</span>
<span class="s">"Additional</span><span class="nv"> </span><span class="s">options</span><span class="nv"> </span><span class="s">given</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">curl</span><span class="nv"> </span><span class="s">bprogram.</span><span class="nv"> </span><span class="s">See</span><span class="nv"> </span><span class="s">https://curl.haxx.se/docs/manpage.html",</span>
<span class="s">"name":</span><span class="nv"> </span><span class="s">"curl</span><span class="nv"> </span><span class="s">options",</span><span class="nv"> </span><span class="s">"type":</span><span class="nv"> </span><span class="s">"string"}],</span><span class="nv"> </span><span class="s">"metadata":</span><span class="nv"> </span><span class="s">{"annotations":</span>
<span class="s">{"author":</span><span class="nv"> </span><span class="s">"Alexey</span><span class="nv"> </span><span class="s">Volkov</span><span class="nv"> </span><span class="s"><alexey.volkov@ark-kun.com>"}},</span><span class="nv"> </span><span class="s">"name":</span><span class="nv"> </span><span class="s">"Download</span>
<span class="s">data",</span><span class="nv"> </span><span class="s">"outputs":</span><span class="nv"> </span><span class="s">[{"name":</span><span class="nv"> </span><span class="s">"Data"}]}'</span><span class="pi">,</span> <span class="nv">pipelines.kubeflow.org/component_ref</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{"digest":</span>
<span class="s">"25738efc20b7c1bfeb792f872d2ccf88097f15f479a36674d712da20290bf79a",</span><span class="nv"> </span><span class="s">"url":</span>
<span class="s">"https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component.yaml"}'</span><span class="pi">,</span>
<span class="nv">pipelines.kubeflow.org/arguments.parameters</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{"Url":</span><span class="nv"> </span><span class="s">"",</span>
<span class="s">"curl</span><span class="nv"> </span><span class="s">options":</span><span class="nv"> </span><span class="s">"--location"}'</span><span class="pi">}</span>
<span class="na">labels</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">pipelines.kubeflow.org/kfp_sdk_version</span><span class="pi">:</span> <span class="nv">1.6.4</span><span class="pi">,</span> <span class="nv">pipelines.kubeflow.org/pipeline-sdk-type</span><span class="pi">:</span> <span class="nv">kfp</span><span class="pi">}</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">example-pipeline</span>
<span class="na">inputs</span><span class="pi">:</span>
<span class="na">parameters</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">url</span><span class="pi">}</span>
<span class="na">dag</span><span class="pi">:</span>
<span class="na">tasks</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download-data</span>
<span class="na">template</span><span class="pi">:</span> <span class="s">download-data</span>
<span class="na">arguments</span><span class="pi">:</span>
<span class="na">parameters</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">url</span><span class="pi">,</span> <span class="nv">value</span><span class="pi">:</span> <span class="s1">'</span><span class="s">'</span><span class="pi">}</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">merge-csv</span>
<span class="na">template</span><span class="pi">:</span> <span class="s">merge-csv</span>
<span class="na">dependencies</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">download-data</span><span class="pi">]</span>
<span class="na">arguments</span><span class="pi">:</span>
<span class="na">artifacts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">download-data-Data</span><span class="pi">,</span> <span class="nv">from</span><span class="pi">:</span> <span class="s1">'</span><span class="s">'</span><span class="pi">}</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">merge-csv</span>
<span class="na">container</span><span class="pi">:</span>
<span class="na">args</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">--file</span><span class="pi">,</span> <span class="nv">/tmp/inputs/file/data</span><span class="pi">,</span> <span class="nv">--output-csv</span><span class="pi">,</span> <span class="nv">/tmp/outputs/output_csv/data</span><span class="pi">]</span>
<span class="na">command</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">sh</span>
<span class="pi">-</span> <span class="s">-c</span>
<span class="pi">-</span> <span class="s">(PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location</span>
<span class="s">'pandas==1.1.4' || PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install</span>
<span class="s">--quiet --no-warn-script-location 'pandas==1.1.4' --user) && "$0" "$@"</span>
<span class="pi">-</span> <span class="s">sh</span>
<span class="pi">-</span> <span class="s">-ec</span>
<span class="pi">-</span> <span class="pi">|</span>
<span class="s">program_path=$(mktemp)</span>
<span class="s">printf "%s" "$0" > "$program_path"</span>
<span class="s">python3 -u "$program_path" "$@"</span>
<span class="pi">-</span> <span class="s2">"</span><span class="s">def</span><span class="nv"> </span><span class="s">_make_parent_dirs_and_return_path(file_path:</span><span class="nv"> </span><span class="s">str):</span><span class="se">\n</span><span class="nv"> </span><span class="s">import</span><span class="nv"> </span><span class="s">os</span><span class="se">\n</span><span class="nv"> </span><span class="se">\
</span> <span class="se">\ </span><span class="nv"> </span><span class="s">os.makedirs(os.path.dirname(file_path),</span><span class="nv"> </span><span class="s">exist_ok=True)</span><span class="se">\n</span><span class="nv"> </span><span class="s">return</span><span class="nv"> </span><span class="s">file_path</span><span class="se">\n\
</span> <span class="se">\n</span><span class="s">def</span><span class="nv"> </span><span class="s">merge_csv(file_path,</span><span class="se">\n</span><span class="nv"> </span><span class="s">output_csv):</span><span class="se">\n</span><span class="nv"> </span><span class="s">import</span><span class="nv"> </span><span class="s">glob</span><span class="se">\n</span><span class="nv"> </span><span class="s">import</span><span class="se">\
</span> <span class="se">\ </span><span class="s">pandas</span><span class="nv"> </span><span class="s">as</span><span class="nv"> </span><span class="s">pd</span><span class="se">\n</span><span class="nv"> </span><span class="s">import</span><span class="nv"> </span><span class="s">tarfile</span><span class="se">\n\n</span><span class="nv"> </span><span class="s">tarfile.open(name=file_path,</span><span class="nv"> </span><span class="s">mode=</span><span class="se">\"\
</span> <span class="s">r|gz</span><span class="se">\"</span><span class="s">).extractall('data')</span><span class="se">\n</span><span class="nv"> </span><span class="s">df</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">pd.concat(</span><span class="se">\n</span><span class="nv"> </span><span class="s">[pd.read_csv(csv_file,</span><span class="se">\
</span> <span class="se">\ </span><span class="s">header=None)</span><span class="nv"> </span><span class="se">\n</span><span class="nv"> </span><span class="s">for</span><span class="nv"> </span><span class="s">csv_file</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">glob.glob('data/*.csv')])</span><span class="se">\n</span><span class="nv"> </span><span class="s">df.to_csv(output_csv,</span><span class="se">\
</span> <span class="se">\ </span><span class="s">index=False,</span><span class="nv"> </span><span class="s">header=False)</span><span class="se">\n\n</span><span class="s">import</span><span class="nv"> </span><span class="s">argparse</span><span class="se">\n</span><span class="s">_parser</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">argparse.ArgumentParser(prog='Merge</span><span class="se">\
</span> <span class="se">\ </span><span class="s">csv',</span><span class="nv"> </span><span class="s">description='')</span><span class="se">\n</span><span class="s">_parser.add_argument(</span><span class="se">\"</span><span class="s">--file</span><span class="se">\"</span><span class="s">,</span><span class="nv"> </span><span class="s">dest=</span><span class="se">\"</span><span class="s">file_path</span><span class="se">\"\
</span> <span class="s">,</span><span class="nv"> </span><span class="s">type=str,</span><span class="nv"> </span><span class="s">required=True,</span><span class="nv"> </span><span class="s">default=argparse.SUPPRESS)</span><span class="se">\n</span><span class="s">_parser.add_argument(</span><span class="se">\"\
</span> <span class="s">--output-csv</span><span class="se">\"</span><span class="s">,</span><span class="nv"> </span><span class="s">dest=</span><span class="se">\"</span><span class="s">output_csv</span><span class="se">\"</span><span class="s">,</span><span class="nv"> </span><span class="s">type=_make_parent_dirs_and_return_path,</span><span class="se">\
</span> <span class="se">\ </span><span class="s">required=True,</span><span class="nv"> </span><span class="s">default=argparse.SUPPRESS)</span><span class="se">\n</span><span class="s">_parsed_args</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">vars(_parser.parse_args())</span><span class="se">\n\
</span> <span class="se">\n</span><span class="s">_outputs</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">merge_csv(**_parsed_args)</span><span class="se">\n</span><span class="s">"</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">python:3.8</span>
<span class="na">inputs</span><span class="pi">:</span>
<span class="na">artifacts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">download-data-Data</span><span class="pi">,</span> <span class="nv">path</span><span class="pi">:</span> <span class="nv">/tmp/inputs/file/data</span><span class="pi">}</span>
<span class="na">outputs</span><span class="pi">:</span>
<span class="na">artifacts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">merge-csv-output_csv</span><span class="pi">,</span> <span class="nv">path</span><span class="pi">:</span> <span class="nv">/tmp/outputs/output_csv/data</span><span class="pi">}</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">labels</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">pipelines.kubeflow.org/kfp_sdk_version</span><span class="pi">:</span> <span class="nv">1.6.4</span><span class="pi">,</span> <span class="nv">pipelines.kubeflow.org/pipeline-sdk-type</span><span class="pi">:</span> <span class="nv">kfp</span><span class="pi">}</span>
<span class="na">annotations</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">pipelines.kubeflow.org/component_spec</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{"implementation":</span><span class="nv"> </span><span class="s">{"container":</span>
<span class="s">{"args":</span><span class="nv"> </span><span class="s">["--file",</span><span class="nv"> </span><span class="s">{"inputPath":</span><span class="nv"> </span><span class="s">"file"},</span><span class="nv"> </span><span class="s">"--output-csv",</span><span class="nv"> </span><span class="s">{"outputPath":</span>
<span class="s">"output_csv"}],</span><span class="nv"> </span><span class="s">"command":</span><span class="nv"> </span><span class="s">["sh",</span><span class="nv"> </span><span class="s">"-c",</span><span class="nv"> </span><span class="s">"(PIP_DISABLE_PIP_VERSION_CHECK=1</span>
<span class="s">python3</span><span class="nv"> </span><span class="s">-m</span><span class="nv"> </span><span class="s">pip</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">--quiet</span><span class="nv"> </span><span class="s">--no-warn-script-location</span><span class="nv"> </span><span class="s">'</span><span class="s1">'</span><span class="s">pandas==1.1.4'</span><span class="s1">'</span>
<span class="s">||</span><span class="nv"> </span><span class="s">PIP_DISABLE_PIP_VERSION_CHECK=1</span><span class="nv"> </span><span class="s">python3</span><span class="nv"> </span><span class="s">-m</span><span class="nv"> </span><span class="s">pip</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">--quiet</span><span class="nv"> </span><span class="s">--no-warn-script-location</span>
<span class="s">'</span><span class="s1">'</span><span class="s">pandas==1.1.4'</span><span class="s1">'</span><span class="nv"> </span><span class="s">--user)</span><span class="nv"> </span><span class="s">&&</span><span class="nv"> </span><span class="s">\"$0\"</span><span class="nv"> </span><span class="s">\"$@\"",</span><span class="nv"> </span><span class="s">"sh",</span><span class="nv"> </span><span class="s">"-ec",</span><span class="nv"> </span><span class="s">"program_path=$(mktemp)\nprintf</span>
<span class="s">\"%s\"</span><span class="nv"> </span><span class="s">\"$0\"</span><span class="nv"> </span><span class="s">></span><span class="nv"> </span><span class="s">\"$program_path\"\npython3</span><span class="nv"> </span><span class="s">-u</span><span class="nv"> </span><span class="s">\"$program_path\"</span><span class="nv"> </span><span class="s">\"$@\"\n",</span>
<span class="s">"def</span><span class="nv"> </span><span class="s">_make_parent_dirs_and_return_path(file_path:</span><span class="nv"> </span><span class="s">str):\n</span><span class="nv"> </span><span class="s">import</span><span class="nv"> </span><span class="s">os\n</span><span class="nv"> </span><span class="s">os.makedirs(os.path.dirname(file_path),</span>
<span class="s">exist_ok=True)\n</span><span class="nv"> </span><span class="s">return</span><span class="nv"> </span><span class="s">file_path\n\ndef</span><span class="nv"> </span><span class="s">merge_csv(file_path,\n</span><span class="nv"> </span><span class="s">output_csv):\n</span><span class="nv"> </span><span class="s">import</span>
<span class="s">glob\n</span><span class="nv"> </span><span class="s">import</span><span class="nv"> </span><span class="s">pandas</span><span class="nv"> </span><span class="s">as</span><span class="nv"> </span><span class="s">pd\n</span><span class="nv"> </span><span class="s">import</span><span class="nv"> </span><span class="s">tarfile\n\n</span><span class="nv"> </span><span class="s">tarfile.open(name=file_path,</span>
<span class="s">mode=\"r|gz\").extractall('</span><span class="s1">'</span><span class="s">data'</span><span class="s1">'</span><span class="s">)\n</span><span class="nv"> </span><span class="s">df</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">pd.concat(\n</span><span class="nv"> </span><span class="s">[pd.read_csv(csv_file,</span>
<span class="s">header=None)</span><span class="nv"> </span><span class="s">\n</span><span class="nv"> </span><span class="s">for</span><span class="nv"> </span><span class="s">csv_file</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">glob.glob('</span><span class="s1">'</span><span class="s">data/*.csv'</span><span class="s1">'</span><span class="s">)])\n</span><span class="nv"> </span><span class="s">df.to_csv(output_csv,</span>
<span class="s">index=False,</span><span class="nv"> </span><span class="s">header=False)\n\nimport</span><span class="nv"> </span><span class="s">argparse\n_parser</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">argparse.ArgumentParser(prog='</span><span class="s1">'</span><span class="s">Merge</span>
<span class="s">csv'</span><span class="s1">'</span><span class="s">,</span><span class="nv"> </span><span class="s">description='</span><span class="s1">'</span><span class="s">'</span><span class="s1">'</span><span class="s">)\n_parser.add_argument(\"--file\",</span><span class="nv"> </span><span class="s">dest=\"file_path\",</span>
<span class="s">type=str,</span><span class="nv"> </span><span class="s">required=True,</span><span class="nv"> </span><span class="s">default=argparse.SUPPRESS)\n_parser.add_argument(\"--output-csv\",</span>
<span class="s">dest=\"output_csv\",</span><span class="nv"> </span><span class="s">type=_make_parent_dirs_and_return_path,</span><span class="nv"> </span><span class="s">required=True,</span>
<span class="s">default=argparse.SUPPRESS)\n_parsed_args</span><span class="nv"> </span><span class="s">=</span><span class="nv"> </span><span class="s">vars(_parser.parse_args())\n\n_outputs</span>
<span class="s">=</span><span class="nv"> </span><span class="s">merge_csv(**_parsed_args)\n"],</span><span class="nv"> </span><span class="s">"image":</span><span class="nv"> </span><span class="s">"python:3.8"}},</span><span class="nv"> </span><span class="s">"inputs":</span><span class="nv"> </span><span class="s">[{"name":</span>
<span class="s">"file",</span><span class="nv"> </span><span class="s">"type":</span><span class="nv"> </span><span class="s">"Tarball"}],</span><span class="nv"> </span><span class="s">"name":</span><span class="nv"> </span><span class="s">"Merge</span><span class="nv"> </span><span class="s">csv",</span><span class="nv"> </span><span class="s">"outputs":</span><span class="nv"> </span><span class="s">[{"name":</span><span class="nv"> </span><span class="s">"output_csv",</span>
<span class="s">"type":</span><span class="nv"> </span><span class="s">"CSV"}]}'</span><span class="pi">,</span> <span class="nv">pipelines.kubeflow.org/component_ref</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{}'</span><span class="pi">}</span>
<span class="na">arguments</span><span class="pi">:</span>
<span class="na">parameters</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span><span class="nv">name</span><span class="pi">:</span> <span class="nv">url</span><span class="pi">}</span>
<span class="na">serviceAccountName</span><span class="pi">:</span> <span class="s">pipeline-runner</span>
</code></pre></div></div>
<p>We can easily upload this to Kubeflow and we technically have a pipeline.</p>
<p>It is important to mention at this point that this is the foundation of our MLOps pipeline: we can chain various pieces of code and stages together, just as above, for our running Kubeflow environment.</p>
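<p>For context, the YAML above is not written by hand: it is generated by the Kubeflow Pipelines SDK from a small Python function (the <code class="language-plaintext highlighter-rouge">Merge csv</code> component embedded in the manifest), which extracts a tarball of CSV files and concatenates them. The generated component uses pandas; here is a minimal sketch of the same logic using only the standard library (file and directory names are illustrative):</p>

```python
# Standard-library sketch of the "Merge csv" component embedded in the
# generated YAML above (the real component uses pandas.concat).
import csv
import glob
import os
import tarfile

def merge_csv(file_path: str, output_csv: str) -> None:
    # Extract the tarball of CSV files into a working directory,
    # mirroring tarfile.open(...).extractall('data') in the component.
    with tarfile.open(name=file_path, mode="r|gz") as tar:
        tar.extractall("data")
    os.makedirs(os.path.dirname(output_csv) or ".", exist_ok=True)
    # Concatenate every extracted CSV row-by-row into one output file.
    with open(output_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for csv_file in sorted(glob.glob("data/*.csv")):
            with open(csv_file, newline="") as f:
                writer.writerows(csv.reader(f))
```

<p>Compiling a pipeline of such components with the SDK is what produces manifests like the one above, which can then be uploaded to the Kubeflow UI.</p>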
<p>Having set up our environment, our focus now is on continuous integration and deployment of our application and pipeline. I chose a tool called <code class="language-plaintext highlighter-rouge">ArgoCD</code> for continuous deployment while using <code class="language-plaintext highlighter-rouge">GitHub Actions</code> for continuous integration. I believe separating these two activities is more or less a best practice for DevOps, so much so that the industry coined the word <code class="language-plaintext highlighter-rouge">GitOps</code> for the process of continuous deployment adopted by ArgoCD. To get the gist of what ArgoCD will do: continuous integration has technically been standardized to end at either publishing your app as a package somewhere or pushing your app as a Docker container to a container registry. Once this process concludes, we can then update <em>the app registry</em> on GitHub, which is connected to a continuous deployment application running somewhere (in our case ArgoCD running on the Kubernetes cluster on which our app will run), and this update triggers automatic deployment of our application by pulling the latest Docker container and running it on our specified cluster.</p>
<p>This is not only useful for our MLOps pipeline (basically the model training and deployment part); it is also important for the <em>DataOps</em> stage, where we don’t necessarily have to connect it to the Kubeflow pipeline. By this stage, I mean the feature store management, which will be deployed predominantly as an application using <code class="language-plaintext highlighter-rouge">Feast</code>.</p>
<p>In order not to overcrowd this portion of the article, I will leave setting up ArgoCD and its usage to Part 4 of this set of articles. In the next two articles I will be deploying a Feast feature store to Kubernetes in a CI/CD fashion to mimic how the machine learning engineer helps cater for differing features of different datasets. This also fundamentally lays the foundation for how we will manage several datasets for several models to be run in production.</p>
<p>Conclusion:</p>
<p>We have come a long way in setting up our orchestration engine. Note that the process is very dependent on how much resource you have locally; I used a 16GB RAM macOS machine for this setup. In case you are unable to do a local deployment, you can leverage a cloud-managed Kubernetes cluster, which largely follows the same deployment approach; in fact, most cloud providers will manage the Kubernetes cluster for you from setup.</p>
<blockquote>
<p>If your organization is intending to scale your machine learning pipeline, or is having difficulty taking machine learning models into production properly, you can email me at adekunleba@gmail.com for some guidance. I will be more than willing to listen and provide information that will most likely be of help.</p>
</blockquote>Adekunle Babatundeadekunleba@gmail.comSetting up an Orchestration Engine for Machine learning operations with Kubernetes and Kubeflow.Machine Learning Operations At Scale (part 1)2021-05-30T00:00:00-07:002021-05-30T00:00:00-07:00https://adekunleba.github.io/Machine-learning-Operations-at-scale-(Part-1)<h3 id="introduction-to-mlops">Introduction to MLOPs</h3>
<p>Hello once again, and welcome to my blog where I write about technology things I have found impressive over time.</p>
<p>I am currently at an exciting stage, following along with the current wave of <strong>Machine Learning Operations (MLOps)</strong>. For those new to the space, I will start with a quick definition of what MLOps is.</p>
<p>In non-technical terms, think of MLOps as the process of deploying tens to thousands of machine learning models for an organization or for a problem. Now, if you are an organization whose machine learning models are getting beyond a cap of, say, 10 models running in production, then there is a good chance you will need an MLOps practice to keep track of, as well as manage, the models.</p>
<p>For an official definition of MLOps, a perfect one is the definition given by <strong>Wikipedia</strong>:</p>
<blockquote>
<p>MLOps or ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The word is a compound of “machine learning” and the continuous development practice of DevOps in the software field</p>
</blockquote>
<p>The two main points here are <em>deployment</em> and <em>maintenance</em>. Although I didn’t cater for maintenance in my non-technical definition, I will say that for any software that is deployed, rarely would you see it not maintained; hence the culture of maintenance is logically infused into MLOps by virtue of it being a <em>process of deployment</em>.</p>
<h3 id="traditional-machine-learning-approach-vs-dataops-infused-mlops">Traditional Machine learning approach vs DataOps infused MLOps.</h3>
<p>Wondering what new term we are adding to MLOps? Well, you got it: DataOps. It is quite similar to MLOps, and usually I prefer to think of it the same way: it is the process of managing data deployments for machine learning.</p>
<p>We all know machine learning models are rarely a thing without the data. However, for easy deployment and maintenance of models, organisations needed a way to manage the data as well as the features the models are trained on, giving rise to the need for DataOps. I will largely not delve into DataOps as a thing on its own; I will take it up as part of the MLOps pipeline. As a side note, it is often safe, though potentially unnerving, to try to separate the feature engineering management and the model deployment of a model. You can refer to this <a href="https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9">article</a> for an intuition of doing that perfectly.</p>
<p>In the traditional machine learning approach, data scientists are (possibly) downloading data from CSV files, or some data engineer is dumping cleaned data to a database table; then the data scientist opens up a notebook and runs several iterations of data exploration, feature engineering and model building. At the end of that cycle, the data scientist often needs to trace back to see which of the steps actually gave a good model; thanks to the concept of a Pipeline as present in <em>sklearn</em> and <em>spark-ml</em>, we can chain that process and effectively track and manage changes. Nevertheless, beyond model building is using the model in production, and that’s where the challenging part of the work comes into play. We need to build, say, a Flask application on top of our saved model, ensure that the feature engineering code is right for the incoming data, and possibly create a Docker environment and host our service as a microservice, or maybe deploy to a local environment (refer to my article on <a href="https://adekunleba.github.io/Machine-learning-model-deployment-with-cpp-Part-1/">model deployment with C++ for example</a>). But here is where things get tricky: because a data scientist is not entirely proficient in the skills of managing applications in production, the likelihood of not properly guarding against the following questions is high:</p>
<ul>
<li>How are we going to know when the model starts degrading?</li>
<li>How do we know whether we are making good predictions on data or not?</li>
<li>How often should we retrain the model?</li>
<li>How do we move from one deployed model to another project while tracking the old project’s performance?</li>
</ul>
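<p>To make the Pipeline idea mentioned above concrete, here is a toy, dependency-free sketch of the chaining pattern that libraries like sklearn formalize: each step exposes <code class="language-plaintext highlighter-rouge">fit</code>/<code class="language-plaintext highlighter-rouge">transform</code>, and the pipeline threads data through the steps in order (the class and step names here are illustrative, not sklearn’s actual API):</p>

```python
# A toy version of the "Pipeline" chaining pattern popularized by
# sklearn and spark-ml. Names are illustrative; sklearn's real
# Pipeline offers much more (parameter grids, caching, etc.).
class Standardize:
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]

class ClipNegatives:
    def fit(self, xs):
        return self

    def transform(self, xs):
        return [max(x, 0.0) for x in xs]

class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, xs):
        # Each step is fit on the output of the previous step, so the
        # whole chain can be tracked and re-run as one unit.
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return xs

pipe = Pipeline([Standardize(), ClipNegatives()])
print(pipe.fit_transform([1.0, 2.0, 3.0]))  # [0.0, 0.0, 1.0]
```

<p>Because the whole chain is one object, re-running an experiment or tracing which step changed becomes a matter of re-running or inspecting the pipeline, rather than reconstructing notebook state.</p>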
<p>This leads us to the space of MLOps, where data scientists are basically left to experiment with their models and produce a good machine learning script, which the machine learning engineer or operations person can then scale with adequate monitoring and a retraining process, leaving the data scientist to do what they do best: exploring data and building models that work for <strong>several</strong> machine learning tasks.</p>
<h3 id="the-place-of-cicd-in-mlops">The place of CI/CD in MLOps</h3>
<p>Beyond being able to take a single piece of code and deploy it, there is also the process of testing, continuous integration and automatic deployment. While these processes are largely still evolving for machine learning operations, there are a few tricks that tend to work out of the box for now.</p>
<p>Continuous integration for machine learning code is actually quite tricky, because the heartbeat of continuous integration is code tests that ensure that, upon adding new things to the code, nothing is broken. However, machine learning tests are hard, because it is usually unclear what to test.</p>
<p>Nevertheless, we will see in this series of articles some ideas for code testing in machine learning operations.</p>
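<p>To give a flavour of what such tests can look like, here is a sketch with a stand-in scoring function (the model, feature names and weights are made up for illustration): rather than asserting exact predictions, we assert <em>properties</em> the model should satisfy, such as output range, determinism, and monotonicity encoded from domain knowledge.</p>

```python
# Property-style tests for a model, illustrating what "ML tests" can
# check when exact outputs are unknowable. The scoring function below
# is a stand-in; in practice you would load your trained model.
def model_predict(features):
    # Toy scoring function standing in for a real model.
    score = 0.3 * features["age"] / 100 + 0.7 * features["income"] / 1e5
    return min(max(score, 0.0), 1.0)

def test_output_in_valid_range():
    score = model_predict({"age": 40, "income": 50000})
    assert 0.0 <= score <= 1.0

def test_deterministic():
    features = {"age": 40, "income": 50000}
    assert model_predict(features) == model_predict(features)

def test_monotonic_in_income():
    # Domain knowledge encoded as a test: a higher income should not
    # lower the score in this toy setup.
    low = model_predict({"age": 40, "income": 30000})
    high = model_predict({"age": 40, "income": 60000})
    assert high >= low

test_output_in_valid_range()
test_deterministic()
test_monotonic_in_income()
```

<p>Tests like these can run in CI on every commit without needing labeled holdout data, which makes them a practical starting point before heavier checks such as evaluation-metric regression gates.</p>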
<h3 id="toolings">Toolings.</h3>
<p>I think the most important tool for MLOps is usually the orchestration engine that manages the pipeline, which is the reason why tools such as Airflow, Kubeflow and MLflow dominate the tooling landscape. Also, each aspect of the machine learning pipeline has its own specific tools, a list which is still growing. There is a <a href="https://github.com/kelvins/awesome-mlops">GitHub awesome repo</a> for MLOps that is worth looking at.</p>
<p>In this series of articles, I will be building a sample production application, bringing together the various aspects of MLOps with some of the tools used at a high-scale production level.</p>
<p>PS: By all means, you can decide to scale the processes involved here down or up depending on the business case or organizational capacity.</p>
<h3 id="conclusion">Conclusion</h3>
<p>This is a basic introduction to a series of articles that covers MLOps and how to do it.</p>
<p>Our first article will look at setting up a Kubeflow environment for our machine learning operations project.</p>
<blockquote>
<p>If your organization is intending to scale your machine learning pipeline, or is having difficulty taking machine learning models into production properly, you can email me at adekunleba@gmail.com for some guidance. I will be more than willing to listen and provide information that will most likely be of help.</p>
</blockquote>Adekunle Babatundeadekunleba@gmail.comIntroduction to MLOPsRethinking Data Engineering At Scale2021-05-25T00:00:00-07:002021-05-25T00:00:00-07:00https://adekunleba.github.io/Rethinking-data-engineering-at-scale<p>I started 2021 with a focus on establishing myself in data engineering and building pipelines both for data science and machine learning operations. My last article was entirely me evaluating what goes on in the world of data engineering, the tools I have seen organizations use and whether there is a case for build your own vs use an open source tool.</p>
<p>I recently had to do some data engineering work, and as advised in my earlier post, companies can initially start out with open source tools and, if needed or based on organizational requirements, transition into building their in-house data extraction tool.</p>
<p>After much research, and since the focus is on open source, I settled on two options: Apache Nifi and Apache Gobblin.</p>
<p>Apache Nifi is not deemed an ETL tool but rather a tool for moving data at scale between systems. If we look at it from that point of view and at what we do with data extraction, Apache Nifi looks like a valid tool for extracting data from various sources and moving it, i.e. the loading part of ETL. However, Apache Nifi is not so great at data transformation. Therefore I favour following the path of ELT when using Apache Nifi.</p>
<p>Apache Gobblin on the other hand is a new guy in the club of data extraction. It is a distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.</p>
<p>The major difference between the two as of now is that Apache Nifi has been around for a long time, approximately 10 years, while Gobblin is still quite new. Furthermore, Nifi is entirely a GUI drag-and-drop approach to building the data integration system while Gobblin is not; this is not to say, however, that Gobblin is much more flexible for developers compared to Apache Nifi.</p>
<p>In terms of beginner friendliness, much as I haven’t worked that much with Apache Gobblin, I found Apache Nifi slightly more beginner friendly, and the fact that Apache Gobblin is only programmable in Java imposes some limitations.</p>
<p>To add to this article is a big lesson learnt while working with Apache Nifi.</p>
<h4 id="apache-nifi-integration-with-apache-kafka">Apache Nifi Integration with Apache Kafka.</h4>
<p>In the world of ELT, there is a very high chance that Kafka will be in the loop, either as a temporary data storage for replication or as a general cache engine/data distribution engine.</p>
<p>Since Apache Nifi is a GUI-based, drag-and-drop configuration approach to building a data extraction pipeline, the most important part of the project is usually the point at which you properly input the right configuration while dragging and dropping your processors.</p>
<p>I will write an introductory article to Apache Nifi where I will detail the various components of the engine.</p>
<h4 id="specifics-of-apache-nifi-and-apache-kafka-integration-that-worked">Specifics of Apache Nifi and Apache Kafka integration that worked.</h4>
<p>My aim here is to document a scalable integration between Apache Nifi and Apache Kafka that worked:</p>
<ol>
<li>
<p>The use of a schema registry, in my case <code class="language-plaintext highlighter-rouge">ConfluentSchemaRegistry</code>, made things quite smooth. I had initially tried to save the schema alongside the record, but that didn’t quite work. With the registry approach, we can basically access the schema using the <code class="language-plaintext highlighter-rouge">Schema Name</code> property.</p>
</li>
<li>
<p>When consuming Kafka records, I favour ConsumeKafka over ConsumeKafkaRecord. The major difference between the two is that ConsumeKafkaRecord allows you to read the message with a schema. The challenge comes when you are not sure about the schema name and have to supply it dynamically; in that case you can just extract the record from Kafka using ConsumeKafka and then have the flexibility of processing it on your own terms.</p>
</li>
<li>
<p>Also, when consuming data, you want to set the offset of your consumer to <code class="language-plaintext highlighter-rouge">earliest</code>; it will save a lot of debugging. You would expect consumption to happen immediately you fire up (start) your consumer, but because the offset defaults to <code class="language-plaintext highlighter-rouge">latest</code>, it only fires on new messages arriving after connecting. This might not be very obvious to someone just starting out with Apache Nifi.</p>
</li>
<li>
<p>Additionally, you can TailFile a log file to use as an example data source, a very interesting trick taken from this article <a href="https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries">here</a>.</p>
</li>
</ol>Adekunle Babatundeadekunleba@gmail.comI started 2021 with a focus on establishing myself in data engineering and building pipelines both for data science and machine learning operations. My last article was entirely me evaluating what goes on in the world of data engineering, the tools I have seen organizations use and whether there is a case for build your own vs use an open source tool.Scalable Data Engineering: A Case For Build Your Own Platform2021-01-30T00:00:00-08:002021-01-30T00:00:00-08:00https://adekunleba.github.io/Scalable-Data-Engineering:-a-case-for-build-your-own-platform<p>Delving deep into the realm of data engineering, most especially using Scala as the programming language of choice, there seems to be one very basic question for new entrants into the field; this applies both to the company trying to set up a data engineering platform and the engineer/consultant trying to propose a way to do this effectively: how do I build my data platform right?</p>
<p>It is important to note that there are tools out there that help consolidate data quickly for companies who don’t mind outsourcing the majority of their data engineering pipeline; however, I have come to notice that a lot of bigger and deeply technology-focused companies tend to build something specific to their use cases.</p>
<p><strong>Here are the examples I have seen:</strong></p>
<p>Tesla built its own data engineering platform; there is a beautiful talk and article around this from Colin Breck, <a href="https://blog.colinbreck.com/the-state-of-the-art-for-iot/">here</a>.</p>
<p>LinkedIn, Pluralsight and Grammarly also have their own data engineering platforms.
This raises the question: is it worth building your own engine?</p>
<p>I will say this really depends, <em>but there is a joy of flexibility and long-term manageability</em> in building a personalized data engineering platform if you are a reasonably medium-size software company.</p>
<p>There is the ability to integrate it perfectly into your own applications, as well as the ability to control the direction of your overall technology.</p>
<p>Need I also mention, though, that there is a cost and time overhead to this; but the joy is when you build something incredible that you can use across the board and even build bigger applications on. There is also a chance that the engineering experience will show you where you stand as a company in the technology space.</p>
<p>Now for a beginner trying to enter the landscape of data engineering: the buzz is usually to know SQL and the like, but beyond that, you need to be able to monitor and <strong>control your data ingestion layer as well as the transformation layer</strong>. And here is where the flexibility we talked about comes into play. If you have a very large user base, with several microservices working to serve your users, e.g. the likes of LinkedIn, Tesla or Netflix, then usually there is no ingestion vendor that will likely work for your overall architecture requirements. <em>PS: This should however not be taken at face value.</em></p>
<p>Nevertheless, when starting out in the data engineering lifecycle of a company, one can leverage available tools or vendors to start, but should definitely put in place a plan to build the organization’s own scalable engine as the company’s requirements continue to grow.</p>
<p>Architecting the solution for a scalable data engineering platform is another challenge. It is easy to write the software to adapt entirely to your own use case, making it less agnostic and adaptable to differing cases. The most common issues are:</p>
<ul>
<li>I may need to know beforehand the schema(s) of my incoming data so that I can model my application as such.</li>
<li>If the incoming data schema changes, or there is a new field to be captured, does that mean we need to redeploy our data platform?</li>
</ul>
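<p>One common mitigation for both issues above is to treat the schema as data rather than code: infer it from incoming records and evolve it as new fields appear, so that a schema change does not force a redeploy. A toy sketch of the idea (the function names and the type model here are illustrative, not taken from any particular tool):</p>

```python
# Treat the schema as data: infer field -> type from incoming records
# and widen it as new fields appear, instead of hard-coding a schema
# into the application. Names here are illustrative.
def infer_types(record):
    return {field: type(value).__name__ for field, value in record.items()}

def evolve_schema(schema, record):
    # Add newly-seen fields; flag type conflicts instead of crashing.
    for field, tname in infer_types(record).items():
        if field not in schema:
            schema[field] = tname
        elif schema[field] != tname:
            schema[field] = "conflict:%s|%s" % (schema[field], tname)
    return schema

schema = {}
for rec in [{"id": 1, "name": "a"}, {"id": 2, "name": "b", "email": "b@x.com"}]:
    schema = evolve_schema(schema, rec)
print(schema)  # {'id': 'int', 'name': 'str', 'email': 'str'}
```

<p>Production systems push this further with schema registries and compatibility rules, but the core design choice is the same: the ingestion code stays generic while the schema lives alongside the data.</p>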
<p>I have seen several solutions that tend to solve this problem in part. LinkedIn released Gobblin, such an impressive project, but it only helps scale your data ingestion processes; you still need to manually write and deploy the data source connection and the inner processing of the data.</p>
<p>Hydra is Pluralsight’s tool that also tends to manage schema reusability and updates while sending data to various sinks. This approach has the overhead that the client needs to manage pushing the data through an API.</p>
<p>The challenge still remains, and I believe there is still some knowledge to be developed on how to effectively build a solution where, once you submit a job, you can easily schedule it and everybody is happy.</p>Adekunle Babatundeadekunleba@gmail.comDelving deep into the realm of Data engineering, most especially using Scala as the programming language of choice, there seems to be a very basic thing for new entrants into the field, this applies both to the company trying to set up a data engineering platform and the engineer/consultant trying to propose a way to do this effectively. The question of how do I build my data platform right.Machine Learning Model Deployment With Cpp Part 22019-12-04T00:00:00-08:002019-12-04T00:00:00-08:00https://adekunleba.github.io/Machine-learning-model-deployment-with-cpp-Part-2<p>Picking up from my initial article where I built a PCA model using C++, in this article I will be loading the saved model whose values are stored in a YAML file. This script will be developed to form an inference engine bundled as a <code class="language-plaintext highlighter-rouge">.so</code> file for deployment on a Linux-based environment. For other platforms, the C++ code can be compiled to produce either a <code class="language-plaintext highlighter-rouge">dll</code> or <code class="language-plaintext highlighter-rouge">dylib</code> for Windows and Mac respectively.
Since we are more concerned about deploying the model in an Android application, the focus will be on building the <code class="language-plaintext highlighter-rouge">.so</code> inference engine from our saved model.</p>
<h2 id="building-so-inference-file">Building <code class="language-plaintext highlighter-rouge">.so</code> inference file.</h2>
<h3 id="load-a-pca-model">Load a PCA model</h3>
<p>For inference, we let OpenCV load the existing model from the saved <code class="language-plaintext highlighter-rouge">.yml</code> file, after which we feed the mean, eigenvectors and eigenvalues to a new <code class="language-plaintext highlighter-rouge">PCA</code> object, on which we can then call <code class="language-plaintext highlighter-rouge">project</code> to create a new image’s projection.</p>
<p>Here is a sample code that loads a saved OpenCV FileStorage model.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//Declare an empty model.</span>
<span class="n">cv</span><span class="o">::</span><span class="n">PCA</span> <span class="n">newPcaModel</span><span class="p">;</span>
<span class="n">cv</span><span class="o">::</span><span class="n">PCA</span> <span class="n">Facepca</span><span class="o">::</span><span class="n">loadmodel</span><span class="p">(</span><span class="n">cv</span><span class="o">::</span><span class="n">PCA</span> <span class="n">newPcaModel</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">filename</span><span class="p">){</span>
<span class="n">cv</span><span class="o">::</span><span class="n">FileStorage</span> <span class="n">fs</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span><span class="n">cv</span><span class="o">::</span><span class="n">FileStorage</span><span class="o">::</span><span class="n">READ</span><span class="p">);</span>
<span class="n">fs</span><span class="p">[</span><span class="s">"mean"</span><span class="p">]</span> <span class="o">>></span> <span class="n">newPcaModel</span><span class="p">.</span><span class="n">mean</span> <span class="p">;</span>
<span class="n">fs</span><span class="p">[</span><span class="s">"e_vectors"</span><span class="p">]</span> <span class="o">>></span> <span class="n">newPcaModel</span><span class="p">.</span><span class="n">eigenvectors</span> <span class="p">;</span>
<span class="n">fs</span><span class="p">[</span><span class="s">"e_values"</span><span class="p">]</span> <span class="o">>></span> <span class="n">newPcaModel</span><span class="p">.</span><span class="n">eigenvalues</span> <span class="p">;</span>
<span class="n">fs</span><span class="p">.</span><span class="n">release</span><span class="p">();</span>
<span class="k">return</span> <span class="n">newPcaModel</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">newPcaModel</span> <span class="o">=</span> <span class="n">loadmodel</span><span class="p">(</span><span class="n">newPcaModel</span><span class="p">,</span> <span class="s">"path-to-saved-yml-file"</span><span class="p">);</span> <span class="c1">//assign the return value, since the PCA is passed by value</span>
</code></pre></div></div>
<p>Once the model is loaded, the <code class="language-plaintext highlighter-rouge">newPcaModel</code> object now contains the saved model from the existing training. Hence when a face projection is done, it is guaranteed that the data returned is relative to the training dataset.</p>
<h3 id="create-new-image-preprocessing-and-prediction-stage">Create new image preprocessing and prediction stage.</h3>
<p>During inference of a machine learning model, it is important that the incoming image also passes through the same preprocessing as the training dataset.
Also, several approaches can be used to pass an image to the inference engine: it could be loaded from disk, or passed as a base64 string.
In our case, the likely approach is the base64 string, taking two factors into consideration: that a <code class="language-plaintext highlighter-rouge">jni</code> will be exposed, and that the final target is an android application.</p>
<p>With this in mind, we then need to ensure that we are able to retrieve an image from a base64 string and send it to OpenCV.</p>
<p>Decoding a base64 string in c++ is non-trivial; the reader can refer to <a href="https://renenyffenegger.ch/notes/development/Base64/Encoding-and-decoding-base-64-with-cpp">this link</a> for a code snippet that does it.</p>
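<p>For orientation, the core of such a decoder fits in a few lines. Below is a minimal standard-C++ sketch (the helper name <code class="language-plaintext highlighter-rouge">base64decode</code> is hypothetical, not from the article; it assumes the standard alphabet and valid input, and skips padding, whitespace and URL-safe variants, so prefer the linked snippet for production):</p>

```cpp
#include <string>
#include <vector>
#include <cassert>

// Decode a base64 string into raw bytes. Minimal sketch: standard
// alphabet only; '=' padding and unknown characters are skipped.
std::vector<unsigned char> base64decode(const std::string &in) {
    auto value = [](char c) -> int {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1; // '=' padding or invalid characters are skipped
    };
    std::vector<unsigned char> out;
    int buffer = 0, bits = 0;
    for (char c : in) {
        int v = value(c);
        if (v < 0) continue;
        buffer = (buffer << 6) | v; // accumulate 6 bits per symbol
        bits += 6;
        if (bits >= 8) {            // a full byte is available
            bits -= 8;
            out.push_back(static_cast<unsigned char>((buffer >> bits) & 0xFF));
            buffer &= (1 << bits) - 1; // drop the emitted byte
        }
    }
    return out;
}
```

<p>The decoded bytes are exactly what gets copied into the vector of uchar that OpenCV consumes in the next step.</p>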
<p>Once the base64 image string is decoded, we convert the string to a vector of unsigned chars (uchar), which can be thought of as the raw image values.
OpenCV can decode the vector of uchar into an image using the function call to <code class="language-plaintext highlighter-rouge">cv::imdecode(vectorUchar, flag)</code>. This process returns a <code class="language-plaintext highlighter-rouge">Mat</code> image with which further preprocessing can be done.</p>
<p>The image can then pass through the phases of</p>
<ul>
<li>Face extraction</li>
<li>Convert cropped faces to gray scale.</li>
<li>Image resize</li>
<li>Image normalization</li>
<li>Create data matrix</li>
</ul>
<p>These steps are just as described in the first part of the article.
The last leg of inference on the new image is using the <em>loaded</em> pca object to create a projection of the face in the new image.
The snippet that does that will look like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">newPcaModel</span><span class="o">-></span><span class="n">project</span><span class="p">(</span><span class="n">datamatrix</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
</code></pre></div></div>
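<p>Under the hood, <code class="language-plaintext highlighter-rouge">project</code> computes the product of the eigenvector rows with the mean-centred sample, i.e. y = W·(x − mean). A plain standard-C++ sketch of that linear algebra (the helper name <code class="language-plaintext highlighter-rouge">pcaProject</code> is hypothetical; it assumes eigenvectors are stored one per row, as <code class="language-plaintext highlighter-rouge">cv::PCA</code> does):</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y[k] = sum_j W[k][j] * (x[j] - mean[j]) -- the mean-centred sample
// multiplied by the eigenvector matrix, one eigenvector per row.
std::vector<float> pcaProject(const std::vector<std::vector<float>> &W,
                              const std::vector<float> &mean,
                              const std::vector<float> &x) {
    std::vector<float> y(W.size(), 0.0f);
    for (std::size_t k = 0; k < W.size(); ++k)
        for (std::size_t j = 0; j < mean.size(); ++j)
            y[k] += W[k][j] * (x[j] - mean[j]);
    return y;
}
```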
<p>Recognition or verification happens when you project two faces with a <em>loaded</em> pca object and compare the projections (eigenfaces) using a distance metric, e.g. Euclidean distance or cosine similarity.</p>
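<p>The comparison itself reduces to a distance between two projection vectors. A small standard-C++ sketch of both metrics (helper names are hypothetical; which threshold counts as a "match" is application-specific and not specified here):</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two projections: smaller means more alike.
float euclideanDistance(const std::vector<float> &a, const std::vector<float> &b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Cosine similarity between two projections: closer to 1 means more alike.
float cosineSimilarity(const std::vector<float> &a, const std::vector<float> &b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```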
<h3 id="package-the-library-with-an-exposed-jni">Package the library with an exposed jni</h3>
<p>This part is quite straightforward: once the inference code has been properly structured, most probably within a class, the jni body can literally call the exposed functions for model loading, preprocessing and prediction.
However, to create a jni, it is important to understand how the link occurs with java.
The first step is to create a Java class declaring the functions you want to use, along with their input parameters.
The jni derives the cpp function names from these declarations, so consistency between the class and method names in Java and the names in cpp is very important.</p>
<p>Let us assume the method we will be using for our pca feature match is named <code class="language-plaintext highlighter-rouge">matchpcafeatures()</code> and our java class can then look like this.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">package</span> <span class="nn">com.example.code</span><span class="o">;</span>
<span class="kd">class</span> <span class="nc">MatchFeatures</span> <span class="o">{</span>
<span class="kd">static</span> <span class="kd">native</span> <span class="kt">float</span> <span class="nf">matchpcafeatures</span><span class="o">(</span><span class="nc">String</span> <span class="n">modelfilename</span><span class="o">,</span> <span class="nc">String</span> <span class="n">image</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">projectionToCompare</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>With the above java class and method, our jni header file will look something like this.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="cp">#include <jni.h>
</span><span class="cm">/* Header for class com_example_code_MatchFeatures */</span>
<span class="cp">#ifndef _Included_com_example_code_MatchFeatures
#define _Included_com_example_code_MatchFeatures
#ifdef __cplusplus
</span><span class="k">extern</span> <span class="s">"C"</span> <span class="p">{</span>
<span class="cp">#endif
</span><span class="n">JNIEXPORT</span> <span class="n">jfloat</span> <span class="n">JNICALL</span> <span class="n">Java_com_example_code_MatchFeatures_matchpcafeatures</span>
<span class="p">(</span><span class="n">JNIEnv</span> <span class="o">*</span><span class="p">,</span> <span class="n">jobject</span><span class="p">,</span> <span class="n">jstring</span><span class="p">,</span> <span class="n">jstring</span><span class="p">,</span> <span class="n">jfloatArray</span><span class="p">);</span>
<span class="cp">#ifdef __cplusplus
</span><span class="p">}</span>
<span class="cp">#endif
#endif
</span>
</code></pre></div></div>
<p>You don’t need to bother yet about the details of the <code class="language-plaintext highlighter-rouge">extern "C"</code> block; the focus is on the name of the method in the jni header file.
Further down, we will look at how to use the java code above to make an inference. Let us first develop the jni bridge for this method.
The last 3 parameters of the jni header are the same, and in the exact position, as the parameters in the java method.</p>
<p>Therefore, we will process those parameters, as they are the key requirements from a client to our c++ inference engine.</p>
<p>Below shows how to connect those input parameters and return a value which the java code can take and continue its other processes.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="cm">/**
* Match features of pca
* */</span>
<span class="n">JNIEXPORT</span> <span class="n">jfloat</span> <span class="n">JNICALL</span> <span class="nf">Java_com_example_code_MatchFeatures_matchpcafeatures</span>
<span class="p">(</span><span class="n">JNIEnv</span> <span class="o">*</span> <span class="n">env</span><span class="p">,</span> <span class="n">jobject</span> <span class="n">obj</span><span class="p">,</span> <span class="n">jstring</span> <span class="n">pcafilename</span><span class="p">,</span> <span class="n">jstring</span> <span class="n">imagestring</span><span class="p">,</span> <span class="n">jfloatArray</span> <span class="n">projectionToCompare</span><span class="p">){</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pcastring_char</span><span class="p">;</span>
<span class="n">pcastring_char</span> <span class="o">=</span> <span class="n">env</span><span class="o">-></span><span class="n">GetStringUTFChars</span><span class="p">(</span><span class="n">pcafilename</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">pcastring_char</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">imagestring_char</span><span class="p">;</span>
<span class="n">imagestring_char</span> <span class="o">=</span> <span class="n">env</span><span class="o">-></span><span class="n">GetStringUTFChars</span><span class="p">(</span><span class="n">imagestring</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">imagestring_char</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">//Get file name string as a string for cpp</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">stdfilename</span><span class="p">(</span><span class="n">pcastring_char</span><span class="p">);</span>
<span class="n">cv</span><span class="o">::</span><span class="n">PCA</span> <span class="n">pca</span><span class="p">;</span>
<span class="c1">//Class InferencPca holds the preprocesing and inference methods</span>
<span class="n">InferencePca</span> <span class="n">ef</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">></span> <span class="n">imagevec</span><span class="p">;</span>
<span class="c1">//Get image as base64</span>
<span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">image</span> <span class="o">=</span> <span class="n">ef</span><span class="p">.</span><span class="n">readBase64Image</span><span class="p">(</span><span class="n">imagestring_char</span><span class="p">);</span>
<span class="n">ef</span><span class="p">.</span><span class="n">loadmodel</span><span class="p">(</span><span class="n">pca</span><span class="p">,</span> <span class="n">stdfilename</span><span class="p">);</span>
<span class="n">imagevec</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">image</span><span class="p">);</span>
<span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">datamatrix</span> <span class="o">=</span> <span class="n">ef</span><span class="p">.</span><span class="n">createdatamatrix</span><span class="p">(</span><span class="n">imagevec</span><span class="p">);</span>
<span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">projection</span> <span class="o">=</span> <span class="n">ef</span><span class="p">.</span><span class="n">project</span><span class="p">(</span><span class="n">datamatrix</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
<span class="c1">//Load the existing vector.</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">initialProjectionPoints</span><span class="p">;</span>
<span class="c1">//Load existing face features</span>
<span class="n">jsize</span> <span class="n">intArrayLen</span> <span class="o">=</span> <span class="n">env</span><span class="o">-></span><span class="n">GetArrayLength</span><span class="p">(</span><span class="n">projectionToCompare</span><span class="p">);</span>
<span class="n">jfloat</span> <span class="o">*</span><span class="n">pointvecBody</span> <span class="o">=</span> <span class="n">env</span><span class="o">-></span><span class="n">GetFloatArrayElements</span><span class="p">(</span><span class="n">projectionToCompare</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">intArrayLen</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">initialProjectionPoints</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">pointvecBody</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">newProjectionpoints</span> <span class="o">=</span> <span class="n">ef</span><span class="p">.</span><span class="n">matToVector</span><span class="p">(</span><span class="n">projection</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">comparisonScores</span> <span class="o">=</span> <span class="n">ef</span><span class="p">.</span><span class="n">compareProjections</span><span class="p">(</span><span class="n">newProjectionpoints</span><span class="p">,</span> <span class="n">initialProjectionPoints</span><span class="p">);</span>
<span class="n">env</span><span class="o">-></span><span class="n">ReleaseFloatArrayElements</span><span class="p">(</span><span class="n">projectionToCompare</span><span class="p">,</span> <span class="n">pointvecBody</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">env</span><span class="o">-></span><span class="n">ReleaseStringUTFChars</span><span class="p">(</span><span class="n">pcafilename</span><span class="p">,</span> <span class="n">pcastring_char</span><span class="p">);</span>
<span class="n">env</span><span class="o">-></span><span class="n">ReleaseStringUTFChars</span><span class="p">(</span><span class="n">imagestring</span><span class="p">,</span> <span class="n">imagestring_char</span><span class="p">);</span>
<span class="c1">//Return the comparison score</span>
<span class="k">return</span> <span class="n">comparisonScores</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>With this, we are set to build our <code class="language-plaintext highlighter-rouge">.so</code> library and hand it to java code to use successfully.</p>
<p>The protocol to build the library is stated in the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>.
It looks like below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find_package<span class="o">(</span>JNI REQUIRED<span class="o">)</span>
<span class="c">#Include jni directories</span>
include_directories<span class="o">(</span><span class="k">${</span><span class="nv">JNI_INCLUDE_DIRS</span><span class="k">}</span><span class="o">)</span>
file <span class="o">(</span>GLOB_RECURSE SOURCE_FILE src/<span class="k">*</span>.h src/<span class="k">*</span>.cpp<span class="o">)</span>
<span class="nb">set</span><span class="o">(</span>LIBS <span class="k">${</span><span class="nv">JNI_LIBRARIES</span><span class="k">}</span><span class="o">)</span>
add_library<span class="o">(</span>matchprojection SHARED <span class="k">${</span><span class="nv">SOURCE_FILE</span><span class="k">}</span><span class="o">)</span>
target_link_libraries<span class="o">(</span>matchprojection <span class="k">${</span><span class="nv">LIBS</span><span class="k">}</span><span class="o">)</span>
</code></pre></div></div>
<p>Building the project should generate a <code class="language-plaintext highlighter-rouge">libmatchprojection.so</code> file which can be added to your java project.</p>
<p>However, for android it is a little tricky: rather than the official cmake build toolchain, Android has its own build tool for native cpp code, the Native Development Kit (NDK). This is what will be used to build your c++ native code so that the generated <code class="language-plaintext highlighter-rouge">.so</code> is compatible with Android.
Building <code class="language-plaintext highlighter-rouge">.so</code> for android using NDK will be a whole tutorial on its own hence I will be skipping that for now.
But generally, once the build is complete using the NDK, you will have the same <code class="language-plaintext highlighter-rouge">libmatchprojection.so</code> which can be used in your android application.</p>
<h3 id="using-the-so-file-within-android-application">Using the <code class="language-plaintext highlighter-rouge">.so</code> file within Android application.</h3>
<p>Using the generated library within the Android application is just the same as using it in any java application.
The idea is to load the native library and then make a call with the required parameter to the methods in the initially created Classes that correspond to the native jni method.
To load the library in any java program including android, ensure the <code class="language-plaintext highlighter-rouge">.so</code> library is in the class path of your program, some will put it under a folder called <code class="language-plaintext highlighter-rouge">lib</code>. With this I can use a function call to load the library as below:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">System</span><span class="o">.</span><span class="na">loadLibrary</span><span class="o">(</span><span class="s">"matchprojection"</span><span class="o">);</span>
</code></pre></div></div>
<p>Finally I can make a call to my methods created earlier with the parameters necessary and the native code then can execute for me.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="nc">MatchFeatures</span> <span class="n">mf</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">MatchFeatures</span><span class="o">();</span>
<span class="kt">float</span> <span class="n">matchscores</span> <span class="o">=</span> <span class="n">mf</span><span class="o">.</span><span class="na">matchpcafeatures</span><span class="o">(</span><span class="n">storedpcafilepath</span><span class="o">,</span> <span class="n">imagebase64string</span><span class="o">,</span> <span class="n">anotherimageprojectionarray</span><span class="o">);</span>
</code></pre></div></div>
<p>If you notice carefully, the method was declared native and has no body; this is because the program understands that a native cpp method has been defined for this classpath name.</p>
<h3 id="conclusion">Conclusion:</h3>
<p>This approach is a fundamental way to build and deploy projects that include native code into a java environment. Also, most complex algorithmic problems, including machine learning and core computer vision projects, can be reasoned about in cpp, mostly because it is fast and there are production-ready libraries available.
Even tensorflow has an api for loading deep learning models in c++, as well as for using <code class="language-plaintext highlighter-rouge">tflite</code> models in c++.
Hence, I see this walkthrough as a way to build robust, production-ready engines that leverage high-precision mathematics, and most especially as a way to deploy machine learning models to various environments, in particular android in an offline setting.</p>Adekunle Babatundeadekunleba@gmail.comPicking up my initial article where I built a PCA model using CPP, in this article I will be loading the saved model whose values are stored in a yaml file. This script will be developed to form an inference engine bundled as a .so file for deployment on a linux based environment. For other platforms, the cpp codes can be compiled to produce either a dll or dylib for windows and mac respectively. Since we are more concerned about deploying the model on an android application, the focus will be building the .so file inference engine from our saved model.Machine Learning Model Deployment With Cpp Part 12019-11-12T00:00:00-08:002019-11-12T00:00:00-08:00https://adekunleba.github.io/Machine-learning-model-deployment-with-cpp-Part-1<p>Recently I have been fascinated with how interesting it is to build mathematically inclined applications and deploy them at scale, without any restriction of model size, platform or need for api calls. I know that python has enough libraries for working with prototypes of machine learning projects; however, not many are talking about scaling these projects, especially when you don’t want to do that over a web api.
I believe true intelligence shouldn’t rely only on calls to an api for a model to be available at scale. This fascination led me to research what it would take to use C++ for machine learning and general intelligence.
My conviction is that both matlab’s and python’s mathematical strength is built on underlying c/c++ code; hence, scaling technology to handle the mathematical computations involved in machine learning with blazing speed in mind will likely require the ability to dig into low-level programming, most especially c/c++. I chose c++.
Also, I wondered why most computer science schools ensure that there is a c/c++ curriculum in their programs; this emphasizes the case for using c/c++ in scalable technology.</p>
<p>After learning c++ using <a href="https://www.udemy.com/course/free-learn-c-tutorial-beginners/">a udemy hands-on course</a>, the challenge now is to integrate a simple face recognition application into an android app.</p>
<p>The write-up will include some preliminary notes on what you need to build a c++ project and deploy it on android or any other os environment.</p>
<p>Components:</p>
<ul>
<li>Set up a c++ project for machine learning with opencv.</li>
<li>Learning a PCA to generate eigen faces</li>
<li>Setting up a .so inference library for multiplatform deployment</li>
<li>Developing a jni wrapper for the inference library.</li>
<li>Deployment in Android using ndk and android studio.</li>
</ul>
<p>The first part will be to learn a machine learning algorithm with OpenCV. In this case we are going to explore the most basic face recognition algorithm: principal component analysis for eigenfaces. The machine learning community is very familiar with this in python, especially with tools such as scikit-learn, but when production, and most especially offline/on-device production, comes to mind, the need to approach this from a different dimension is expedient.
OpenCV comes with a very good api for learning principal component analysis, and it is quite straightforward once you have your data all set up.
Here are the steps:</p>
<ul>
<li>Set up a c++ cmake project using OpenCV.
The key part of your CMakeLists.txt is ensuring that the OpenCV library is in your project and available for your library to compile. Ideally, before requesting that cmake find OpenCV for you, it is important to have the OpenCV library installed on your machine.</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find_package<span class="o">(</span>OpenCV REQUIRED<span class="o">)</span>
include_directories<span class="o">(</span><span class="k">${</span><span class="nv">OpenCV_INCLUDE_DIRS</span><span class="k">}</span><span class="o">)</span>
<span class="nb">set</span><span class="o">(</span>LIBS <span class="k">${</span><span class="nv">OpenCV_LIBS</span><span class="k">}</span><span class="o">)</span>
target_link_libraries<span class="o">(</span>featureCalculation <span class="k">${</span><span class="nv">LIBS</span><span class="k">}</span><span class="o">)</span>
</code></pre></div></div>
<p><em>PS: I had to set up my cmake differently for training and inference. I will share both</em></p>
<ul>
<li>Learn a Principal component analysis.
Now that OpenCV is available, learning PCA is quite straightforward. Here is the logic involved:</li>
<li>Read all image data as an array.
Using <code class="language-plaintext highlighter-rouge">cv::glob</code> from opencv, all filenames ending with <code class="language-plaintext highlighter-rouge">.jpg, .png and/or .jpeg</code> can be collected, each image read with <code class="language-plaintext highlighter-rouge">cv::imread</code>, and the data preprocessing of the image data can proceed.</li>
<li>Crop faces as PCA does much better with the face image than the whole image.
I have found Multi-task Cascaded Convolutional Networks (MTCNN) to be the most reliable yet simple and minimal face detection and cropping model out there. There is an implementation in C++ using the original model with a Caffe network in Opencv (<em>topic for another article - Face detection and cropping in production using Opencv</em>).</li>
<li>Convert cropped faces to gray scale.
This part is pretty straight forward. Using <code class="language-plaintext highlighter-rouge">cv::cvtColor(originalimagemat, grayscaleimagematcontainer, cv::COLOR_BGR2GRAY)</code> we can convert an original BGR image to grayscale in OpenCV.</li>
<li>Other preprocessing - One other preprocessing step is to ensure that the data types are right. This is very important because C++ is heavy on precision of data types. It is quite easy to introduce bugs at this point, hence the need to carefully ensure that your data types are right. Apart from this, it is a good idea to normalize your image data and resize all images to a consistent shape; PCA works only if the data are of the same dimension. Out of the box, opencv provides the following functions to take care of the preprocessing:
<code class="language-plaintext highlighter-rouge">cv::resize(originalImage, containermatofnewimage, size)</code> for resizing the image and <code class="language-plaintext highlighter-rouge">originalmat.convertTo(newmatnormalized, CV_32FC3, 1/255.0)</code> for normalization.</li>
<li>Convert all images to a data table - A data table is somewhat like a single table of data where each element is represented as a row; interestingly, we can think of each row of our data table as an individual image in its flattened format. The essence of PCA is to project the image values onto a few columns with a distinct representation of that image. Therefore, the data table will have rows equal to the number of images in the training dataset, while the columns will be the normalized grayscale values of each image.
To create the data table, <code class="language-plaintext highlighter-rouge">std::vector</code> can be used to hold all the images (with the hope that they fit in memory), which is then copied to every row of the data matrix. Here is a helper function that does exactly that from a vector of image mats.</li>
</ul>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="nf">createdatamatrix</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">></span> <span class="n">imageArray</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">datamatrix</span><span class="p">(</span><span class="k">static_cast</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">imageArray</span><span class="p">.</span><span class="n">size</span><span class="p">()),</span> <span class="n">imageArray</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">rows</span> <span class="o">*</span> <span class="n">imageArray</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">cols</span><span class="p">,</span> <span class="n">CV_32F</span><span class="p">);</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">imageArray</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">imageRow</span> <span class="o">=</span> <span class="n">imageArray</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">clone</span><span class="p">().</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">rowIData</span> <span class="o">=</span> <span class="n">datamatrix</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
<span class="n">imageRow</span><span class="p">.</span><span class="n">copyTo</span><span class="p">(</span><span class="n">rowIData</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">datamatrix</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">cv::reshape()</code> helps transform mat arrays to different shapes; with <code class="language-plaintext highlighter-rouge">(1, 1)</code> it literally means we want the data to exist in a single row.</p>
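<p>Conceptually, that flattening is just a row-major copy of the pixel values. A small standard-C++ sketch of what <code class="language-plaintext highlighter-rouge">reshape(1, 1)</code> produces for one single-channel image (the helper name <code class="language-plaintext highlighter-rouge">flattenImage</code> is hypothetical, and plain vectors stand in for <code class="language-plaintext highlighter-rouge">cv::Mat</code>):</p>

```cpp
#include <cassert>
#include <vector>

// Flatten an H x W image stored as rows into one long row, mirroring what
// cv::Mat::reshape(1, 1) produces for a single-channel image.
std::vector<float> flattenImage(const std::vector<std::vector<float>> &img) {
    std::vector<float> row;
    for (const std::vector<float> &r : img)
        row.insert(row.end(), r.begin(), r.end()); // row-major copy
    return row;
}
```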
<ul>
<li>Learn the actual Principal component analysis algorithm.
Now that we have created the data table with the preprocessed face images, learning the PCA model is usually smooth. As smooth as passing the data table to an OpenCV pca instance with your expected maximum components, like such: <code class="language-plaintext highlighter-rouge">cv::PCA pca(datatable, cv::Mat(), cv::PCA::DATA_AS_ROW, number_of_components)</code>. With this we have a learned PCA written in C++ which is ready for production use.</li>
</ul>
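<p>For intuition about what such a fit computes, here is a toy standard-C++ sketch that recovers the first principal direction of a small 2-D dataset by power iteration on its covariance matrix (the helper name <code class="language-plaintext highlighter-rouge">firstPrincipalDirection</code> is hypothetical and illustrative only; <code class="language-plaintext highlighter-rouge">cv::PCA</code> computes the mean and eigenvectors for full image data far more efficiently):</p>

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// First principal direction of 2-D points via power iteration on the
// 2x2 scatter matrix. A toy stand-in for what a PCA fit computes.
std::vector<float> firstPrincipalDirection(const std::vector<std::vector<float>> &pts) {
    // Mean of the data (the "mean" a saved pca model carries).
    float mx = 0.0f, my = 0.0f;
    for (const auto &p : pts) { mx += p[0]; my += p[1]; }
    mx /= pts.size();
    my /= pts.size();
    // Scatter (unnormalized covariance) matrix entries.
    float cxx = 0.0f, cxy = 0.0f, cyy = 0.0f;
    for (const auto &p : pts) {
        float dx = p[0] - mx, dy = p[1] - my;
        cxx += dx * dx; cxy += dx * dy; cyy += dy * dy;
    }
    // Power iteration: repeatedly multiply by the matrix and renormalize;
    // the vector converges to the dominant eigenvector.
    float vx = 1.0f, vy = 0.0f;
    for (int i = 0; i < 100; ++i) {
        float nx = cxx * vx + cxy * vy;
        float ny = cxy * vx + cyy * vy;
        float n = std::sqrt(nx * nx + ny * ny);
        vx = nx / n;
        vy = ny / n;
    }
    return {vx, vy};
}
```

<p>On points scattered along the line y = x, this converges to a direction close to (0.707, 0.707), the first eigenvector a PCA fit would report for that toy data.</p>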
<p>To transfer this model for use in any environment, opencv has a <code class="language-plaintext highlighter-rouge">FileStorage</code> object that allows you to save a mat as it is. Thus I can save this file and pass its filename over jni for OpenCV to recreate the model instance for inference. It is as simple as that to serve the model.
To conclude this part of the article, I will show how to write the mat object using OpenCV. The values in the saved model come out as either a YAML or an XML file, depending on the user’s preference.</p>
<ul>
<li>Save the pca model object for inference in a production environment.
What exactly needs to be saved from the pca object are the mean and eigenvectors of the trained pca; sometimes it may be a good idea to also save the eigenvalues in case you want to construct your own eigenfaces projection, but OpenCV already provides a <code class="language-plaintext highlighter-rouge">pca->project</code> method that helps with inference and eigenfaces generation. In any case, here is how to save your model:</li>
</ul>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Facepca</span><span class="o">::</span><span class="n">savemodel</span><span class="p">(</span><span class="n">cv</span><span class="o">::</span><span class="n">PCA</span> <span class="n">pcaModel</span><span class="p">,</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">filename</span><span class="p">){</span>
<span class="n">cv</span><span class="o">::</span><span class="n">FileStorage</span> <span class="n">fs</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span><span class="n">cv</span><span class="o">::</span><span class="n">FileStorage</span><span class="o">::</span><span class="n">WRITE</span><span class="p">);</span>
<span class="n">fs</span> <span class="o"><<</span> <span class="s">"mean"</span> <span class="o"><<</span> <span class="n">pcaModel</span><span class="p">.</span><span class="n">mean</span><span class="p">;</span>
<span class="n">fs</span> <span class="o"><<</span> <span class="s">"e_vectors"</span> <span class="o"><<</span> <span class="n">pcaModel</span><span class="p">.</span><span class="n">eigenvectors</span><span class="p">;</span>
<span class="n">fs</span> <span class="o"><<</span> <span class="s">"e_values"</span> <span class="o"><<</span> <span class="n">pcaModel</span><span class="p">.</span><span class="n">eigenvalues</span><span class="p">;</span>
<span class="n">fs</span><span class="p">.</span><span class="n">release</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
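<p>For reference, the file that <code class="language-plaintext highlighter-rouge">FileStorage</code> writes in YAML mode looks roughly like the sketch below. The shapes and values here are illustrative only (they depend on your image size and the number of retained components), not output from a real trained model:</p>

```yaml
%YAML:1.0
---
# "mean" is a 1 x (image pixel count) row vector; dt "d" means double.
mean: !!opencv-matrix
   rows: 1
   cols: 10304
   dt: d
   data: [ 129.3, 87.1, 92.4 ]      # truncated, illustrative values
# One eigenvector per retained component, same width as the mean.
e_vectors: !!opencv-matrix
   rows: 80
   cols: 10304
   dt: d
   data: [ 0.012, -0.034, 0.007 ]   # truncated, illustrative values
e_values: !!opencv-matrix
   rows: 80
   cols: 1
   dt: d
   data: [ 5.1e+03, 3.2e+03 ]       # truncated, illustrative values
```

<p>On the inference side, the same <code class="language-plaintext highlighter-rouge">FileStorage</code> object opened in READ mode can stream these keys back into Mat objects to rebuild the PCA.</p>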
<p>In the next part of this article, I will explain how to set up an inference library in C++ while using JNI as your connector to your inference library.</p>Adekunle Babatundeadekunleba@gmail.comRecently I have been fascinated with how interesting it is to build mathematically inclined applications and deploy them at scale, without any restriction of model size, platform, or need for API calls. I know that Python has enough libraries for working with prototypes of machine learning projects; however, not many people are talking about scaling these projects, especially when you don’t want to do that over a web API. I believe true intelligence shouldn’t rely only on calls to an API for a model to be available at scale. This fascination led me to research what it would take to use C++ for machine learning and general intelligence. My conviction is that both MATLAB’s and Python’s mathematical strengths are based on underlying C/C++ code; hence, fundamentally scaling technology to work with the mathematical computations involved in machine learning, with blazing-fast performance in mind, will likely require that you are able to dig into low-level programming, most especially with C/C++; I chose C++. Also, I wondered why most computer science schools ensure that there is a C/C++ curriculum in their programs; this emphasizes the reasoning for using C/C++ for scalable technology.Linear Regression With Tensorflow Updated2019-01-01T00:00:00-08:002019-01-01T00:00:00-08:00https://adekunleba.github.io/Linear-Regression-with-Tensorflow-Updated<p>Yaay!!! Welcome to the new year 2019, this is going to be my first post in the year, I am glad about it as I get to start the year on a very high vibe.</p>
<figure>
<img src="/images/tfeage.png" />
</figure>
<p>To the matter at hand: Data Science and Machine Learning have come a long way and are still evolving fast. One of the tools that has helped achieve great advancement in Machine Learning in particular is Tensorflow. I <a href="https://adekunleba.github.io/Linear-Model-with-Tensorflow/">posted</a> some time ago about starting out with Tensorflow with the most basic of examples - building a linear regression model - however, a lot has changed since then. Some of the things that have changed include the following:</p>
<ul>
<li>Tensorflow now uses tf.keras as its base model definition</li>
<li>Tensorflow uses <code class="language-plaintext highlighter-rouge">eager execution</code> as its default approach to running your models, rather than the initial <code class="language-plaintext highlighter-rouge">graph and session</code> approach</li>
<li>A lot of clean-ups are also on the way: the <code class="language-plaintext highlighter-rouge">tf.slim</code> api of Tensorflow is being merged into the core api.</li>
<li>Tensorflow now feels more python-like than the initial C-like style it forced developers into.</li>
</ul>
<p>However, with this comes a new way of working with Tensorflow. I made some examples of how to quickly get started with Tensorflow in <code class="language-plaintext highlighter-rouge">Eager Execution</code> mode. This mode is actually painless compared to what we had in Tensorflow even just one year ago.</p>
<p>It allows you to build an imperative, define-by-run approach to doing machine learning, and it has auto-differentiation that automatically calculates the gradients of your forward-pass operations.</p>
<p><strong>Basic Operations With Tensorflow</strong></p>
<p>Let’s go into how to do some basic operations in Tensorflow. It is important to remember that Tensorflow is a library for <code class="language-plaintext highlighter-rouge">numerical computation</code>; too often people forget this notion and see Tensorflow only as a Deep Neural Network library. We should settle on the fact that Tensorflow is a library that helps you get the most out of data by abstracting away the pain of building <strong>many</strong> machine learning models, including Linear and Logistic Regression. I believe this is the reason why every basic Tensorflow tutorial starts with a Linear Regression: so that the <code class="language-plaintext highlighter-rouge">numerical computation</code> foundation of Tensorflow is not forgotten.</p>
<p>To begin with Tensorflow in eager execution, it’s imperative that we import tensorflow and establish that we want to use eager execution at the top of our application. Once enabled the first time, it stays enabled for the run-time of the current application.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="n">tf</span><span class="p">.</span><span class="n">enable_eager_execution</span><span class="p">()</span>
</code></pre></div></div>
<p>a. Do basic Operations like sum and multiplication</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Add and multiply 2 scalars
</span><span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="c1">#Add and multiply 2 matrices.
</span><span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
<span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]],</span> <span class="p">[[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">]]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
<span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]],</span> <span class="p">[[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">]]))</span>
</code></pre></div></div>
<p>b. Square values and get the mean of scalars and matrices</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Square a scalar and Vector
</span><span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">([[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">]]))</span>
<span class="c1">#Get the mean of 1-D and 2-D array of Number
</span><span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span><span class="mi">8</span> <span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">([[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">[</span><span class="mi">40</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
</code></pre></div></div>
<p>c. Get the minimum value in a 1-D and 2-D Array</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Get the minimum value in a 1-D Array
</span><span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">reduce_min</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span><span class="mi">8</span> <span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]))</span>
<span class="c1">#Get the minimum value in a 2-D Tensor along a particular axis
</span><span class="k">print</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">reduce_min</span><span class="p">([[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">[</span><span class="mi">40</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
</code></pre></div></div>
<p>d. Some few other operations</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Generating Random Numbers
</span><span class="n">x</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">random_uniform</span><span class="p">([</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1">#Creating Tensor Constants in Eager Execution
</span><span class="n">y</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">([[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">[</span><span class="mi">40</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span>
<span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span>
<span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c1">#Calculate Sigmoid of an Array of Numbers
</span><span class="n">sig</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">sig</span><span class="p">)</span>
</code></pre></div></div>
<p><strong><em>Full notebook can be found <a href="https://colab.research.google.com/github/adekunleba/tensorflow_tutorial/blob/master/IntroductiontoTensorflowOperations.ipynb">here</a></em></strong></p>
<p>Let’s take the operations a little further and try to build a simple linear regression model. I like to think of machine learning projects as a scientist trying to make some data useful: either to build a model that essentially abstracts away some portion of human thought, or to build a product from data that eventually makes life easier for humans by helping to solve challenging scenarios. For example, building a self-driving car can be seen as applying machine learning algorithms to environment data, including images and weather data, to help the car make its next decision.</p>
<p>So, we can generate toy data using Tensorflow’s random number generation techniques, which gives us what we need for training a linear regression model.</p>
<p>Highlights of the code below is to :</p>
<ul>
<li>Generate random numbers with tf.random_normal as your x</li>
<li>Compute a random y for each element in X</li>
<li>Convert the generated data to Tensorflow Dataset.</li>
</ul>
<p><em>A Tensorflow Dataset allows you to build a robust data pipeline and ensures you keep track of X and y accordingly throughout your pipeline. tf.data.Dataset represents a sequence of elements, where each element is a Tensor object (Tensorflow’s internal data representation) usually holding a single training sample with its X and y. You can think of X as anything from the columns of a record in your data to images. Another good advantage of a Tensorflow Dataset is how easily you can batch it.</em></p>
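<p><em>As a plain-Python analogy (not Tensorflow itself), batching a dataset really just means slicing aligned X and y arrays together; the helper name below is illustrative:</em></p>

```python
import numpy as np

# Toy dataset: 10 samples with 3 features each, plus aligned labels.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

def batches(X, y, batch_size):
    """Yield (X, y) pairs batch by batch, keeping features and labels aligned."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

for bx, by in batches(X, y, batch_size=4):
    print(bx.shape, by.shape)  # two batches of 4, then a final batch of 2
```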
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="nf">make_synthetic_data</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">noise_level</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">num_batches</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">batch</span><span class="p">(</span><span class="n">_</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">random_normal</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">w</span><span class="p">)[</span><span class="mi">0</span><span class="p">]])</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="o">+</span> <span class="n">noise_level</span> <span class="o">+</span> <span class="n">tf</span><span class="p">.</span><span class="n">random_normal</span><span class="p">([])</span>
<span class="k">return</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span>
<span class="c1">#Use Cpu to do this
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"/device:CPU:0"</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="nb">range</span><span class="p">(</span><span class="n">num_batches</span><span class="p">).</span><span class="nb">map</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
</code></pre></div></div>
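<p>If you want to sanity-check this data generation scheme outside of Tensorflow, it can be sketched in plain NumPy. This is a stand-in for the function above, not the Tensorflow version; note that here the noise is scaled by <code class="language-plaintext highlighter-rouge">noise_level</code>, which is the usual convention:</p>

```python
import numpy as np

def make_synthetic_batch(w, b, noise_level, batch_size, seed=0):
    """Draw x ~ N(0, 1) and compute y = x.w + b plus small Gaussian noise."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)                    # shape (num_features, 1)
    x = rng.standard_normal((batch_size, w.shape[0]))
    y = x @ w + b + noise_level * rng.standard_normal((batch_size, 1))
    return x, y

x, y = make_synthetic_batch([[-2.0], [4.0], [1.0]], 0.5, 0.01, batch_size=64)
print(x.shape, y.shape)  # (64, 3) (64, 1)
```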
<p>With the data generation function, we can go ahead to generate our data</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">true_w</span> <span class="o">=</span> <span class="p">[[</span><span class="o">-</span><span class="mf">2.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">4.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">]]</span>
<span class="n">true_b</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.5</span><span class="p">]</span>
<span class="n">noise_level</span> <span class="o">=</span> <span class="mf">0.01</span>
<span class="c1"># Training constants.
</span><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">64</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">make_synthetic_data</span><span class="p">(</span><span class="n">true_w</span><span class="p">,</span> <span class="n">true_b</span><span class="p">,</span> <span class="n">noise_level</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
</code></pre></div></div>
<p>The next procedure is usually to explore your data; however, since we are working with a toy example, we can skip that part and move on to the meat of what we are actually here for - how Tensorflow will help us build a model against this data.</p>
<p>The approach is highlighted thus:</p>
<ul>
<li>Make a model - we use <code class="language-plaintext highlighter-rouge">tf.keras.layers.Dense</code>. The question that comes to mind is: how does a Dense layer represent regression? A Dense layer’s input is usually a flattened array, which is synonymous with your normal <code class="language-plaintext highlighter-rouge">X</code> data with many columns, on which you want to learn your <code class="language-plaintext highlighter-rouge">weight</code> and <code class="language-plaintext highlighter-rouge">bias</code>. Furthermore, the loss function used helps the model fully capture what we are trying to do.</li>
<li>Write the loss function; for our purposes it is going to be Mean Squared Error, one of the <code class="language-plaintext highlighter-rouge">loss functions</code> used in training regression models.</li>
<li>Write the auto-differentiation algorithm</li>
<li>Write the optimizer to use to train the model.</li>
<li>Train your model</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">layers</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">mse</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">xs</span><span class="p">,</span> <span class="n">ys</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="n">xs</span><span class="p">),</span> <span class="n">ys</span><span class="p">)))</span>
<span class="n">tfe</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">eager</span>
<span class="n">loss_and_grads</span> <span class="o">=</span> <span class="n">tfe</span><span class="p">.</span><span class="n">implicit_value_and_gradients</span><span class="p">(</span><span class="n">f</span><span class="o">=</span><span class="n">mse</span><span class="p">)</span>
<span class="c1">#Build Optimizer
</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">GradientDescentOptimizer</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span>
<span class="c1">#Train the model
</span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">ys</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tfe</span><span class="p">.</span><span class="n">Iterator</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span>
<span class="n">loss</span><span class="p">,</span> <span class="n">grads</span> <span class="o">=</span> <span class="n">loss_and_grads</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">ys</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Iteration %d: loss = %s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">loss</span><span class="p">.</span><span class="n">numpy</span><span class="p">()))</span>
<span class="n">optimizer</span><span class="p">.</span><span class="n">apply_gradients</span><span class="p">(</span><span class="n">grads</span><span class="p">)</span>
</code></pre></div></div>
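<p>To see that the loop above really is just gradient descent on a mean squared error, here is the same training procedure written by hand in NumPy. The gradients here are the analytic MSE gradients rather than Tensorflow’s auto-differentiation, and the names and constants are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
true_w, true_b = np.array([[-2.0], [4.0], [1.0]]), 0.5

# Synthetic data: y = X.w + b plus a little noise, mirroring the toy dataset.
X = rng.standard_normal((640, 3))
y = X @ true_w + true_b + 0.01 * rng.standard_normal((640, 1))

w = np.zeros((3, 1))
b = 0.0
learning_rate = 0.1

for step in range(200):
    preds = X @ w + b
    err = preds - y
    loss = np.mean(err ** 2)             # MSE, as in the lambda above
    grad_w = 2 * X.T @ err / len(X)      # analytic d(loss)/dw
    grad_b = 2 * err.mean()              # analytic d(loss)/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w.ravel(), b)  # converges close to [-2, 4, 1] and 0.5
```

<p>Tensorflow’s value lies in doing exactly this - computing the gradients and applying the update - for models far too complex to differentiate by hand.</p>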
<p>This approach is the bare minimum for building a machine learning model with Tensorflow; in most cases you will just have to structure your code more robustly around the above process when dealing with bigger projects such as image recognition, text analysis, and other machine learning projects. There is a sample in the accompanying notebook showing how to bring the process above together into a Python script of functions and classes.</p>
<p>Some other things you may want to add in between are <code class="language-plaintext highlighter-rouge">Tensorboard</code> to monitor the training procedure and <code class="language-plaintext highlighter-rouge">model checkpointing</code> to ensure models are saved for later use.</p>
<p><strong><em>The full code for training a Linear regression model can be found <a href="https://colab.research.google.com/github/adekunleba/tensorflow_tutorial/blob/master/Linear_Regression_with_Tensorflow.ipynb">here</a></em></strong></p>Adekunle Babatundeadekunleba@gmail.comYaay!!! Welcome to the new year 2019, this is going to be my first post in the year, I am glad about it as I get to start the year on a very high vibe.Using Deep Learning Model To Create A Face Recognition System2018-11-10T00:00:00-08:002018-11-10T00:00:00-08:00https://adekunleba.github.io/Using-Deep-Learning-Model-to-create-a-face-recognition-system<p>I recently had to work on a project to build a face-recognition engine that will be used in production. Here I am going to describe, at a high level, the things that were done.</p>
<h2 id="what-is-a-face-recognition-system">What is a Face Recognition system</h2>
<p>A face recognition system is a system that has the ability to use a person’s facial properties for identification, verification, or recognition. Early facial recognition systems made use of Principal Component Analysis (PCA) to generate face features; the features generated with this method were termed Eigenfaces. Eigenfaces are lower-dimensional representations of a face image: consider a cropped face image, then use Principal Component Analysis to build a lower-dimensional representation of the pixel values in that face image.</p>
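<p>The eigenface idea can be sketched in a few lines of NumPy: centre the flattened face images, take an SVD, and keep the top components as the projection basis. Everything below is a toy illustration, with random data standing in for real face images:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.standard_normal((100, 64 * 64))   # 100 fake "face images", flattened

mean_face = faces.mean(axis=0)
centred = faces - mean_face

# Rows of Vt are the principal axes of the face set; the top-k are the "eigenfaces".
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
k = 20
eigenfaces = Vt[:k]                            # shape (20, 4096)

# A face's low-dimensional signature is its projection onto those axes.
signature = (faces[0] - mean_face) @ eigenfaces.T
print(signature.shape)  # (20,) instead of (4096,)
```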
<h2 id="deep-learning-for-better-face-features">Deep Learning for better Face Features.</h2>
<p>With the advent of Deep Learning models, feature generation from faces is now done in a much more effective and accurate way. State-of-the-art face recognition technologies now employ Deep Neural Networks, as observed on Labeled Faces in the Wild, one of the benchmarks used to compare the effectiveness of face recognition systems. Currently the leading models are all Deep Learning models: Facebook’s DeepFace has an accuracy of 0.9735 and Google’s FaceNet an accuracy of 0.9963, compared to the original EigenFaces, which has an accuracy of 0.6002.</p>
<p>Thus, to build a production-ready face recognition system, there are some basic components that your application should have.</p>
<ol>
<li>Face Detection and Alignment system</li>
<li>A face feature generating model</li>
<li>Verification/Identification/Recognition layer.</li>
</ol>
<p>All these three components must be coupled together to have a functional state of the art Face Recognition system.</p>
<h2 id="components-of-a-face-recognition-system">Components of a face Recognition system</h2>
<h3 id="a-face-detection-and-alignment-component">a. Face Detection and Alignment Component:</h3>
<p>For most face recognition systems, it’s important to extract the face portion of images before passing it to your model. In the DeepFace paper, the first line of the abstract reads:</p>
<blockquote>
<p>In modern face recognition, the conventional pipeline consists of four stages: detect ⇒ align ⇒ represent ⇒ classify</p>
</blockquote>
<p>There are many technologies used in face detection and alignment.
For detection, we can explore the use of weak classifier cascades; OpenCV has a couple of Haar feature cascades that worked well for our use case.
There is also the Multi-task Cascaded Convolutional Network, which can do both face detection and alignment.
Here is a link to learn about both <a href="https://facedetection.com/algorithms/">Haar Features</a> and <a href="https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html">MTCNN</a></p>
<p>An ongoing project of mine is to write MTCNN in Scala, but for now my project made use of the Haar Cascade Classifiers.</p>
<h3 id="b-a-face-feature-generating-model">b. A Face Feature generating model</h3>
<p>As mentioned above, the most accurate face-feature-generating models for a face recognition system are Deep Learning models. This is also a vital part that determines a lot in the whole system. Many models are available and have been open-sourced; Facebook’s DeepFace and Google’s FaceNet are very prominent open-sourced face-feature-generating models.
Facebook’s DeepFace contains two convolutions with a max pooling in between them, locally connected convolutions, and a fully connected network. Local convolutions use a different set of learned weights at every pixel, whereas a normal convolution uses the same set of weights at all locations. In simple terms, you basically use a giant set of position-specific filters instead of sharing one filter everywhere. <a href="https://prateekvjoshi.com/2016/04/12/understanding-locally-connected-layers-in-convolutional-neural-networks/">Here</a> you can read more about local convolutions.</p>
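<p>The difference is easy to see in NumPy for the 1-D case: a normal convolution slides one shared weight vector across every position, while a locally connected layer learns a separate weight vector per output position. The shapes below are toy values, for illustration only:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)        # a 1-D input signal
ksize, out_len = 3, 8              # valid positions: 10 - 3 + 1 = 8

# Normal convolution: ONE kernel shared by all 8 output positions.
shared_w = rng.standard_normal(ksize)
conv_out = np.array([x[i:i + ksize] @ shared_w for i in range(out_len)])

# Locally connected layer: a DIFFERENT kernel at each output position.
local_w = rng.standard_normal((out_len, ksize))
local_out = np.array([x[i:i + ksize] @ local_w[i] for i in range(out_len)])

print(conv_out.shape, local_out.shape)   # both (8,)
print(shared_w.size, local_w.size)       # 3 shared weights vs 24 local weights
```

<p>The outputs have the same shape; the locally connected layer simply pays for its position-specific weights with many more parameters.</p>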
<h3 id="c-a-final-metric-learning-layer-for-verificationidentificationrecognition">c. A Final Metric learning layer for Verification/Identification/Recognition:</h3>
<p>A face passed through a signature-generating model produces a D-dimensional feature vector that is representative of a person’s face. Once the model generates the face signatures, a metric learning algorithm or some other distance-based algorithm compares the generated features for closeness in distance.
Some metrics used for this include:</p>
<ul>
<li>Cosine Metrics</li>
<li>Siamese Networks</li>
</ul>
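<p>Cosine similarity, for instance, is just the dot product of two L2-normalised feature vectors; two signatures of the same person should score close to 1. The vectors below are toy values, not real embeddings:</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two face-signature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_person = cosine_similarity([0.9, 0.1, 0.4], [0.85, 0.15, 0.38])
different = cosine_similarity([0.9, 0.1, 0.4], [-0.2, 0.9, 0.1])
print(same_person, different)  # high for a likely match, low otherwise

# A verification layer then applies a threshold (0.8 here is arbitrary):
is_match = same_person > 0.8
```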
<h2 id="conclusion">Conclusion:</h2>
<p>To deploy this composed method, you definitely need to implement these components against a database, serve your Tensorflow model from one point, and wrap the whole project in a backend service. In my case I used Akka HTTP to build the backend, which pushed the project towards using the JVM for Tensorflow model serving.</p>
<p><em>I hope this article benefits someone who is willing to build a simple face recognition engine for themselves.</em></p>Adekunle Babatundeadekunleba@gmail.comI recently had to work on a project to build a face-recognition engine that will be used in production. Here I am going to describe, at a high level, the things that were done.Summary From Thinking Like A Data Scientist(part 1)2018-11-05T00:00:00-08:002018-11-05T00:00:00-08:00https://adekunleba.github.io/Summary-from-Thinking-Like-a-Data-Scientist(Part-1)<h3 id="here-are-some-of-the-vital-points-i-got-from-the-book-think-like-a-data-scientist-by-brian-godsey">Here are some of the vital points I got from the book Think Like a Data Scientist by Brian Godsey.</h3>
<p><a href="https://www.amazon.com/gp/product/1633430278?ie=UTF8&camp=213733&creative=393185&creativeASIN=1633430278&linkCode=shr&tag=amz0e61-20&linkId=ISJCGPBN76JSSY6L&s=books&qid=1528546816&sr=1-19&keywords=data+science&refinements=p_n_feature_browse-bin:2656022011">Link to a copy</a></p>
<p>This is a summary of Chapters 1 and 2.</p>
<p>If a self-driving car makes it 90% of the way to the finish line but is washed into a ditch by a rainstorm, it would hardly be appropriate to say that the autonomous car doesn’t work.</p>
<p><strong>Priorities:</strong></p>
<ul>
<li>Knowledge first</li>
<li>Technology second</li>
<li>Opinions third</li>
</ul>
<p>Use this to help settle disputes in the never-ending battle between the various concerns of every data science project—
for example, software versus statistics, changing business need versus project timeline, data quality versus accuracy of results.</p>
<p>Often people are blinded by what they think is possible, and they forget to consider that it might not be possible or that it might be much more expensive than estimated. GUILTY!!!</p>
<p>Some key things to try and incorporate as a DS.</p>
<ol>
<li>Documentation - for your future self and for others who may work on your project</li>
<li>Code repository and versioning.</li>
<li>Code organization - useful especially for code re-use</li>
<li>Ask questions - of business stakeholders, software engineers, PMs.</li>
<li>Stay close to the data - sometimes a simple algorithm is all you need.</li>
</ol>
<h4 id="on-project-goals-and-client-expectations">On Project Goals and Client expectations:</h4>
<p>A notable difference between many fields and data science is that in data science, even an experienced data scientist may not know whether a customer’s wish is possible.
Ensuring the data scientist communicates the uncertainties to expect in a project should be one of the early TODOs.</p>
<p>Treat goal discussions between a client and yourself as a search for common ground, since expectations can be high and may be unrealistic considering many factors.
Sometimes it is good to lay the foundation for the final product with suggestions of what it will look like.</p>
<p>It is important to distinguish facts from opinions. Judgements should be based on facts.</p>
<p>You will need to learn how to manage the salesman’s claims about your in-development projects. It will often happen that a client is selling the project in development, which you are not even sure will work 100%.</p>
<ul>
<li>No one ever wants to declare failure, but data science is a risky business, and to pretend that failure never happens is a failure in itself.</li>
</ul>
<p>Two dangerous pitfalls from data you may want to avoid:</p>
<ol>
<li>Expecting data to answer questions it can’t</li>
<li>Asking questions of the data that don’t solve the original problem.</li>
</ol>
<p><strong>The beauty of negative results:</strong> It probably forces you to rethink your project towards a more informed solution.</p>
<p>Litmus test for the goal of a DS project:</p>
<ul>
<li>What is possible?</li>
<li>What is valuable?</li>
<li>What is efficient?</li>
</ul>Adekunle Babatundeadekunleba@gmail.comHere are some of the vital points I got from the book Think Like a Data Scientist by Brian Godsey.Deploying A Machine Learning Model With Tensorflow Serving, Flask And Docker (part 1)2018-10-16T00:00:00-07:002018-10-16T00:00:00-07:00https://adekunleba.github.io/Deploying-a-machine-learning-model-with-tensorflow-serving,-flask-and-docker-(Part-1)<p>Having worked with Machine Learning models for quite some time, the basic challenge has been deploying the model in production. With this in mind Google created Tensorflow Serving, which is supposed to be an ideal environment for running models in production.</p>
<p>Tensorflow Serving’s main points include the ability to build a servable, which is the fundamental abstraction of Tensorflow Serving. Servables are built using <code class="language-plaintext highlighter-rouge">SavedModelBundle</code> in Tensorflow. Servables exist to add flexibility to serving models in production: you can serve multiple models to multiple products at the same time with a single instance of Tensorflow Serving, and you can also use this to do some form of A/B testing.</p>
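<p>For illustration, a single Tensorflow Serving instance can be pointed at several servables through a model config file passed with the <code class="language-plaintext highlighter-rouge">--model_config_file</code> flag. The model names and base paths below are hypothetical:</p>

```protobuf
model_config_list {
  config {
    name: "segmentation_model"
    base_path: "/models/deeplab"
    model_platform: "tensorflow"
  }
  config {
    name: "experiment_model"
    base_path: "/models/deeplab_ab_test"
    model_platform: "tensorflow"
  }
}
```

<p>Clients then select a servable by setting the model name on each request, which is what makes serving multiple products, or an A/B split, possible from one server.</p>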
<p>Tensorflow Serving uses the gRPC protocol to connect your client application to the server, although a REST API version is now available. gRPC is a service approach that makes use of Protocol Buffers, a powerful binary serialization toolset. The claim for using gRPC with Tensorflow Serving is that it is fast; a recent article compared gRPC with the REST API version, and you can check it <a href="https://medium.com/@avidaneran/tensorflow-serving-rest-vs-grpc-e8cef9d4ff62">here</a>.</p>
<p>Usually you will need a client to convert your data to a Protocol Buffer so that it can be sent to the Tensorflow server. In this article I will describe how I used Flask-RESTPlus to build a POC that connects to Tensorflow Serving via gRPC. In a later article I will write about how we migrated this to Scala for production.</p>
<p>We are going to use a pretrained model, DeepLabV3, that Google produced, build a servable from it, and build a flask-restplus service around it.</p>
<p>a. Converting DeepLabV3 to a Servable</p>
<p>Here is a simple script that prepares a frozen Tensorflow model for Tensorflow Serving:</p>
<script src="https://gist.github.com/adekunleba/a147d1df892014f37624d5e4c699556f.js">
</script>
<p>We can then serve this model using docker by running the following simple command:</p>
<script src="https://gist.github.com/adekunleba/3b1946ba8843ac9a02d58bf77d522ed3.js">
</script>
<p>Replace <code class="language-plaintext highlighter-rouge">path/to/model</code> with the folder containing the saved models generated from the builder, and the model name with the name of your latest model. Running the above docker script will pull a Tensorflow Serving docker image and expose the gRPC port, which is 8500, for your application to connect to for serving your model.</p>
<p>In Part II, I will show how Flask helps convert incoming data requests into the gRPC format your model can predict on.</p>
<p><em>I hope this article benefits someone trying to take models to production in their applications with Tensorflow Serving</em></p>
Adekunle Babatundeadekunleba@gmail.comHaving worked with Machine Learning models for quite some time, the basic challenge has been deploying the model in production. With this in mind Google created Tensorflow Serving, which is supposed to be an ideal environment for running models in production.