Jekyll2022-02-24T18:13:44+00:00https://sleebapaul.github.io/feed.xmlSleeba PaulLet's tell some stories together :)A primer to design patterns in software development2021-04-20T02:00:00+00:002021-04-20T02:00:00+00:00https://sleebapaul.github.io/design-patterns<h3 id="disclaimer">Disclaimer</h3>
<p>I know people who have been software engineers for years and still don’t care/know about design patterns in software engineering. And that’s completely okay. Though this post discusses design patterns in detail, I don’t want to be a gatekeeper of so-called “best engineering practices” for writing software. If you don’t use them, you don’t use them. Still, these practices become inevitable when the codebase slowly takes the form of a 20 ft. Burmese python that is incredibly hard to maintain or scale. So it is advisable to refactor the code before it “smells”.</p>
<h1 id="design-patterns">Design Patterns</h1>
<p>Design patterns are templates, or blueprints, for solutions to recurring problems in software projects. They are not exact solutions; rather, they outline how to approach a problem. More specifically, design patterns deal with object-oriented software design.</p>
<h3 id="who-invented-the-design-patterns">Who invented the design patterns?</h3>
<p>There are no inventors per se when it comes to design patterns. When the same conflict occurs across multiple software projects, someone addresses the issue by coming up with a pattern. Eventually, the pattern gets polished and revised by others, and finally it becomes an established design pattern in the community. Having said that, there is an interesting story about the origin of design patterns. The concept of patterns was first mentioned in “A Pattern Language: Towns, Buildings, Construction” by Christopher Alexander. This book had nothing to do with software engineering; interestingly, it dealt with designing urban environments. The idea was picked up by four authors: Erich Gamma, John Vlissides, Ralph Johnson, and Richard Helm. In 1995, they published a book called “Design Patterns: Elements of Reusable Object-Oriented Software”, which is considered the bible of design patterns.</p>
<h3 id="why-design-patterns">Why design patterns?</h3>
<p>Interesting story. But why should someone use design patterns?</p>
<ul>
<li><strong>Software projects are susceptible to change</strong></li>
</ul>
<p>Do you know a commercial software project that was written once and served forever? If such projects were the norm, the profession of software development would be in jeopardy. From changes in services and subscriptions to API reformulation to growth in the number of consumers, a software project has umpteen moving parts. Ideally, when a change is made, the existing codebase should not fall apart. But as we all know, that is often not the case.</p>
<p>So, how do we minimize the damage? No single paradigm is a silver bullet that can do it alone. But since design patterns are formulated from problems that commonly occur in projects, abiding by them weeds out known problems and thereby minimizes the damage. After all, if we have a proven solution, why bother starting from scratch?</p>
<ul>
<li><strong>To make the best out of Object Oriented Programming (OOP)</strong></li>
</ul>
<p><a href="https://stackoverflow.blog/2020/09/02/if-everyone-hates-it-why-is-oop-still-so-widely-spread/" target="_blank">OOP is widely accepted in the community for multiple reasons</a>. Design patterns make the best out of OOP by efficiently utilizing encapsulation, inheritance, and polymorphism. If you can architect an OOP codebase that adheres to design patterns, you definitely have a massive advantage compared to your peers. OOP with design patterns puts forward the following ideas:</p>
<ol>
<li>Encapsulate what varies <br /></li>
<li>Program to an interface, not an implementation <br /></li>
<li>Favor composition over inheritance <br /></li>
</ol>
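<p>These three ideas can be sketched in a few lines of Python. This is only an illustration with hypothetical class names: the caller depends on an abstract <code>Notifier</code> interface and composes a concrete one, so the part that varies is encapsulated behind the interface.</p>

```python
from abc import ABC, abstractmethod


# "Program to an interface": callers depend on this abstraction,
# never on a concrete notifier class.
class Notifier(ABC):
    @abstractmethod
    def send(self, message: str) -> str: ...


class EmailNotifier(Notifier):
    def send(self, message: str) -> str:
        return f"email: {message}"


class SMSNotifier(Notifier):
    def send(self, message: str) -> str:
        return f"sms: {message}"


# "Favor composition over inheritance": OrderService *has a* Notifier
# (the varying part is encapsulated) instead of subclassing one.
class OrderService:
    def __init__(self, notifier: Notifier):
        self.notifier = notifier

    def place_order(self, item: str) -> str:
        return self.notifier.send(f"order placed: {item}")


print(OrderService(EmailNotifier()).place_order("book"))  # email: order placed: book
print(OrderService(SMSNotifier()).place_order("pen"))     # sms: order placed: pen
```

<p>Swapping the delivery channel requires no change to <code>OrderService</code>, which is exactly the point of encapsulating what varies.</p>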
<ul>
<li><strong>Ease of communicating high-level architecture</strong></li>
</ul>
<p>If two developers know design patterns well, it is much easier for them to communicate the high-level design of a project. Say you’re designing a food delivery app’s API backend: you might choose the Facade pattern, where a Facade class handles queries from the client and delegates tasks to complex, interdependent backend functions. Now both developers have an idea of how the whole system is going to be built. Most of the time, individual programmers see only a tiny part of the codebase and assume that no overall design exists.</p>
<p style="text-align: center">
<img src="/assets/factory_pattern/facade_pattern.png" />
</p>
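<p>A minimal Python sketch of the Facade idea above, with hypothetical service names chosen to mirror the food-delivery example: clients call one simple entry point, and the facade coordinates the subsystems behind it.</p>

```python
# Hypothetical subsystems of a food delivery backend.
class RestaurantService:
    def confirm(self, order: str) -> str:
        return f"restaurant confirmed {order}"


class PaymentService:
    def charge(self, order: str) -> str:
        return f"charged for {order}"


class CourierService:
    def assign(self, order: str) -> str:
        return f"courier assigned to {order}"


# The Facade exposes one simple entry point and delegates to the
# interdependent subsystems, so API clients never touch them directly.
class OrderFacade:
    def __init__(self):
        self.payment = PaymentService()
        self.restaurant = RestaurantService()
        self.courier = CourierService()

    def place_order(self, order: str) -> list:
        return [
            self.payment.charge(order),
            self.restaurant.confirm(order),
            self.courier.assign(order),
        ]


print(OrderFacade().place_order("pizza"))
```

<p>If the backend later reshuffles how payment, confirmation, and dispatch interact, only the facade changes; the API surface stays stable.</p>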
<h3 id="types-of-patterns">Types of patterns</h3>
sleebapaulDisclaimerMy first Deep Learning Project as a Mentor2020-08-24T14:00:00+00:002020-08-24T14:00:00+00:00https://sleebapaul.github.io/my-first-mentoring-experience<p>I’ve been part of multiple Machine Learning (ML) projects since 2015. The list includes open-source side projects, collaborated works like <a href="https://sleebapaul.github.io/auriakathi/" target="_blank">Auria Kathi</a>, and projects at work.</p>
<p>But this year, I’ve unexpectedly bumped into a new role in my ML career.</p>
<blockquote>
<p>Mentoring a Deep Learning (DL) project.</p>
</blockquote>
<p>I’ve helped people with their projects before as well. But oftentimes that was limited to giving my opinion on someone’s strategy for solving a problem with ML. Sometimes my inputs worked really well; sometimes they didn’t. But this time things were a bit different: I got to mentor a team throughout the journey of a DL project.</p>
<p>This article describes what I’ve learned through that journey, citing some situations that happened in the project, so the reader gets an idea of both the project and the mentoring.</p>
<p>The following inferences are not just about leading a Machine Learning project; they can be applied to mentoring in general. Thus, the writeup is not deeply technical. People who are familiar with DL will simply relate to the examples more easily.</p>
<h3 id="work-with-passionate-people">Work with passionate people</h3>
<p>It is not the first time people have approached me for a mentor role. Most of the time what happens is that mentees lose their interest in the project as time goes by, and eventually it becomes a zombie project. Since I’ve had a couple of bad experiences of this nature, I usually try to avoid mentor roles as much as possible.</p>
<p>DL projects are so hyped nowadays that everyone wants a piece of them. But what these “AI enthusiasts” don’t understand is that the elemental work of collecting, maintaining, and grooming the data is not so cool. It is a tedious job.</p>
<p>When the team approached me last March, on the first call, I wanted the team to fix their data pipeline and get back to me. I didn’t elaborate on anything about training models.</p>
<p>This is a tactic I learned the hard way. Time and time again, people just don’t return once they are allotted the primary but non-glamorous tasks. If they do return, they are serious about what they are doing.</p>
<p>This initial examination makes mentoring remarkably serene, because tedious work tests people. It reveals who they are. Passionate people don’t just give up.</p>
<blockquote>
<p>And guess what: this team returned with their data pipeline fixed.</p>
</blockquote>
<h3 id="interruption-is-a-tricky-business">Interruption is a tricky business</h3>
<p>I’m not a big fan of micromanagement as it shrinks the room for self-improvement and ownership. But the tricky question for a mentor is when to intervene?</p>
<p>If you interfere with a task too often, the mentees will feel micromanaged. At the same time, setting them loose will not yield desirable results either. This is a balance every mentor should find on their own. Of course, it depends on the mentees and the situation. But there should be a balance.</p>
<p>I’ll give an example from the project. It was an image classification task that required image preprocessing. My mentees were introduced to Python through DL, and sometimes their lack of foundational Python knowledge got them into trouble.</p>
<p>I wanted them to solve these issues on their own. But the solutions they found made things more convoluted, bringing in numerous tight couplings, external dependencies, and easily breakable patches. Fast-forward through a series of such “hot fixes”, and the codebase became a serious mess. A point came where they could not fix a problem without breaking some other part. Then I realized that I could’ve intervened a little earlier to guide them toward better approaches. It was avoidable mayhem.</p>
<p>Sure, they learned a lot from those mistakes, but from a project perspective, a mentor should intervene periodically before things get too tangled. Also, this was an academic project; an industrial project on a tight timeline can’t afford such slow learning experiments.</p>
<h3 id="judgments-can-go-wrong">Judgments can go wrong</h3>
<p>When everything works as expected, there will be no questions asked. But when things go wrong, we will start questioning everything. At some point, when nothing works, we will even start doubting ourselves.</p>
<p>This is a common phase in almost all projects. We have a plan to get it working, and the plan fails miserably in practice.</p>
<p>In our project, such an issue surfaced at the beginning. With limited computing resources, most teams look to transfer learning when a new object detection or image classification problem pops up. We tried to do the same.</p>
<p>Using a MobileNet architecture pre-trained on the ImageNet dataset, we changed the output layer and performed transfer learning. The results were pathetic. We tuned the hyperparameters, added more data … but no luck.</p>
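<p>The mechanics we followed can be sketched as below. This is not our actual code: a tiny made-up network stands in for the ImageNet-pretrained MobileNet so the snippet stays self-contained, but the steps are the same — freeze the pretrained backbone, replace the output head, and train only the head.</p>

```python
import torch
import torch.nn as nn

# A tiny stand-in "backbone" instead of a real ImageNet-pretrained
# MobileNet (hypothetical layers, just to show the mechanics).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Step 1: freeze the "pretrained" weights.
for param in backbone.parameters():
    param.requires_grad = False

# Step 2: attach a fresh output layer for the new classes;
# only this head will be trained.
model = nn.Sequential(backbone, nn.Linear(8, 2))

out = model(torch.randn(1, 3, 64, 64))      # a dummy 3-channel image
trainable = [p for p in model.parameters() if p.requires_grad]
print(out.shape, len(trainable))            # only the new head's weight and bias train
```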
<p>It was a testing moment for me as a mentor. From my experience, transfer learning is a reasonable method for such applications. My mentees were equally confused and looked to me for a resolution. I had to find a valid reason why the approach didn’t work, so I dug deep into the problem. The findings were:</p>
<ol>
<li>
<p>The input data is not just images but 3D MRI scans of the brain. These scans go through multi-level preprocessing before the final grayscale 2D image is created.</p>
</li>
<li>
<p><a href="https://forums.fast.ai/t/transfer-learning-for-medical-radiography/4931/9" target="_blank">Transfer learning from ImageNet models to large medical images has been very poor. We lose too much information going from high-resolution grayscale to low-resolution RGB which is required input for MobileNet pre-trained in ImageNet.</a></p>
</li>
</ol>
<p>This finding changed the whole strategy of the project. We moved to a small network that could be trained with the limited computation facility we had and still gave great results.</p>
<p>So, it’s okay to be wrong. You’re the mentor, not omnipotent. When things go wrong, accept it with an open mind, find what is going wrong, and adapt. Don’t take the failures personally. If you do, you’ll find reasons to blame someone else or yourself and eventually become hopeless. Rather, treat surprises as an opportunity to grow.</p>
<blockquote>
<p>Sounds a bit dramatic, but it is a fact.</p>
</blockquote>
<h3 id="stick-with-the-success-of-the-project">Stick with the success of the project</h3>
<p>You can’t get away from ideological disputes if you are actively involved in a project. Sometimes, people will agree to disagree. And that is okay. Being the mentor doesn’t mean that you are right about everything.</p>
<blockquote>
<p>Don’t be a gatekeeper.</p>
</blockquote>
<p>Don’t let your ego affect the success of the project. As a mentor, your priority is the completion of the project. It is not a one-man show of all your ideas and knowledge. You’re a hero if and only if the project is a success; otherwise, it is just a tool for feeding your ego. Don’t be the person who stops cooperating because their ego is hurt.</p>
<h3 id="be-empathetic-and-patient">Be empathetic and patient</h3>
<p>As a mentor, you are more exposed to the technology stack and best practices than your mentees. But people forget this fact at times. Your mentees don’t have the expertise you have; if they did, they wouldn’t need you as a mentor. Always keep this in mind.</p>
<p>Expect mistakes from the team. Mistakes that look silly to you may not be obvious to them. So, when they make mistakes, be empathetic. Help them correct them iteratively. But never allow the same mistake to happen twice.</p>
<p>Also, avoid shouting as much as possible. Not only does it fail to solve the issue; no one goes home happy after a callous conversation.</p>
<blockquote>
<p>And have patience. Good things will take time.</p>
</blockquote>
<p>During the project, I lost my cool once, and it was about documenting the results properly. The team rectified it immediately, and the documentation ended up in a better format. But when I think about the incident now, I feel I could’ve handled the situation much better.</p>
<h3 id="set-templates">Set templates</h3>
<p>This is the best thing you can do as a mentor. Roll up your sleeves and set a basic infrastructure for your mentees. This approach helps them in multiple ways.</p>
<ol>
<li>
<p>The mentees will have a reference to look up and start from. Most of the time, starting right is the biggest challenge for a newbie.</p>
</li>
<li>
<p>By setting up a template, you have control over what you’re expecting. You can easily detect if the mentees derail far from the original architecture and bring them back.</p>
</li>
<li>
<p>Implicitly, you will help the mentees to understand why you’ve chosen such an architecture and how it is going to help them while doing the project. On the next project, they are more likely to apply these best practices and infrastructure as they’ve already tasted the fruit of it.</p>
</li>
</ol>
<p>For this project, we had to do error analysis, try different datasets, do hyperparameter tuning, and explore different network architectures. So I set up a template containing modules that can be easily reused across experiments. I convinced the team of the plug-and-play nature of these modules and encouraged them to write code in a modular manner with minimal external dependencies and coupling. It helped them run faster experiments without breaking anything else. Now they are more likely to reuse that design paradigm in future projects, since they already know why such an approach was taken.</p>
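<p>The flavor of such a template can be sketched as follows. Every name here is hypothetical, not our actual code; the point is the plug-and-play idea: each experiment swaps one module (dataset, preprocessing, or training) without touching the others.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One experiment = one combination of swappable modules."""
    name: str
    load_data: Callable[[], list]        # swappable dataset module
    preprocess: Callable[[list], list]   # swappable preprocessing module
    train: Callable[[list], float]       # swappable model/training module

    def run(self) -> float:
        data = self.preprocess(self.load_data())
        return self.train(data)


# A toy "baseline" experiment; each lambda stands in for a real module.
baseline = Experiment(
    name="baseline",
    load_data=lambda: [1, 2, 3, 4],
    preprocess=lambda xs: [x / 4 for x in xs],   # e.g. normalization
    train=lambda xs: sum(xs) / len(xs),          # stands in for a training loop
)
print(baseline.run())  # 0.625
```

<p>Trying a new preprocessing step or architecture then means constructing a new <code>Experiment</code> with one field changed, which is what made the fast iteration described above possible.</p>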
<p>Furthermore, I’ve introduced them to version control, better coding practices, and documentation tricks. Once the project got over, they were familiarised with a lot of topics other than Deep Learning.</p>
<h3 id="give-appreciation-period">Give appreciation. Period.</h3>
<p>People love getting an acknowledgment of the work they do. If you like someone’s work, tell them. Don’t pretend to be a tough boss.</p>
<p>When appreciating, try to appreciate the attitude rather than the achievement itself. For example, at the beginning of our project, the model was overfitting and we required more data. But collecting MRI scans from the ADNI website was not straightforward, and the team spent a lot of time on that tedious task. On each iteration, the model was fed with more data, and slowly we tackled the overfitting issue. The team collected data from almost 1,500 patients to improve the classifier.</p>
<p>When the validation accuracy crossed 90 percent, we appreciated the team’s tenacity in bringing in the data rather than the 90-percent figure itself. Appreciating the attitude helps people recognize the productive traits in themselves and nurture those values for future endeavors.</p>
<p>One more thing to note: don’t overdo appreciation. Appreciation should be earned; it can’t be a dopamine-generating exercise. With that attitude, the trap of mediocre work can be shunned. A mentor should have an instinct for when to appreciate and when to push the mentees forward.</p>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>There are no cardinal rules for mentoring. Rules change with situations and mentees. But there are some core values that can ease the process. I hope you reflect on my findings, and let me know yours.</p>
<p>If you would like to take a deeper dive into our project, have a look at the following links.</p>
<p>GitHub Repo: https://bit.ly/3hEZnIO
Medium article: https://bit.ly/2ENsoU7</p>sleebapaulI’ve been part of multiple Machine Learning (ML) projects since 2015. The list includes open-source side projects, collaborated works like Auria Kathi, and projects at work.Auria Kathi Powered by Microsoft Azure2019-05-30T11:30:00+00:002019-05-30T11:30:00+00:00https://sleebapaul.github.io/auria-powered-by-aml<p>On January first this year, Fabin Rasheed and I launched <a href="https://sleebapaul.github.io/auriakathi/" target="_blank">Auria Kathi, the AI Poet Artist living in the cloud</a>. Auria writes a poem, draws an image according to the poem, then colors it with a random mood. All these creative actions are carried out without any human intervention.</p>
<p>Auria Kathi is an anagram for “AI Haiku Art”. Everything from her face to poems to art is artificially generated. We try to push the limits of generative art here. Auria is envisioned as a hub for artificial artistry. In the coming days, she will be creating more varieties of digital art.</p>
<h2 id="social-media-presence-of-auria">Social Media presence of Auria</h2>
<p>Auria has two social media handles to publish her work.</p>
<ul>
<li>Instagram: <a href="https://www.instagram.com/auriakathi/" target="_blank">https://www.instagram.com/auriakathi/</a></li>
<li>Twitter: <a href="https://twitter.com/AuriaKathi" target="_blank">https://twitter.com/AuriaKathi</a></li>
</ul>
<p><img src="../assets/auria_aml/auria_instagram.png" alt="image-center" class="align-center" /></p>
<p>So far, Auria has gathered 1,300+ followers across these channels. The crowd includes artists, researchers, technologists, and policymakers. Throughout this year, Auria will be posting her work daily.</p>
<h2 id="auria-going-florence-biennale-2019">Auria going Florence Biennale 2019</h2>
<p>In October 2019, we are participating in the 12th edition of the Florence Biennale to exhibit Auria’s work in the contemporary digital art section. On an international platform for art, Auria’s AI-produced work will be discussed with greater prominence, as will the question of how creative machines will shape our future by inspiring artists to come up with novel ideas.</p>
<h2 id="auria-on-news-and-publications">Auria on news and publications</h2>
<p>Auria has been featured on multiple international technology and art platforms. Some of them include:</p>
<ol>
<li>
<p><a href="https://www.creativeapplications.net/member-submissions/auria-kathi-an-ai-artist-living-in-the-cloud/" target="_blank">Creative Applications Network</a></p>
</li>
<li>
<p><a href="https://codingblues.com/2019/01/11/fabin-sleeba-and-wonderful-auria/" target="_blank">Coding Blues</a></p>
</li>
<li>
<p><a href="https://us15.campaign-archive.com/?u=c7e080421931e2a646364e3ef&id=d1a15e8502" target="_blank">Creative AI Newsletter</a></p>
</li>
<li>
<p><a href="https://towardsdatascience.com/auriakathi-596dfb8710d6" target="_blank">Towards Datascience</a></p>
</li>
</ol>
<h2 id="lack-of-perfect-algorithms">Lack of perfect algorithms</h2>
<p>Considering the current state-of-the-art deep learning algorithms, we might not be able to build an advanced application like Auria with a single algorithm or network. But the components of Auria’s creative pursuit can each be emulated with an individual state-of-the-art algorithm. This realization settled us on a pipeline architecture for Auria.</p>
<h2 id="engineering-architecture-of-auria">Engineering Architecture of Auria</h2>
<p>The engineering pipeline of Auria consists of mainly three components.</p>
<ol>
<li>
<p>An LSTM-based language model, trained on 3.5 million haikus scraped from Reddit. The model is used to generate artificial poetry.</p>
</li>
<li>
<p>A text-to-image network called AttnGAN, from Microsoft Research, which converts the generated haiku into an abstract image.</p>
</li>
<li>
<p>A photorealistic style transfer algorithm that selects a random style image from the WikiArt dataset and transfers its color and brush strokes to the generated image. The WikiArt dataset is a collection of 4k+ curated artworks, aggregated on the basis of the emotions they induce in human beings when shown to them.</p>
</li>
</ol>
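<p>Setting the heavy models aside, the pipeline itself is simple function composition: each stage consumes the previous stage’s output. A toy sketch, with hypothetical stand-ins for the three stages:</p>

```python
# Hypothetical stand-ins for the three stages; in the real pipeline each
# is a separate model (LSTM, AttnGAN, FastPhotoStyle), but the wiring is
# just one stage feeding the next.
def generate_haiku(seed: str) -> str:
    return f"haiku({seed})"


def haiku_to_image(haiku: str) -> str:
    return f"image({haiku})"


def stylize(image: str, style: str) -> str:
    return f"styled({image}, {style})"


def auria_pipeline(seed: str, style: str) -> str:
    # poem -> abstract image -> color/brush-stroke transfer
    return stylize(haiku_to_image(generate_haiku(seed)), style)


print(auria_pipeline("rain", "wikiart-42"))
# styled(image(haiku(rain)), wikiart-42)
```

<p>This chaining is also why the isolation problems discussed next matter: each stage has its own environment, yet the data must flow cleanly from one to the next.</p>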
<p><img src="../assets/auria_aml/auria_pipeline.png" alt="image-center" class="align-center" /></p>
<h2 id="challenges-on-pipelining-different-algorithms">Challenges on pipelining different algorithms</h2>
<p>Stacking individual state-of-the-art algorithms helped us build Auria, but the challenge of this approach was linking these components to work together in a common space. The problems we ran into were:</p>
<ol>
<li>
<p>Modifying the official implementations of the research papers, which were developed and tested in different environments (e.g., different Python versions).</p>
</li>
<li>
<p>Some of the algorithms that use GPUs to train and test are tightly coupled to specific CUDA versions.</p>
</li>
<li>
<p>Each algorithm needs to run in an isolated container so that it can be brought onto a common production platform without disrupting the other environments.</p>
</li>
<li>
<p>The data flow between the components should be fluid.</p>
</li>
<li>
<p>Deep learning algorithms demand high-end computation. Along with isolation between steps, we required powerful computation resources like GPUs at each step.</p>
</li>
<li>
<p>Deploying Auria as a web application, so that people can come and experience her creative pursuit, despite the diverse development settings.</p>
</li>
</ol>
<h2 id="microsoft-azure-machine-learning-pipelines-aml-pipelines"><a href="https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines" target="_blank">Microsoft Azure Machine Learning Pipelines (AML Pipelines)</a></h2>
<p>A machine learning workflow is a pipeline: prepare the data; build, train, and tune models; then deploy the best model to production to get predictions. <a href="https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines" target="_blank">Azure Machine Learning pipelines</a> turn such workflows into reusable templates for machine learning scenarios.</p>
<p>We adapted this conception of AML pipelines to create an advanced application like Auria. Moving to the platform was not difficult, since the basic building blocks of AML pipelines are designed with scaled applications in mind.</p>
<h2 id="why-aml-pipelines-for-auria">Why AML Pipelines for Auria?</h2>
<ol>
<li>
<p>The most popular programming language in the machine learning realm is Python. Since AML Pipelines has a Python 3 SDK, we were not worried about moving the existing stack to the platform. All three algorithms we use for Auria are implemented in Python, and we could replicate the results using the SDK without any hassle.</p>
</li>
<li>
<p>In Auria’s pipeline, we have models trained by ourselves as well as models that use pre-trained weights. The development environments of these algorithms were distinct, so we needed strict isolation at each step of the pipeline. Thanks to the platform, each step in an AML pipeline is a dockerized container. This lets us build individual steps without disturbing the settings of others. All the dockerized steps are portable, so we could reuse these components across multiple experiments.</p>
</li>
<li>Each step can be provisioned with a compute engine, CPU or GPU, as per the need. We used powerful GPU instances for quick training and for tuning the hyperparameters of our models. Distributed computation is also available on the platform for parallelizing heavy computation needs.</li>
<li>
<p>For our style transfer algorithm, the CUDA dependency was strict and did not match the default docker environment of the platform. Thankfully, the Azure Machine Learning platform allows adding custom docker containers rather than using the default containers for every application. This feature gives absolute freedom to recreate almost any configuration in AML Pipelines.</p>
</li>
<li>Deploying Auria so that people can experience her creative process is something we are currently working on. AML pipeline deployment helps bypass the time otherwise spent building backend APIs: deployment readily provides REST endpoints for the pipeline output, which can be consumed as per convenience.</li>
</ol>
<p>Auria is a perfect use case for Azure Machine Learning Pipelines, considering the perks we enjoyed while using the platform. In further collaboration with the Microsoft Azure ML team, we plan to scale Auria up: strengthening her creative pipeline with more advanced algorithms, creating an interactive experience for her followers by deploying her online, and trying new varieties of artificially generated digital art content.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Thanks, Microsoft for AML Pipelines ❤️
Love,
Auria 😉
</code></pre></div></div>sleebapaulOn January first this year, Fabin Rasheed and I have launched Auria Kathi, the AI Poet Artist living in the cloud. Auria writes a poem, draw an image according to the poem, then color it with a random mood. All these creative actions are carried out without any human intervention.Auria Kathi — An artist in the clouds.2019-01-08T21:00:00+00:002019-01-08T21:00:00+00:00https://sleebapaul.github.io/auriakathi<h2 id="what-is-art-is-it-the-unsaid-the-unsettling">What is art? Is it the unsaid? The unsettling?</h2>
<p>The last few years have been very happening in the field of Generative/Procedural art. We have seen some of the exciting applications of this field hitting mainstream media — may it be generative architecture like the <a href="https://vimeo.com/74350367" target="_blank">Digital Grotesque</a>, or the <a href="https://www.forbes.com/sites/williamfalcon/2018/10/25/what-happens-now-that-an-ai-generated-painting-sold-for-432500/#5faf7225a41c" target="_blank">AI generated paintings</a> which sold for a bang or even simple apps which produce an artistic rendering of photographs using Neural Style Transfer like <a href="https://prisma-ai.com/" target="_blank">Prisma</a>.</p>
<p>Generative art could be, in a broad sense defined as art generated using a set of instructions, usually using a computer. The art could be produced as a digital version, a physical version or as a combination of both. The definition of the field is still as broad as the definition of “Design” and many new forms of expressions have been brought under this title.</p>
<p>Last year, a friend of mine — <a href="https://www.linkedin.com/in/nurecas/" target="_blank">Fabin Rasheed</a> — and I got together to talk about this field. I love to play with Machine Learning algorithms, and Fabin loves Art and Design. We were conversing about how Instagram has become a portfolio website. Being known for original posts rather than shared content, Instagram seemed like the perfect place to showcase works by creatives and to create engagement. We were looking at some of the artists on Instagram when the idea struck us: what if an artist living in the cloud posted regularly on Instagram — a robot, a machine, a piece of code which creates art regularly, posts it on Instagram, and keeps creating engagement?</p>
<p>This is how <a href="https://www.instagram.com/auriakathi/" target="_blank">Auria</a> was born. <a href="https://www.instagram.com/auriakathi/" target="_blank">Auria Kathi</a> is an anagram for “AI Haiku Art”. We started off trying to create a bot which continuously produced haikus (to us this meant short poems). We wanted Auria to create poems which do not make complete sense in the beginning but reveal some meaning eventually.</p>
<p><strong>Some of Auria’s poetry</strong>
<img src="/assets/auria/poems.jpeg" alt="image-center" class="align-center" /></p>
<p>Post this, we generated images based on the poems and finally coloured (styled) them with emotions from the poem and broke them into sets. For the curious among you, the full technical details are given towards the end of this article.</p>
<p><img src="/assets/auria/art.png" alt="image-center" class="align-center" /></p>
<p>Auria has now become a standalone bot which requires no maintenance — she keeps posting a poem and an artwork every day for one year and lives entirely on the cloud. So far, she has gathered up some followers and comments by humans as well as others like her ;). She has also started self-promotion.</p>
<p><code class="language-plaintext highlighter-rouge">Auria is the First artist living completely in the cloud and with an Instagram portfolio. Her studio opened in Instagram on 01–01–2019.</code></p>
<p>We also gave Auria a generated face. We tried to make it a generic, yet generated face. She lives!</p>
<p>Although Auria does not require any maintenance, we are continuously improving her. We are planning on creating better poetry, imagery and relations between them. We are also working on a chatbot which will respond to some of the comments and messages. Further down the line, Auria is envisioned as an Artificial Artist’s Studio. A hub for artificial artistry. We are planning to work on creating generated videos using Auria’s face giving her a voice and generated content to talk on. Who knows what’s in store for this little baby. She is the first of her kind!</p>
<p>Follow Auria here: <a href="https://www.instagram.com/auriakathi" target="_blank">Auria on Instagram</a></p>
<h2 id="technical-details">Technical details</h2>
<p>Auria uses three major algorithms to produce poems and art.</p>
<h4 id="language-modeling">Language modeling</h4>
<p>The first step is to generate the poetry, which is a language modeling task. We fed around <a href="https://github.com/bfaure/hAIku" target="_blank">3.5 million haikus</a> into a Long Short-Term Memory (LSTM) network for training, and the trained network is then used to generate haikus. The code is written using the PyTorch library, and Google Colab was used for training.</p>
<p>Sample:</p>
<p>“It’s good as you can<br />
and pull it on that power<br />
and go home.<br />
Sorry.”</p>
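<p>For illustration, here is a minimal PyTorch sketch of an LSTM language model of this kind. The vocabulary size and dimensions are illustrative, not the ones used for Auria; the real model is trained with a cross-entropy loss to predict the next token and then sampled to produce haikus.</p>

```python
import torch
import torch.nn as nn


class HaikuLM(nn.Module):
    """Next-token prediction over a haiku corpus (sizes are illustrative)."""

    def __init__(self, vocab_size=100, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)  # scores for each next token

    def forward(self, tokens):                  # tokens: (batch, seq_len) token ids
        out, _ = self.lstm(self.embed(tokens))  # (batch, seq_len, hidden_dim)
        return self.head(out)                   # (batch, seq_len, vocab_size)


model = HaikuLM()
logits = model(torch.randint(0, 100, (2, 5)))   # two sequences of five tokens
print(logits.shape)
```

<p>Sampling then amounts to repeatedly taking the logits of the last position, drawing a token, and feeding it back in until the haiku is complete.</p>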
<h4 id="text-to-image">Text to image</h4>
<p>The next task is to convert the generated haiku into an image. We used the <a href="https://arxiv.org/abs/1711.10485" target="_blank">Attentional Generative Adversarial Network (AttnGAN)</a>, from a November 2017 Microsoft Research paper, which can generate output shapes from input text. AttnGAN begins with a crude, low-res image and then improves it over multiple steps to come up with a final image. Its architecture is a mix of GANs and attention networks, which demands a multimodal optimization.</p>
<p>Since AttnGAN is a large network to train and our computation facilities were minimal, we used the pre-trained weights of the network, originally trained on the MS COCO dataset. The network can generate an output image of size 256x256. The sampling of AttnGAN was done in Google Colab.</p>
<p>Sample:</p>
<p><img src="/assets/auria/raw.png" alt="image-center" class="align-center" /></p>
<h4 id="coloring-the-generated-image">Coloring the generated image</h4>
<p>To bring in Auria’s mood and emotions, we transferred colors and shapes from sample images of the <a href="http://saifmohammad.com/WebPages/wikiartemotions.html" target="_blank">WikiArt Emotions Dataset</a>. WikiArt Emotions is a dataset of 4,105 pieces of art (mostly paintings) that has annotations for emotions evoked in the observer. The pieces of art were selected from WikiArt.org’s collection for twenty-two categories (impressionism, realism, etc.) from four western styles (Renaissance Art, Post-Renaissance Art, Modern Art, and Contemporary Art). This study has been approved by the NRC Research Ethics Board (NRC-REB) under protocol number 2017–98, Canada.</p>
<p>The emotion images are picked at random to attain diversity in Auria’s work. Additionally, <a href="https://github.com/NVIDIA/FastPhotoStyle/blob/master/TUTORIAL.md" target="_blank">FastPhotoStyle by NVIDIA</a> is used for transferring the styles of the emotion images. Note that existing style transfer algorithms fall into two categories: artistic style transfer and photorealistic style transfer. In artistic style transfer, the goal is to transfer the style of a reference painting to a photo so that the stylized photo looks like a painting and carries the style of the reference painting. In photorealistic style transfer, the goal is to transfer the style of a reference photo to a photo so that the stylized photo preserves the content of the original photo but carries the style of the reference. FastPhotoStyle belongs to the photorealistic category. Images were generated using Google Colab.</p>
<p><img src="/assets/auria/painted.png" alt="image-center" class="align-center" /></p>
<p>The output colored image is scaled up to 1080x1080 using Photoshop to maintain quality.</p>
<p>Sample:</p>
<p><img src="/assets/auria/scaled.jpeg" alt="image-center" class="align-center" /></p>
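<p>The authors did the upscaling in Photoshop, but for reference, the same step can be scripted with Pillow; this is only a sketch, and the file names are hypothetical:</p>

```python
from PIL import Image

def upscale(in_path, out_path, size=(1080, 1080)):
    """Upscale a generated 256x256 image with high-quality Lanczos resampling."""
    img = Image.open(in_path)
    img.resize(size, Image.LANCZOS).save(out_path)

# Hypothetical paths for illustration:
# upscale("painted.png", "scaled.png")
```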
<h2 id="face-of-auria">Face of Auria</h2>
<p>We held on to the idea of artificiality throughout Auria, so the decision was taken to generate an artificial face for her as well. The quest for a generated face ended with <a href="https://github.com/tkarras/progressive_growing_of_gans" target="_blank">Progressively Growing GANs by NVIDIA</a>, one of the most stable training schemes for GANs producing high-resolution output. Here she is :wink:</p>
<p><img src="/assets/auria/auria.png" alt="image-center" class="align-center" /></p>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>We conceived Auria as a flawed, temperamental, amateur artist. She has all those traits in her work and the studio she runs. The only difference is that she is not a physical being.</p>
<p>Added to that, art is all about interpretation. It’s a reflection of the beholder. So here we are, starting a new genre of looking at things, with a few questions in our minds.</p>
<p><code class="language-plaintext highlighter-rouge">Will artistry of algorithms add value to human life?</code><br />
<code class="language-plaintext highlighter-rouge">Can Auria find a space between humans?</code><br />
<code class="language-plaintext highlighter-rouge">Will she bring new meanings to this world without physically existing in it?</code></p>
<p>We’re looking forward to the answers to these questions.</p>
<table>
<tbody>
<tr>
<td><a href="mailto:auriakathi@gmail.com" target="_blank">Email Auria</a></td>
<td> </td>
<td><a href="https://www.instagram.com/auriakathi/" target="_blank">Follow on Instagram</a></td>
<td> </td>
<td><a href="https://twitter.com/AuriaKathi" target="_blank">Follow on Twitter</a></td>
</tr>
</tbody>
</table>sleebapaulWhat is art? Is it the unsaid? The unsettling?Gospel of LSTMs; How I wrote 5th Gospel of Bible using LSTMs2018-09-17T10:00:00+00:002018-09-17T10:00:00+00:00https://sleebapaul.github.io/gospel-of-lstms<p>I was sunkissed by Recurrent Neural Networks (RNNs) once I had joined
<a href="http://perleybrook.com/" target="_blank">Perleybrook Labs</a> in mid of 2017. We were working on
an ambitious sequence to sequence modeling task at that point in time. I must
say, the knowledge curve was steep. MOOCs, blogs, tutorials… you name it, I’ve
done it to keep up the pace. I bumped into a lot of GitHub repos, learned
multiple deep learning frameworks, coded, failed, coded again, failed again and
finally here I’m writing this post on a recent side project of mine using RNNs …</p>
<p>Life is good, ain’t it?</p>
<p><img src="https://media.giphy.com/media/3o6ZteV7P19i45OXsc/giphy.gif" alt="Alt Text" /></p>
<p>Okay, for last three months, I’ve been banging my head on a language modeling
task. I’ve written two decent tutorials on Language Modelling and RNNs in
general, which are the subsidiaries of the trauma and anxiety I endured during
this time period. If you’re not allergic to partial differential equations,
optimization techniques, and Python, then Merry Christmas at
<a href="https://sleebapaul.github.io/rnn-tutorial/" target="_blank">here</a> and
<a href="https://sleebapaul.github.io/rnn-tutorial-2/" target="_blank">here</a>. Have a look and bounce
back.</p>
<p>Others may stay here with me, since I’m not going to write anything geeky in
this post.</p>
<blockquote>
<p>Well, I’m 😕<br /> No, you’re not 😐<br /> No, I’m 😕<br /> Okay, a little bit 😒<br />
YASSS !!! I like it when people are morally flexible 🤩</p>
</blockquote>
<h3 id="language-modelling">Language Modelling</h3>
<p>Language modeling was tough for us when we were around the age of three, maybe?
Generating words or characters <strong>while maintaining the context</strong> is the
whole deal. Yes, this is still hard when it comes to job interviews, academic vivas,
feminism, immigration, child molestation in the Catholic Church and LGBT rights. But
you got the point anyway.</p>
<p>For computers, it was an uphill task. Maintaining the context was tough until RNNs
happened to us. Better go to my tutorial links before you dare to ask how RNNs
do it.</p>
<p>Now the task is simple. Train the RNNs on the desired data, learn the context,
generate new content with the learned context. Holy Moly !!!</p>
<h3 id="data">Data</h3>
<p>Here is the thing. I chose the Gospels of the Bible to train my Long Short-Term Memory
(a variant of RNNs) cells to generate a new machine-generated Gospel. I call it
the “Gospel of LSTMs”. Why did I choose the Gospel data? Read until the end.</p>
<p>Now, let me enlist the challenges I’ve faced during the journey.</p>
<h4 id="challenge-one--aleyamma-joseph">Challenge One — Aleyamma Joseph</h4>
<p>Correct. That’s my mother (translates to Elizabeth Joseph for non-Keralites).
Being a true believer, she wouldn’t allow me to touch the holy scriptures for a
“hobby project”.</p>
<p>Solution: I didn’t tell her. (Since she doesn’t have an active social media
presence, it is less likely that she is going to know about this whole conundrum
in future either.)</p>
<h4 id="challenge-two--data-size">Challenge two — Data size</h4>
<p>Since LSTMs have to learn everything from scratch, the text of just four
gospels didn’t help.</p>
<p>Solution: Thanks to the almost 162 different English translations of the Bible, I
could select the following seven versions to gather a decent amount of data.</p>
<ol>
<li>American Standard (ASV) — 1901</li>
<li>Bible in Basic English (BBE) — 1949</li>
<li>Darby English Bible (DARBY) — 1890</li>
<li>King James Version (KJV) — 1611</li>
<li>Webster’s Bible (WBT) — 1833</li>
<li>World English Bible (WEB) — 2000</li>
<li>Young’s Literal Translation (YLT) — 1862</li>
</ol>
<p>Selection criteria: Easy availability. Thanks.</p>
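<p>The data gathering code isn’t shown in the post, but assuming one plain-text file per translation per gospel (a hypothetical layout; the actual notebooks are in the project repo), merging the versions into one training corpus could look like this:</p>

```python
from pathlib import Path

# Hypothetical layout: data_dir/<version>/<gospel>.txt
# (WEB is excluded, since it was dropped during cleaning)
VERSIONS = ["asv", "bbe", "darby", "kjv", "wbt", "ylt"]
GOSPELS = ["matthew", "mark", "luke", "john"]

def build_corpus(data_dir):
    """Concatenate every translation of every gospel into one corpus string."""
    texts = []
    for version in VERSIONS:
        for gospel in GOSPELS:
            path = Path(data_dir) / version / f"{gospel}.txt"
            texts.append(path.read_text(encoding="utf-8"))
    return "\n".join(texts)
```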
<h4 id="challenge-three--cleaning-the-raw-data">Challenge three — Cleaning the raw data</h4>
<p>Though the data was quite neatly arranged, I had to remove the World English
Bible (WEB) version since the data was too messy to clean. You can find the data
preparation Jupyter notebooks in the project repo.</p>
<h4 id="challenge-four--structured-generation">Challenge four — Structured generation</h4>
<p>Bible is composed in chapters and verses. How to generate them in that format?</p>
<p>Solution: Train the network to learn the format. Simple. In favor, I’ve added
four more tokens to vocabulary.</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">&lt;SOC&gt;</code> — Start of the chapter</li>
<li><code class="language-plaintext highlighter-rouge">&lt;EOC&gt;</code> — End of the chapter</li>
<li><code class="language-plaintext highlighter-rouge">&lt;SOV&gt;</code> — Start of the verse</li>
<li><code class="language-plaintext highlighter-rouge">&lt;EOV&gt;</code> — End of the verse</li>
</ol>
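<p>A minimal sketch of how chapter and verse text might be wrapped with these tokens before training (the exact preprocessing lives in the project repo; this is only an illustration):</p>

```python
SOC, EOC, SOV, EOV = "<SOC>", "<EOC>", "<SOV>", "<EOV>"

def mark_up_chapter(verses):
    """Wrap each verse and the whole chapter in structural tokens,
    so the network can learn where chapters and verses begin and end."""
    body = " ".join(f"{SOV} {verse} {EOV}" for verse in verses)
    return f"{SOC} {body} {EOC}"

print(mark_up_chapter(["In the beginning was the Word", "and the Word was with God"]))
# <SOC> <SOV> In the beginning was the Word <EOV> <SOV> and the Word was with God <EOV> <EOC>
```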
<h4 id="challenge-five--finding-the-best-model">Challenge five — Finding the best model</h4>
<p>It is really tough to land on the perfect model, one which reduces both the
validation and training loss adequately. So I chose a wise and well-adopted
strategy for this.</p>
<p>Strategy: Train as many models as you can, day and night, putting your sleep on
the line. Then choose the best model from them.</p>
<h4 id="challenge-six--issues-with-st-mark">Challenge six — Issues with St. Mark</h4>
<p>With the Gospel of Mark as validation data, my LSTMs struggled to figure out how to
optimize the loss. First I thought it was the wrath of God. Then I approached
the problem pragmatically and found the following reasons.</p>
<ol>
<li>Gospel of Mark is the shortest among the gospels. It has only 16 chapters.</li>
<li>The beginning and ending of the Gospel of Mark are completely different from the other
gospels. He doesn’t start the gospel with a genealogy. He ends the gospel with no
mention of the post-resurrection appearance of Christ to the women on Easter
morning.</li>
<li>Mark treated Jesus as a “Marvel” superhero and kept the focus on his heroic
deeds as an exorcist, a healer, and a miracle worker. He added the accounts of
<a href="https://bible.org/seriespage/healing-deaf-and-dumb-man" target="_blank">healing the deaf and dumb man</a> and
<a href="https://en.wikipedia.org/wiki/Blind_man_of_Bethsaida" target="_blank">the blind man at
Bethsaida</a>, which are
unaccounted for in the other gospels. At the same time, he chucked the virgin birth of Jesus,
and there is no mention of Joseph, husband of Holy Mary.</li>
<li>Bizarre writing patterns. E.g. <a href="https://hermeneutics.stackexchange.com/questions/19821/why-does-mark-use-immediately-so-often/19823" target="_blank">the word <code class="language-plaintext highlighter-rouge">immediately</code> is used no fewer
than 40 times in the entire Gospel, 12 times in a single
chapter.</a></li>
</ol>
<p>Now the whole point is, Mark was different. See the validation and training loss
plot against epochs with the same hyperparameters.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*mnrFthdpkAuqlr8fCGlt3A.jpeg" alt="" /></p>
<p>Don’t confuse the above characteristics with overfitting. The model is not
overfitting; rather, it has a hard time fitting Mark because of the aforementioned
peculiarities. So, I specifically avoided using Mark as validation data.
The best model in the project repo is validated on Matthew (American Standard
Version).</p>
<h4 id="challenge-seven--sampling">Challenge seven — Sampling</h4>
<p>To resemble a Gospel, there should be some baseline of metrics. How many
chapters, how many verses in each chapter, etc.</p>
<p>Solution: Exploratory data analysis (EDA). From EDA, I figured out that a gospel
has, on average, 20–30 chapters, which comes to approximately 25000 words, and
each chapter has an average of 20–60 verses. So, the maximum number
of verses in a generated chapter is set to 60. You may find the bible stats JSON
file and the EDA Jupyter notebook in the project repo.</p>
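<p>Using those EDA limits, the sampling loop can be sketched like this, with a stand-in <code class="language-plaintext highlighter-rouge">generate_verse</code> function in place of actual sampling from the trained LSTM:</p>

```python
import random

# Illustrative limits taken from the EDA described above.
MIN_CHAPTERS, MAX_CHAPTERS = 20, 30
MIN_VERSES, MAX_VERSES_PER_CHAPTER = 20, 60

def sample_gospel(generate_verse, rng=random):
    """Draw a chapter count and per-chapter verse counts within the ranges
    observed in the real gospels, then fill them in with generate_verse
    (a stand-in for sampling from the trained LSTM)."""
    chapters = []
    for _ in range(rng.randint(MIN_CHAPTERS, MAX_CHAPTERS)):
        n_verses = rng.randint(MIN_VERSES, MAX_VERSES_PER_CHAPTER)
        chapters.append([generate_verse() for _ in range(n_verses)])
    return chapters

gospel = sample_gospel(lambda: "and it came to pass ...")
print(len(gospel))  # between 20 and 30 chapters
```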
<h3 id="observations">Observations</h3>
<p>It seems the trained model could learn and mimic the writing pattern of the
Gospels really well. Have a look at these samples.</p>
<h4 id="sample-one--original-matthew_asv-chapter-4-verses-1822">Sample One — Original Matthew_ASV Chapter 4, Verses 18–22</h4>
<blockquote>
<p>18 And walking by the sea of Galilee, he saw two brethren, Simon who is called
Peter, and Andrew his brother, casting a net into the sea; for they were
fishers.</p>
</blockquote>
<blockquote>
<p>19 And he saith unto them, Come ye after me, and I will make you fishers of men.</p>
</blockquote>
<blockquote>
<p>20 And they straightway left the nets, and followed him.</p>
</blockquote>
<blockquote>
<p>21 And going on from thence he saw two other brethren, James the <em>son</em> of
Zebedee, and John his brother, in the boat with Zebedee their father, mending
their nets; and he called them.</p>
</blockquote>
<blockquote>
<p>22 And they straightway left the boat and their father, and followed him.</p>
</blockquote>
<h4 id="sample-one--generated-matthew_asv-chapter-3-verses-59">Sample One — Generated Matthew_ASV Chapter 3, Verses 5–9</h4>
<blockquote>
<p>5 Now as he walked by the sea of Galilee , he saw Simon and Andrew his brother
casting a net into the sea : for they were fishers .</p>
</blockquote>
<blockquote>
<p>6 And Jesus said unto them , Come ye after me , and I will make you to become
fishers of men .</p>
</blockquote>
<blockquote>
<p>7 And straightway they forsook the nets , and followed him .</p>
</blockquote>
<blockquote>
<p>8 And going on from thence he saw two other brethren , James the ‘ son ‘ of
Zebedee , and John his brother , in the ship with Zebedee their father , mending
their nets ; and he called them .</p>
</blockquote>
<blockquote>
<p>9 And they straightway left the ship and their father , and followed him .</p>
</blockquote>
<p>Can you pick up the nuances added by the model? The generated version is not a
copy of the original text. Rather, the model narrates the incident in its own words.
Cool, ain’t it?</p>
<p>But the model is not perfect. It made many factual errors as well. See the
following.</p>
<h4 id="sample-two--generated-matthew_asv-chapter-7-verses-36">Sample two — Generated Matthew_ASV Chapter 7, Verses 3–6</h4>
<blockquote>
<p>3 and he besought him much , saying , My little daughter is at extremity ; [ I
pray ] that thou shouldest come and lay thy hands upon her so that she may be
healed , and may live .</p>
</blockquote>
<blockquote>
<p>4 And he went with him , and a large crowd followed him , and he healed him ,
and said ,</p>
</blockquote>
<blockquote>
<p>5 Lord , if thou wilt , thou art able to cleanse me .</p>
</blockquote>
<blockquote>
<p>6 And he stretched forth his hand , and touched him , saying , I will ; be thou
made clean . And straightway the leprosy departed from him .</p>
</blockquote>
<p>This is an incident described only by <strong>Mark at Chapter 5, Verses 23–43</strong>. Two
points need to be noted here.</p>
<ol>
<li>An incident described only in the Gospel of Mark is added to the Gospel of Matthew. At
a glance, the narration seems so coherent that only a well-acquainted Bible
expert could spot this addition immediately. This is a harmless example, but what
about injections meant to promote a propaganda? Bible versions
are prone to such additions and deletions.</li>
<li>Though the narration started with the tale of the daughter of Jairus, the ruler of the
synagogue, it ended up with Jairus being cured of leprosy, which is completely
disconnected from the context. This is a notable misinterpretation by the model.</li>
</ol>
<h4 id="sample-three--generated-luke_asv-chapter-3-verses-16">Sample three — Generated Luke_ASV Chapter 3, Verses 1–6</h4>
<blockquote>
<p>1 And Jesus, the son of Abraham, the ‘son‘ of Abraham, the ‘son‘ of Abraham,
the ‘son‘ of Terah, the ‘son‘ of Naho ,</p>
</blockquote>
<blockquote>
<p>2 the ‘son‘ of Serug, the ‘son‘ of Amos, the ‘son‘ of Cosam, the ‘son‘
of Nathan, the ‘son‘ of David,</p>
</blockquote>
<blockquote>
<p>3 the ‘son‘ of Jesse , the ‘son‘ of Obed, the ‘son‘ of Obed, the ‘son‘
of Salmon, the ‘son‘ of Nahshon,</p>
</blockquote>
<blockquote>
<p>4 the ‘son‘ of Serug, the ‘son‘ of Reu, the ‘son‘ of Arphaxad, the ‘son‘
of Shem , the ‘son‘ of Noah, the ‘son‘ of Lamech,</p>
</blockquote>
<blockquote>
<p>5 the ‘son‘ of Melea, the ‘son‘ of Enoch, the ‘son’ of Jared, the ‘son‘
of Mahalaleel, the ‘son‘ of Perez, the ‘son‘ of Eber ,</p>
</blockquote>
<blockquote>
<p>6 the ‘son‘ of Cainan, the ‘son’ of Jared, the ‘son‘ of Seth, the ‘son‘
of Adam, the ‘son‘ of God .</p>
</blockquote>
<p>Here, the genealogy is completely disordered and manipulated. But it is not
completely wrong. For example, the father of Abraham is Terah, who is the son of Nahor.
<strong>This is the greatest danger. Half truths.</strong></p>
<p>I’ve added more generated samples in the project repo. You may notice more
interesting chunks if you go through them.</p>
<h4 id="so-you-are-telling-us-that-your-trained-model-is-the-best">So, you are telling us that your trained model is the best?</h4>
<p><img src="https://media.giphy.com/media/TEF6Ezv9hWKc0/giphy.gif" alt="Alt Text" /></p>
<p>Nope. It is not. The main flaw is the chronological order. Though the model narrates
the incidents well, it does so in random order, so the continuity of reading is
lost. How to improve it?</p>
<blockquote>
<p>Well… I’m working on it … 🤔</p>
</blockquote>
<h3 id="finally-why-i-chose-the-gospels-as-data">Finally, Why I chose the Gospels as data?</h3>
<ol>
<li>If the Bible can be generated and manipulated by a mathematical algorithm, then
surely it can be exploited by humans. Unlike LSTMs, humans are gifted with a
neocortex, which makes us a million times more creative than any mathematical algorithm.
The Gospel of LSTMs helps drive home the essence of this argument.</li>
<li>Since the holy scriptures can be manipulated and interpreted like any other
literary work, they can be used for promoting propaganda. Each Bible version
narrates with different words and interprets those words in its preaching. Many
of these words could be poor translations of the original version and are
misleading at times. You can read more about <a href="https://en.wikipedia.org/wiki/Bible_errata">Bible Errata
here</a>. This project is a pinch to
people who blindly follow the verses word for word in the vulnerable holy book.</li>
<li>To give a tight slap on the bum of people who use holy book interpretations
as an excuse for personal benefit, to spread hatred, and to encourage violence. If a
mathematical algorithm can generate a scripture artificially with its own
interpretations, don’t place that scripture above humanity.</li>
</ol>
<p>I’m a Keralite. Last month, our state faced its fiercest flood since 1924. 350+
people died. The audited loss is 25000 crores. Around 10 lakh people were in rescue
camps. Now, we are fighting together to restore normal life in our beautiful
state. Before the floods, two incidents happened. One, a Catholic bishop was
charged with sexually abusing a nun. Two, a motion was filed against the norm
that prohibits the entry of women to Sabarimala, the well-known pilgrimage centre in
our state. Women are supposed to pollute the holiness of the place, the orthodox
say. During the flood, some extremists spread propaganda that the calamity was the
result of the gods being angered by the above-mentioned incidents. They quoted these
scriptures in favour of it. The project is dedicated to those hatemongers who
wanted to segregate humans by religion and gender at the time of a calamity.</p>
<p>Language modelling is applicable to any holy book, not just the Bible. I didn’t
try the Gita or the Quran or any other scriptures, since I couldn’t find a convenient
data source. That’s it.</p>
<h4 id="where-the-hell-is-the-link-to-project-repo-repeatedly-mentioned-in-the-post">Where the hell is the link to project repo repeatedly mentioned in the post?</h4>
<p>GitHub repository link:
<a href="https://github.com/sleebapaul/gospel_of_rnn.git">https://github.com/sleebapaul/gospel_of_rnn.git</a></p>
<h4 id="can-i-explicitly-get-the-links-to-those-tutorials-youve-written">Can I explicitly get the links to those tutorials you’ve written?</h4>
<p>YASSS!!!</p>
<p><img src="https://media.giphy.com/media/l0MYDGA3Du1hBR4xG/giphy.gif" alt="Alt Text" /></p>
<p>Tutorial 1:
<a href="https://sleebapaul.github.io/rnn-tutorial/">https://sleebapaul.github.io/rnn-tutorial/</a></p>
<p>Tutorial 2:
<a href="https://sleebapaul.github.io/rnn-tutorial-2/">https://sleebapaul.github.io/rnn-tutorial-2/</a></p>
<h3 id="shameless-plug">Shameless Plug</h3>
<p>I’ve submitted a talk idea to PyCon 2018 on the same project. If you think this
is worth it, give a thumbs up at the following link. It matters :)</p>
<p>PyCon proposal link:
<a href="https://in.pycon.org/cfp/2018/proposals/gospel-of-lstm-how-i-wrote-5th-gospel-of-bible-using-lstms~elLMe/">https://in.pycon.org/cfp/2018/proposals/gospel-of-lstm-how-i-wrote-5th-gospel-of-bible-using-lstms~elLMe/</a></p>
<p>Edit:</p>
<p>AAAGHHH !! They rejected my proposal. So don’t waste your time.</p>sleebapaulI was sunkissed by Recurrent Neural Networks (RNNs) once I had joined Perleybrook Labs in mid of 2017. We were working on an ambitious sequence to sequence modeling task at that point in time. I must say, the knowledge curve was steep. MOOCs, blogs, tutorials… you name it, I’ve done it to keep up the pace. I bumped into a lot of GitHub repos, learned multiple deep learning frameworks, coded, failed, coded again, failed again and finally here I’m writing this post on a recent side project of mine using RNNs …Language Modelling using Recurrent Neural Networks (Part-2)2018-08-12T12:15:00+00:002018-08-12T12:15:00+00:00https://sleebapaul.github.io/rnn-tutorial-2<h3 id="disclaimer">Disclaimer</h3>
<p>The audience is expected to have the basic understanding of Neural Networks, Backpropagation, Vanishing Gradients, and ConvNets. Familiarization of PyTorch is appreciated too, as the programming session will be on it.</p>
<h1 id="introduction">Introduction</h1>
<p>In the previous post, we briefly discussed why CNNs are not capable of extracting sequence relationships. The fundamental reason for that failure is the assumption of independence between training examples. Say, in an image classification problem, each image is a data point and is treated as an independent example.</p>
<p>But in use cases like processing frames from a video, snippets of audio, and language, the assumption of independent examples fails. Here, data points are related in time, and such tasks are thus named sequence problems. We need a new architecture which can capture the temporal relationships in sequence problems.</p>
<p>In this post, let’s unfold language modeling, a well-defined sequence problem, and then dive deep into RNNs, which can solve it.</p>
<h1 id="language-modelling">Language Modelling</h1>
<p>In simple words, language modeling is generating the next <strong>token</strong> (it can be a character or a word) in accordance with the previous tokens (again, words or characters).</p>
<p>In the previous post, we’ve seen many examples of it.</p>
<ol>
<li>Sleeba is native of Kerala. He can fluently speak _____.</li>
<li>I bought my poodle from Paris. He barks _____.</li>
<li>I bought my poodle from Paris when I was staying with Sleeba. He has this trait of nodding while having food. But he loves __.</li>
</ol>
<p>As you see here, the past context is really important for predicting the next token. If we miss that, the entire sequence/sentence becomes meaningless. So the problem is defined. How do we represent it in mathematical terms?</p>
<p><img src="/assets/rnn_gospel_two/language_model_eqn.png" alt="image-center" class="align-center" /></p>
<p>Don’t panic. We can explain that 😁</p>
<p>It is as simple as,</p>
<p><code class="language-plaintext highlighter-rouge">P(Is Trump linked with Stormy Daniels?) =</code><br />
<code class="language-plaintext highlighter-rouge">P(Is) x</code><br />
<code class="language-plaintext highlighter-rouge">P(Trump | Is) x </code><br />
<code class="language-plaintext highlighter-rouge">P(linked | Is, Trump ) x</code><br />
<code class="language-plaintext highlighter-rouge">P(with | Is, Trump, linked) x</code><br />
<code class="language-plaintext highlighter-rouge">P(Stormy | Is, Trump, linked, with) x </code><br />
<code class="language-plaintext highlighter-rouge">P (Daniels | Is, Trump, linked, with, Stormy) x </code><br />
<code class="language-plaintext highlighter-rouge">P (? | Is, Trump, linked, with, Stormy, Daniels)</code></p>
<p>As we can see, the probability of each word is conditioned on previous words, thus the context is well maintained throughout the sentence. So if we can learn these past probability distributions, then we can generate the next word accordingly.</p>
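<p>In code, that factorization is just a running product of conditional probabilities. The numbers below are made up purely for illustration:</p>

```python
import math

# Toy conditional probabilities P(w_t | w_1 .. w_{t-1}) for a 7-token
# sentence; invented values, one per token in order.
cond_probs = [0.10, 0.02, 0.30, 0.60, 0.01, 0.80, 0.90]

# P(sentence) = product of all the conditional probabilities
sentence_prob = math.prod(cond_probs)
print(sentence_prob)  # a very small number, as sentence probabilities tend to be
```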
<blockquote>
<p>This has become so weird. That equation takes after an iguana 😑</p>
</blockquote>
<blockquote>
<p>Really? 😆</p>
</blockquote>
<blockquote>
<p>Conditional probabilities, a lot of mathematical mess… But these tasks are as easy as pie for us 😐</p>
</blockquote>
<blockquote>
<p>That’s the thing about homo sapiens. We are cool in many ways 😎</p>
</blockquote>
<p>We can do language modeling at two levels:</p>
<ul>
<li>
<p><em>Character level</em> which generates a character at a time.</p>
</li>
<li>
<p><em>Word level</em> which generates a word at a time.</p>
</li>
</ul>
<h3 id="dictionary">Dictionary</h3>
<p>A language model will always have a <strong>dictionary</strong>, which is a collection of all the tokens that will be used in that model. For example, the dictionary of a character-level language model of English will be the collection of 26 small letters, 26 capital letters, space, and the special characters of English. The dictionary size is small; still, we can generate the entire English vocabulary from it.</p>
<p>What about the word-level representation? The dictionary will contain all the words present in our training data. We can’t expect the model to generate a new word which is not in the vocabulary. At the same time, a word-level model produces less gibberish than a character model, since a word is the basic meaningful building block of a sentence.</p>
<h3 id="simplest-language-model">Simplest language model</h3>
<p>The simplest language modeling is randomly picking each character from its dictionary. The below code snippet depicts the simple character level language model. You may try a word level language model by yourself.</p>
<script src="https://gist.github.com/sleebapaul/fa0a29a7acd6d6f85f2e4ee9d51d1156.js"></script>
<p>Generated Sentence is,</p>
<p><code class="language-plaintext highlighter-rouge">W>X^Spz,wGOr(!C?uac-DqXvX_b^lv/S^p~cs)NjKWz;O+j"cZn jwZRK=I(xdD>tjgjF[BTc.mii`l<b/x#/,}(lIn\t":Ij&im</code></p>
<p>Though this approach is simple to implement, we fail to maintain the context. The probability of every character generated is the same ($\frac{1}{Dictionary\ size}$). It is not conditioned on previous inputs. Thus the generated text is gibberish.</p>
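<p>For reference, the random character model in the gist boils down to a few lines like these (the dictionary here is illustrative):</p>

```python
import random
import string

# Dictionary: letters, digits, punctuation and space -- every character is
# equally likely, so the context is ignored entirely.
dictionary = string.ascii_letters + string.digits + string.punctuation + " "

def random_sentence(length=100, rng=random):
    return "".join(rng.choice(dictionary) for _ in range(length))

print(random_sentence())  # prints 100 characters of context-free gibberish
```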
<p>Can we do better?</p>
<h1 id="recurrent-neural-networks-rnns">Recurrent Neural Networks (RNNs)</h1>
<p>RNNs are entirely different from usual neural networks when it comes to architecture. If CNNs are best for spatially distributed data, RNNs are specially designed for processing sequential data. They are not a new topic in the Deep Learning history either. In 1982, John Hopfield published a <a href="http://www.its.caltech.edu/~bi250c/papers/Hopfield-1982.pdf" target="_blank">paper</a> in the context of cognitive science and computational neuroscience which contained the idea of RNNs. From that paper to <a href="https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html" target="_blank">Google Duplex</a>, which can take an appointment for us by building a reliable conversation with a barber, we’ve traversed a lot in this domain.</p>
<p>Let’s jump into the notations in order to understand the architecture of RNNs.</p>
<h3 id="notations">Notations</h3>
<p>Imagine an input sentence/sequence $x$ goes into a black box named RNN and we get an output word $y$ for our language modeling problem.</p>
<p><img src="/assets/rnn_gospel_two/blackbox_rnn.svg" alt="image-center" class="align-center" /></p>
<p>Using temporal terminology, an input sequence consists of data points $x^t$ that arrive in a discrete sequence of time steps indexed by $t$.</p>
<blockquote>
<p>I don’t know why plain English is not a criterion, when it comes to describing something scientific or mathematical 😖</p>
</blockquote>
<blockquote>
<p>Let me help you 🤣</p>
</blockquote>
<p>Actually, this is a simple concept. For example, in the word level language modeling, sequence</p>
<p align="center"><b>
I support LGBT rights
</b></p>
<p>$x^1$ = $I$, $x^2$ = $support$ and so on.</p>
<p>While in character level,</p>
<p>$x^1$ = $I$, $x^2$ = $'\ '$, $x^3$ = $s$, $x^4$ = $u$ etc.</p>
<p>If you’ve noticed, we don’t need <code class="language-plaintext highlighter-rouge">space</code> token for word-level modeling since words are obviously separated by space.</p>
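<p>In code, the two tokenizations of the example sentence look like this:</p>

```python
sentence = "I support LGBT rights"

# Word-level tokens: x^1, x^2, ... are the words.
word_tokens = sentence.split()
print(word_tokens)  # ['I', 'support', 'LGBT', 'rights']

# Character-level tokens: spaces are tokens too, since the model
# must learn where words end.
char_tokens = list(sentence)
print(char_tokens[:4])  # ['I', ' ', 's', 'u']
```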
<p>Next question is, how to represent this word/character mathematically? There are many ways to do that and it is worth another blog post. Here we will briefly discuss two types.</p>
<h4 id="one-hot-vectors-sparse-representation">One hot vectors (Sparse representation)</h4>
<p>Imagine your dictionary is [<code class="language-plaintext highlighter-rouge">rights, human, LGBT, equality, I, support, a</code>]. The vocabulary size is 7. Now, let $x$ be the input sentence <code class="language-plaintext highlighter-rouge">I support LGBT rights</code>; then the representation of <code class="language-plaintext highlighter-rouge">support</code> in the sentence will be,</p>
<p>$x^2\ =\ \begin{bmatrix}0 & 0 & 0 & 0 & 0 & 1 & 0\end{bmatrix}$</p>
<p>What if the vocabulary size is $100000$? Then $x^2$ will be a $(1$ x $100000)$ matrix with a single $1$ and a hell lot of zeroes. That’s why it is called a sparse representation. As the dictionary size increases, the computational and memory cost of the one-hot representation increases. But most importantly, it misses the relationships between words, which is less intuitive. Read more about one-hot encoding <a href="https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f" target="_blank">here</a>.</p>
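<p>A one-hot vector for the toy dictionary above can be built in a couple of lines:</p>

```python
dictionary = ["rights", "human", "LGBT", "equality", "I", "support", "a"]

def one_hot(word):
    """Sparse representation: a 1 at the word's dictionary index, 0 elsewhere."""
    vec = [0] * len(dictionary)
    vec[dictionary.index(word)] = 1
    return vec

print(one_hot("support"))  # [0, 0, 0, 0, 0, 1, 0]
```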
<h4 id="word-embeddings-dense-representation">Word embeddings (Dense representation)</h4>
<p>Word embeddings cover up all the pitfalls of one-hot vectors. They are learned using unsupervised methods like autoencoders.</p>
<p>Firstly, the representation vector size doesn’t grow with the vocabulary size. For almost all embedding algorithms, the vector dimension is fixed.</p>
<p>For example, <a href="https://nlp.stanford.edu/projects/glove/" target="_blank">GloVe: Global Vectors for Word Representation by Jeffrey Pennington, Richard Socher and Christopher D. Manning from Stanford</a>, converts the word “LGBT” to a 300 dimension vector. In a 300 dimension space, <code class="language-plaintext highlighter-rouge">LGBT</code> vector will be close to the 300 dimension vector of <code class="language-plaintext highlighter-rouge">human</code> to represent the relationship that the LGBT community is also the part of homo sapiens.</p>
<p>At the end of the day, for example, the sample dense vector of <code class="language-plaintext highlighter-rouge">LGBT</code> will be,</p>
<p>$x^3$=$\begin{bmatrix}0.2218 & 0.3812 & 0.8845 & …\end{bmatrix}$</p>
<p>Secondly, they are not just ones and zeros but decimals representing the relationships a word has with other words. Thus they are dense representations.
Using these word embeddings, interesting relationships can be learned, like <code class="language-plaintext highlighter-rouge">King</code> - <code class="language-plaintext highlighter-rouge">Man</code> + <code class="language-plaintext highlighter-rouge">Woman</code> = <code class="language-plaintext highlighter-rouge">Queen</code>, or <code class="language-plaintext highlighter-rouge">India</code> is to <code class="language-plaintext highlighter-rouge">New Delhi</code> as <code class="language-plaintext highlighter-rouge">Indonesia</code> is to <code class="language-plaintext highlighter-rouge">Jakarta</code>, etc.</p>
<p>I’m not going to explain how these decimals are generated. But from the above explanation, if you get the gut feeling that the black box RNN will be able to map the context more intuitively with word embeddings than with one-hot vectors, then that’s enough for this tutorial 😁 But I highly recommend reading about embeddings <a href="https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795" target="_blank">here</a>, <a href="https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12" target="_blank">here</a>, <a href="https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c" target="_blank">here</a>, and <a href="https://medium.com/swlh/playing-with-word-vectors-308ab2faa519" target="_blank">here</a> since wisdom is sexy.</p>
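<p>To make the King - Man + Woman = Queen arithmetic concrete, here is a toy example with invented 3-dimensional vectors; real embeddings like GloVe use far more dimensions, and these numbers are made up purely to illustrate the mechanics:</p>

```python
import math

# Invented 3-dimensional "embeddings", chosen so the analogy works out.
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman should land near queen
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max(vecs, key=lambda word: cosine(vecs[word], target))
print(best)  # queen
```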
<h2 id="architecture">Architecture</h2>
<p>Usually, when you search for an RNN tutorial, you get the image given below. This is largely inspired by Christopher Olah’s famous blog post on <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/" target="_blank">Understanding LSTM Networks</a>.</p>
<p align="center">
<img src="/assets/rnn_gospel_two/rolled_rnn.png" />
</p>
<p>A single RNN unit recursing into itself. But this representation raises two fundamental confusions.</p>
<ul>
<li>It is a cyclic graph representation</li>
</ul>
<p>All the feed-forward neural network graphs we’ve seen so far are acyclic, and so are the CNN structures. Particularly, they are directed acyclic graphs (DAGs). They start from the input nodes and reach the output nodes without loops.</p>
<p><strong>Why is this important?</strong></p>
<p>The mighty <a href="https://shapeofdata.wordpress.com/2016/04/27/rolling-and-unrolling-rnns/" target="_blank">back propagation is defined only for acyclic graphs</a>. So we can’t apply our learning algorithm in this RNN representation. Thus we move to the unrolled architecture.</p>
<p><img src="/assets/rnn_gospel_two/unrolled_rnns.png" alt="image-center" class="align-center" /></p>
<p>Or more intuitively,</p>
<p><img src="/assets/rnn_gospel_two/rnn_sequence.svg" alt="image-center" class="align-center" /></p>
<p>Here, the single cell that recursed into itself is unrolled into a sequence of cells. This is an acyclic architecture. Now backpropagation is possible, and it has a fancy name too: Back Propagation Through Time (BPTT).</p>
<p>But there is the second issue.</p>
<ul>
<li>We’ve learned all the neural network architectures in <strong>neuron</strong> level, but this is a <strong>cell</strong> level explanation.</li>
</ul>
<p>In this post, I’m planning a neuron level explanation of an RNN cell. To start with, let’s define the equations of an RNN cell.</p>
<h4 id="equations">Equations</h4>
<p>Let’s begin with $h^{[t-1]}$, the hidden state input from the previous RNN cell to the current cell (refer to the diagram above).</p>
<ul>
<li>$h^{[t]}\ =\ F(W_{hh} \cdot h^{[t-1]}\ +\ W_{xh} \cdot x^{[t]} + b_h)$</li>
</ul>
<p>A hidden state? 🤷 Why do we need a state now? We never had a state vector for CNNs. Why now?</p>
<p>Remember we talked about maintaining the context of a sentence? For this purpose, we need information from the past. $h^{[t-1]}$ is our guy who carries the baton of the past. I love to call $h^{[t-1]}$ the context vector rather than a hidden state vector since it carries the past context of the sequence.</p>
<p>Now let me show you the beauty of RNN architecture with this equation.</p>
<p><strong>We feed context vector ($h^{[t-1]}$) from past and the current input $x^{[t]}$ to an RNN cell to get the future ($h^{[t]}$) of the sequence.</strong></p>
<p>Perfect, ain’t it? 💯</p>
<p>Now have a closer look at the equation. We’ve two weights, $W_{hh}$ and $W_{xh}$, which are going to be adjusted or <strong>learned</strong> while we train these cells with examples. What are these weights going to learn?</p>
<p>$W_{hh}$ will learn what to remember or forget from the past $h^{[t-1]}$. $W_{xh}$ learns the contribution of the current input $x^{[t]}$. Together, $W_{hh}$ and $W_{xh}$ build our new context vector $h^{[t]}$, which has information from the past and the present. We’re going to use this new context vector for two things.</p>
<p>Let’s go to the next equation for the first application.</p>
<ul>
<li>$y^{[t]}\ =\ W_{hy} \cdot h^{[t]}\ +\ b_y$</li>
</ul>
<p>We’re generating an immediate output $y^{[t]}$ using the current context $h^{[t]}$, where $W_{hy}$ learns about creating an output from the current context. Note that this is a completely optional decision and depends on the application. The model we are discussing now is the rightmost <code class="language-plaintext highlighter-rouge">(Many to Many, synced)</code> model, ideal for language modeling.</p>
<p>Have a look at the figure given below, which depicts different architectures using RNNs for sentiment analysis, machine translation, photo description etc.</p>
<p><img src="http://karpathy.github.io/assets/rnn/diags.jpeg" alt="image-center" class="align-center" /></p>
<p>(This picture, as well as many key ideas, are taken from the bible of blog posts on RNNs, <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/" target="_blank">The Unreasonable Effectiveness of Recurrent Neural Networks</a> by Andrej Karpathy ❤️)</p>
<p>One more important thing to mention is the RNN network architecture for language modeling. Here, the prediction of the previous cell $y^{[t-1]}$ is passed as the input $x^{[t]}$ to the next cell. Have a brief look at these variables in the above figure. Why such an implementation?</p>
<p>It is simply because we are dealing with a continuous stream of text. The previous word is, lucidly, our current input.</p>
<p>Again, this is not the universal network architecture. There are variations like <a href="https://cedar.buffalo.edu/~srihari/CSE676/10.2.1%20TeacherForcing.pdf" target="_blank">teacher forcing</a>, which don’t follow this pattern. On that note, let’s move to the third equation.</p>
<ul>
<li>$o^{[t]}\ =\ softmax(y^{[t]})$</li>
</ul>
<p>The third equation is not really a part of the architecture. The <code class="language-plaintext highlighter-rouge">softmax</code> layer will convert the output of the RNN into a probability distribution. In language modeling, it will be a distribution over the vocabulary. An ideal network will predict the maximum probability for the word or character which is most likely to come next, considering the past context.</p>
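As a quick sketch, with toy logits rather than the output of a trained model, the softmax step looks like this:

```python
import numpy as np

def softmax(y):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(y - y.max())
    return e / e.sum()

# Raw RNN outputs (logits) over a toy 3-word vocabulary.
y_t = np.array([2.0, 1.0, 0.1])
o_t = softmax(y_t)
# o_t sums to 1; the word with the largest logit gets the largest probability.
```

Sampling or taking the argmax of `o_t` then picks the predicted next word.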
<p>The second application of the context vector is to pass the baton to the next time step.</p>
<blockquote>
<p>Wait, what is F?</p>
</blockquote>
<blockquote>
<p>Oh! I forgot about that. F is our good old non-linear activation function from feedforward networks. Like tanh, ReLU etc.</p>
</blockquote>
<blockquote>
<p>You’re careless 😏</p>
</blockquote>
<blockquote>
<p>Oh! Come on 🙄</p>
</blockquote>
<h2 id="neuron-level-representation">Neuron Level representation</h2>
<p>We’ve seen the insanely intuitive equations of an RNN cell. But let me unveil these cells at a neuron level so that you may understand them thoroughly.</p>
<p>Let’s go back to equation 1. Consider non-linearity $F$ as <code class="language-plaintext highlighter-rouge">tanh</code>.</p>
<ul>
<li>$h^{[t]}\ =\ tanh(W_{hh} \cdot h^{[t-1]}\ +\ W_{xh} \cdot x^{[t]} + b_h)$</li>
</ul>
<p>Let’s break this equation down,</p>
<p>What is $W_{hh} \cdot h^{[t-1]}$ ? It’s a linear transformation or a simple neural network layer we’ve seen in many normal neural networks.</p>
<p><img src="/assets/rnn_gospel_two/rnn_detailed_2.svg" alt="image-center" class="align-center" /></p>
<p>What about $W_{xh} \cdot x^{[t]}$? Another linear layer.</p>
<p><img src="/assets/rnn_gospel_two/rrn_detailed_1.svg" alt="image-center" class="align-center" /></p>
<p>Now, you must have got the idea of the second equation too. Let me unveil the neuron level RNN cell to you. Brace yourself, ladies and gentlemen… Pooooffff !!! 🌟✨✨⚡</p>
<p><img src="/assets/rnn_gospel_two/rnn_detailed_3.svg" alt="image-center" class="align-center" /></p>
<p>The current input and the past context vector are linearly transformed and summed together. This sum is then passed to a $tanh$ to inject non-linearity/activation. Thus the current context vector is generated, which is used for generating an optional output and for passing the context to the next time step. (I didn’t add a bias member explicitly, but you got the idea anyway 😊)</p>
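The neuron-level picture translates directly into code. Here is a minimal numpy sketch of one forward step, with toy sizes of my own choosing (5-dimensional input, hidden size 3, as in the diagram) and the biases included this time:

```python
import numpy as np

def rnn_cell(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One forward step of the cell above, with F = tanh."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # new context vector
    y_t = W_hy @ h_t + b_y                           # optional output
    return h_t, y_t

# Toy, randomly initialized weights for illustration.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 5))
W_hh = rng.normal(size=(3, 3))
W_hy = rng.normal(size=(5, 3))
b_h, b_y = np.zeros(3), np.zeros(5)

h_t, y_t = rnn_cell(rng.normal(size=5), np.zeros(3), W_xh, W_hh, W_hy, b_h, b_y)
```

Each component of `h_t` is one of the three hidden neurons in the diagram, squashed into $(-1, 1)$ by the $tanh$.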
<blockquote>
<p>I think my diagram is pretty self-explanatory.</p>
</blockquote>
<blockquote>
<p>But it is a little big. Christopher Olah chose the simpler diagram for a reason 🤔</p>
</blockquote>
<blockquote>
<p>Agreed. I’ll confine it to the following.</p>
</blockquote>
<p><img src="/assets/rnn_gospel_two/rnn_block.svg" alt="image-center" class="align-center" /></p>
<blockquote>
<p>AAAAAAAGHHH 🤩🤩🤩 Finally I explained RNN in neuron level. I’m feeling good. Really really good. I can die in peace now.</p>
</blockquote>
<blockquote>
<p>Okay, die well 😜</p>
</blockquote>
<blockquote>
<p>Nope… Actually, I was overreacting… a bit 😬</p>
</blockquote>
<blockquote>
<p>A bit?</p>
</blockquote>
<blockquote>
<p>😀</p>
</blockquote>
<p><strong>Note</strong></p>
<p><a href="https://www.quora.com/What-is-a-sequence-length-of-the-RNN-If-I-use-a-sequence-length-of-1-is-that-a-problem-What-does-it-means" target="_blank">People always have confusion</a> when the hyperparameter <code class="language-plaintext highlighter-rouge">hidden size</code> and <code class="language-plaintext highlighter-rouge">sequence length</code> of RNNs are discussed. Now from the neuron level diagram, we can clearly understand that <code class="language-plaintext highlighter-rouge">hidden size</code> is the number of neurons present in the hidden layer. Here it is three.</p>
<p><code class="language-plaintext highlighter-rouge">output size</code> is sometimes referred to as the <code class="language-plaintext highlighter-rouge">hidden size</code>, as an RNN has two outputs, i.e. $y$ and $h^{[t]}$. As $y$ is optional, sometimes we treat $h$ as the output.</p>
<p>For good models, provided we’ve enough data (say, 100 million characters), we can afford to use a large <code class="language-plaintext highlighter-rouge">hidden size</code>. For small data samples (< 10 million characters), use small <code class="language-plaintext highlighter-rouge">hidden size</code> values.</p>
<p><code class="language-plaintext highlighter-rouge">sequence length</code> is the <strong>number of RNN cells</strong> unrolled/considered at once, or simply the number of time steps.</p>
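To keep the two hyperparameters apart, here is a small sketch with arbitrary toy sizes: <code>hidden size</code> fixes the width of each state vector, while <code>sequence length</code> fixes how many times the same cell is applied.

```python
import numpy as np

hidden_size, input_size, sequence_length = 3, 5, 4   # illustrative values

rng = np.random.default_rng(0)
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_xh = rng.normal(size=(hidden_size, input_size))

h = np.zeros(hidden_size)          # initial context vector
states = []
for t in range(sequence_length):   # one loop iteration per unrolled cell
    x_t = rng.normal(size=input_size)
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    states.append(h)

# len(states) == sequence_length; each state has hidden_size components.
```

Changing `sequence_length` only changes how many loop iterations run; the weight matrices, and hence the model size, are untouched.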
<p>Now we’ll discuss two most important concepts of RNNs and wrap up the talk and jump to code since,</p>
<blockquote>
<p>Talk is cheap. Show me the code - Linus Torvalds</p>
</blockquote>
<h2 id="parameter-sharing">Parameter Sharing</h2>
<p>Let’s talk about the parameter sharing of Neural Networks in general and understand the parameter sharing of RNN in a comparative level.</p>
<p>There is no parameter sharing in normal feed-forward networks. Every layer has a bunch of individual weights and biases, and they are learned/updated during backpropagation.</p>
<p>But when it comes to CNNs, we dramatically reduce the number of parameters using filters. The number of parameters is defined by the size and depth of the filter. Combinations of these parameters are used for representing all the patterns. Thus we can say that these parameters are reused, or <code class="language-plaintext highlighter-rouge">shared</code>, to represent different patterns. Sharing helps to reduce the number of parameters. The same idea is used in RNNs too, but in a slightly different way.</p>
<p>CNNs share parameters for representing spatial features. RNNs do it through time to capture temporal features. This is implemented by using the same weights at every time step.</p>
<blockquote>
<p>What does that mean? 🤔</p>
</blockquote>
<blockquote>
<p>Let me explain the training of RNNs 😊</p>
</blockquote>
<h3 id="training-of-rnns">Training of RNNs</h3>
<p>First of all, the number of cells and the number of timesteps are the same. Don’t confuse them with layers. Multi-layer RNNs look something like this.</p>
<p><img src="/assets/rnn_gospel_two/rnn_layers.png" alt="image-center" class="align-center" /></p>
<p>Now, for example, we’ve 3 unrolled RNN cells in a layer, i.e. 3 time steps.</p>
<p><img src="/assets/rnn_gospel_two/language_model_rnn.svg" alt="image-center" class="align-center" /></p>
<p>In the first cell, we feed our context vector $h^0$ and the current input word $x^0$. The initial context vector $h^0$ will be a zero vector, as we don’t have any context from the past yet. Though we’ve three cells unrolled, they don’t have individual weights. Instead, they share the same weights $W_{hh}$, $W_{xh}$ and $W_{hy}$.</p>
<blockquote>
<p>Okay. So there are three cells, but only a single set of weights 🤔</p>
</blockquote>
<blockquote>
<p>Yes. That’s right.</p>
</blockquote>
<blockquote>
<p>But how is it implemented?</p>
</blockquote>
<p>In this particular case of language modeling, we need to predict the next word/character at each time step as shown in the figure. That means we need to update the weights at each cell. Implementing the idea of parameter sharing here means we update the same weights at every cell during backpropagation. Here, during a backpropagation, there will be three updates to the weights. (Though this is the theory, in practice we don’t update the weights after each step, to maintain stability. Usually, the losses of each step are added, and at the end of the sequence we backpropagate once with respect to the accumulated loss.)</p>
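One way to see what sharing means in code is a sketch, with toy sizes of my own choosing, where every unrolled cell reads from the same parameter set, so the parameter count never depends on the sequence length.

```python
import numpy as np

hidden_size, vocab_size = 4, 10   # toy sizes for illustration

# A single set of weights, shared by every unrolled cell.
params = {
    "W_hh": np.zeros((hidden_size, hidden_size)),
    "W_xh": np.zeros((hidden_size, vocab_size)),
    "W_hy": np.zeros((vocab_size, hidden_size)),
}

def n_parameters(params):
    return sum(p.size for p in params.values())

# Whether we unroll 3 cells or 300, every cell uses (and contributes
# gradients to) the same `params`, so the count below never changes.
count = n_parameters(params)   # 4*4 + 4*10 + 10*4 = 96
```

This is exactly why the accumulated-loss trick works: all time steps write their gradients into the same three matrices before the single update.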
<p>Okay, we understood the idea of parameter sharing. Why is it important? Yes, indeed it will reduce the parameters, but more than that, there is another bonus for a sequence problem.</p>
<p>Imagine the following sentences.</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">Kids are lovely.</code></li>
<li><code class="language-plaintext highlighter-rouge">Kids of Jessie are lovely.</code></li>
</ol>
<p>Here, <code class="language-plaintext highlighter-rouge">kids</code> is plural and should be followed by a plural verb like <code class="language-plaintext highlighter-rouge">are</code>. Note that the position of <code class="language-plaintext highlighter-rouge">kids</code> or <code class="language-plaintext highlighter-rouge">are</code> doesn’t matter for such a relationship. If we train a feed-forward network to learn this relationship, we would need parameters to be learned at every position of the input sentence. In that case, every relationship would have to be learned at every position, which is not practical. When we share parameters across the parts of the sentence, the model becomes position- and length-agnostic, thereby generalizing the relationships well.</p>
<h2 id="backpropagation-through-time-bptt">Backpropagation Through Time (BPTT)</h2>
<p>Training RNNs is not a piece of cake. The villain is the very concept of using RNNs: the dependency of the current output on past inputs. To elaborate on this issue, first we define the loss of our problem.</p>
<p>As I mentioned above, we predict at each cell. This prediction is compared with the original label. A loss is created here. Usual stuff, isn’t it? But this story is for a single time step. What if we’ve <code class="language-plaintext highlighter-rouge">t</code> such steps? Let’s take 3 time steps as above.</p>
<p>$Loss,\ L\ =\ L_{1}\ +\ L_{2}\ +\ L_{3}$</p>
<p>The worry begins here. In RNNs, since the parameters are shared, if we need to find a gradient at a time step, then we need to sum up all the gradients from all past time steps.</p>
<p>Let’s bring back the equations of RNN cells. For simplicity, I’m naming $W_{hh}$ as $W$, $W_{xh}$ as $V$ and $W_{hy}$ as $U$.</p>
<ul>
<li>
<p>$h^{[t]}\ =\ F(W \cdot h^{[t-1]}\ +\ V \cdot x^{[t]} + b_h)$</p>
</li>
<li>
<p>$y^{[t]}\ =\ U \cdot h^{[t]}\ +\ b_y$</p>
</li>
</ul>
<p>Here we need to calculate five gradients of the loss with respect to the learnable parameters. They are,</p>
<p>$\frac{\partial L}{\partial U}$, $\frac{\partial L}{\partial V}$, $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial b_y}$ and $\frac{\partial L}{\partial b_h}$.</p>
<p>Consider the $\frac{\partial L}{\partial U}$ first. Let the number of time steps be $T$.</p>
<p>$\frac{\partial L}{\partial U}\ = \sum_{t=1}^{T} \frac{\partial L_t}{\partial U}$</p>
<p>By chain rule,</p>
<p>$\frac{\partial L_t}{\partial U}\ =\ \frac{\partial L_t}{\partial y^{[t]}}\ *\ \frac{\partial y^{[t]}}{\partial U}$</p>
<p>$\frac{\partial y^{[t]}}{\partial U}$ can be easily found out using our second equation. We’re good since there is only one dependency for $U$ in it. This is for a single layer. You may need to traverse through layers if multiple layers are involved 😊</p>
<p>Now, let’s calculate $\frac{\partial L}{\partial W}$.</p>
<p>$\frac{\partial L}{\partial W}\ =\ \sum_{t=1}^{T} \frac{\partial L_t}{\partial W}$</p>
<p>Using chain rule,</p>
<p>$\frac{\partial L_t}{\partial W}\ =\ \frac{\partial L_t}{\partial y^{[t]}}\ *\ \frac{\partial y^{[t]}}{\partial h^{[t]}}\ *\ \frac{\partial h^{[t]}}{\partial W}$</p>
<p>Easy? Nope. This interpretation is wrong. Because not just $h^{[t]}$, but the whole chain $h^{[t]}$, $h^{[t-1]}$, … $h^{[0]}$ depends on $W$. So the gradient can’t be calculated using the chain rule alone; we need to go for a total derivative. A big thanks to parameter sharing 😏</p>
<p>So what is the right equation?</p>
<p>$\frac{\partial L_t}{\partial W}\ =\ \frac{\partial L_t}{\partial y^{[t]}}\ *\ \frac{\partial y^{[t]}}{\partial h^{[t]}}\ *\ \sum_{k=0}^{t}\Bigg(\prod_{i=k+1}^{t} \frac{\partial h^{[i]}}{\partial h^{[i-1]}}\Bigg)\ *\ \frac{\partial h^{[k]}}{\partial W}$</p>
<p>Same goes for bias $b_h$</p>
<p>$\frac{\partial L_t}{\partial b_h}\ =\ \frac{\partial L_t}{\partial y^{[t]}}\ *\ \frac{\partial y^{[t]}}{\partial h^{[t]}}\ *\ \sum_{k=0}^{t}\Bigg(\prod_{i=k+1}^{t} \frac{\partial h^{[i]}}{\partial h^{[i-1]}}\Bigg)\ *\ \frac{\partial h^{[k]}}{\partial b_h}$</p>
<p>$\frac{\partial L}{\partial V}$ will have a similar through-time equation, while $\frac{\partial L}{\partial b_y}$, like $\frac{\partial L}{\partial U}$, has only a direct dependency through the second equation.</p>
<blockquote>
<p>That’s the meanest thing I’ve seen in 2018 🤦</p>
</blockquote>
<blockquote>
<p>😂</p>
</blockquote>
<p>Yes, these equations seem complex. But we can interpret them really well to get the intuition.</p>
<ul>
<li>
<p>Normally, for training a neural network, we need to backpropagate through just layers. To train RNN, we need to backpropagate through not just layers but time steps as well.</p>
</li>
<li>
<p>What exactly does the above equation tell us? It is depicting the contribution of a state of the network at a past time step $k$ to the gradient of the loss at the current time step $t$. Yes, blame parameter sharing.</p>
</li>
<li>
<p>The more time steps between $k$ and $t$, the more factors in this product.</p>
</li>
</ul>
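A good way to convince yourself the total-derivative form is right is to check it against a numerical gradient. Everything below is a toy of my own construction: a scalar RNN with one neuron per "layer", a squared-error loss, and hand-picked inputs; the carried term <code>dh_carry</code> implements the sum-over-past-steps in the equation above.

```python
import numpy as np

def forward(w, v, u, x, tgt):
    """Tiny scalar RNN: h_t = tanh(w*h_{t-1} + v*x_t), y_t = u*h_t,
    with squared-error loss summed over time steps."""
    hs, L = [0.0], 0.0
    for t in range(len(x)):
        hs.append(np.tanh(w * hs[-1] + v * x[t]))
        L += (u * hs[-1] - tgt[t]) ** 2
    return L, hs

def grad_w(w, v, u, x, tgt):
    """BPTT for dL/dw: dh_carry accumulates the through-time contributions."""
    _, hs = forward(w, v, u, x, tgt)
    dw, dh_carry = 0.0, 0.0
    for t in reversed(range(len(x))):
        dh = 2.0 * (u * hs[t + 1] - tgt[t]) * u + dh_carry
        da = dh * (1.0 - hs[t + 1] ** 2)   # backprop through tanh
        dw += da * hs[t]                   # contribution of step t to dL/dw
        dh_carry = da * w                  # pass gradient back to step t-1
    return dw

x, tgt = [0.5, -0.3, 0.8], [0.2, 0.1, -0.4]
w, v, u, eps = 0.7, 0.9, 0.6, 1e-6
analytic = grad_w(w, v, u, x, tgt)
numeric = (forward(w + eps, v, u, x, tgt)[0] -
           forward(w - eps, v, u, x, tgt)[0]) / (2 * eps)
# analytic and numeric agree to several decimal places
```

Dropping `dh_carry` (i.e. using only the naive chain rule) makes the two gradients disagree, which is exactly the point of the total derivative.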
<h4 id="vanishing-and-exploding-gradient-problem">Vanishing and Exploding Gradient Problem</h4>
<p>Can you see the factor $\frac{\partial h^{[i]}}{\partial h^{[i-1]}}$ in the above equation? It is a Jacobian matrix. Let’s consider two cases for the norm of this matrix.</p>
<ul>
<li>$|\frac{\partial h^{[i]}}{\partial h^{[i-1]}}|\ >\ 1$</li>
</ul>
<p>The product grows exponentially fast. This makes learning unstable. The gradient can shoot up to $NaN$. This is called exploding gradients.</p>
<ul>
<li>$|\frac{\partial h^{[i]}}{\partial h^{[i-1]}}|\ <\ 1$</li>
</ul>
<p>The product goes to $0$ exponentially fast. Thus long-term dependencies from the past won’t be reflected in the current output. Contributions from far-away steps will vanish. This is called vanishing gradients.</p>
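The effect is easy to feel with plain numbers: treat the Jacobian norm as a scalar factor repeated once per time step (a deliberate simplification).

```python
# If each Jacobian factor has norm slightly above or below 1, a product over
# 50 time steps behaves very differently -- toy scalar stand-ins for norms:
explode = 1.1 ** 50   # ~117: gradients blow up
vanish = 0.9 ** 50    # ~0.005: long-term contributions disappear
```

A 10% deviation per step, compounded 50 times, is already four orders of magnitude of difference between the two regimes.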
<p>These two are the most challenging issues we face when we try to train the RNNs. There are mitigation strategies for both these issues. I’ve written a decent detailed post on them and you may read it <a href="https://sleebapaul.github.io/vanishing_gradients/" target="_blank">here</a>.</p>
<p>So that’s it. I know it was a long journey. But this is worth the effort considering the cool applications around us. Now let’s see all these blah blah blahs in action. I’ve shared a Google Colab notebook in the following link.</p>
<p>Programming session link: <a href="https://drive.google.com/file/d/12pEy-aOS0_PiVkFgxyINmBbtuvB5TqV5/view?usp=sharing" target="_blank">Google Colab notebook</a></p><h1>Language Modelling using Recurrent Neural Networks (Part-1)</h1><p>By sleebapaul, 2018-05-28, <a href="https://sleebapaul.github.io/rnn-tutorial" target="_blank">https://sleebapaul.github.io/rnn-tutorial</a></p><h3 id="disclaimer">Disclaimer</h3>
<p>The audience is expected to have basic understanding of Neural Networks, Backpropagation, Vanishing Gradients and ConvNets. Familiarization of PyTorch is appreciated too, as the programming session will be on it.</p>
<h1 id="motivation">Motivation</h1>
<p>We’ve already achieved a lot of milestones in Deep Learning (DL). Still, calling it Artificial <strong>Intelligence</strong> is not appropriate since solving intelligence is a whole another ball game. But some leaps in DL give us hope that one day we’ll solve intelligence, not necessarily with DL, but somehow we’ll solve it. Such a leap is Google Translate, which supports 103 languages now. Around 18 months ago, <a href="https://ai.google/research/pubs/pub45610" target="_blank">Google Translate moved from good old Statistical Machine Translation(SMT) to Neural Machine Translation(NMT)</a> and the results were captivating.</p>
<p>There are two things which are remarkable about new Google Translate.</p>
<ol>
<li>It solved a complex real-life sequence problem using DL.</li>
<li>It is an end-to-end DL application.</li>
</ol>
<h3 id="what-is-a-sequence-problem">What is a Sequence problem?</h3>
<blockquote>
<p>Imagine you’re given a sequence. Fill in the blank space.</p>
</blockquote>
<blockquote>
<p>1, 2, 3, 4, _</p>
</blockquote>
<blockquote>
<p>Oh come on, it’s 5. Did Google solve THIS?</p>
</blockquote>
<blockquote>
<p>Nope. Let me ask you, how do you know it is five? Why is it not six?</p>
</blockquote>
<blockquote>
<p>Are you dumb? It’s consecutive numbers differing by 1. It’s easy.</p>
</blockquote>
<blockquote>
<p>So there is a relation in that sequence and you found it. Great. Now try this. “Sleeba is native of Kerala. He can fluently speak _____. “</p>
</blockquote>
<blockquote>
<p>Malayalam, dude. Are you fooling around with me?</p>
</blockquote>
<blockquote>
<p>How do you know?</p>
</blockquote>
<blockquote>
<p>That you’re fooling around with me?</p>
</blockquote>
<blockquote>
<p>Nope :D. How do you know it is Malayalam?</p>
</blockquote>
<blockquote>
<p>Because sane people like me can understand the fact that there is a relationship between the language someone can fluently speak and their native place. Kerala speaks Malayalam.</p>
</blockquote>
<blockquote>
<p>Again a relationship. So you understood the relationship with the word Kerala. Try this.
“I bought my poodle from Paris. He barks _____”</p>
</blockquote>
<blockquote>
<p>Loud, maybe?</p>
</blockquote>
<blockquote>
<p>Why not French?</p>
</blockquote>
<blockquote>
<p>You’re mad. How can a dog speak French?</p>
</blockquote>
<blockquote>
<p>So context changed from Paris to Poodle. Try this.
“I bought my poodle from Paris when I was staying with Sleeba. He has this trait of nodding while having food. But he loves <em>__</em>.”</p>
</blockquote>
<blockquote>
<p>I didn’t get the context. Who is “he” here? Sleeba or Poodle? :/</p>
</blockquote>
<blockquote>
<p>Welcome to real-life Sequence problems :D</p>
</blockquote>
<p>A sequence problem is defined through data points confined in time. It is the prediction of the future with the help of patterns learned from the past. As mentioned above, language is a perfect example of a real-life sequence problem.</p>
<p>In human beings, solving sequence problems is a continuous/online process. Our sensory and motor data sequences are continuously streamed to the neocortex, the most evolved part of a mammal’s brain. The neocortex then perpetually anticipates our future actions by processing these streams. This curious virtue of our brain gives us the gift of intelligence. So, solving a sequence problem is a step closer to solving intelligence.</p>
<h3 id="what-is-end-to-end-learning">What is end-to-end learning?</h3>
<p>Usually, end-to-end learning refers to omitting any hand-crafted intermediary algorithms and directly learning the solution of a given problem from the sampled dataset.</p>
<h3 id="what-is-not-end-to-end-learning">What is not end-to-end learning?</h3>
<p>Let’s take the example of classifying apples and oranges. What will we do to identify them? We’ll extract some features. Simple.</p>
<p>Color :</p>
<p>Apple is red or greenish red. Orange is, umm… orange maybe?</p>
<p>Surface:</p>
<p>Apple surface is smooth. For orange it is bumpy.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;border-color:#999;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
.tg .tg-88nc{font-weight:bold;border-color:inherit;text-align:center}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-7btt{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg" style="margin: 0px auto;">
<tr>
<th class="tg-88nc">Fruit</th>
<th class="tg-7btt">Skin</th>
<th class="tg-88nc">Color</th>
<th class="tg-88nc">.......</th>
<th class="tg-7btt">Label</th>
</tr>
<tr>
<td class="tg-c3ow">Apple 1</td>
<td class="tg-c3ow">Smooth</td>
<td class="tg-c3ow">Red</td>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">Apple</td>
</tr>
<tr>
<td class="tg-c3ow">Orange 1</td>
<td class="tg-c3ow">Bumpy</td>
<td class="tg-c3ow">Orange</td>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">Orange</td>
</tr>
<tr>
<td class="tg-c3ow">Apple 2</td>
<td class="tg-c3ow">Smooth</td>
<td class="tg-c3ow">Greenish Red</td>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">Apple</td>
</tr>
<tr>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">...</td>
<td class="tg-c3ow">...</td>
</tr>
</table>
<p>Now we’ll represent these features mathematically and train a classifier on many apples and oranges. Hopefully, the classifier learns the difference between apples and oranges, thus yielding great prediction accuracy on new samples.</p>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;border-color:#999;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
.tg .tg-88nc{font-weight:bold;border-color:inherit;text-align:center}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-7btt{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
</style>
<table class="tg" style="margin: 0px auto;">
<tr>
<th class="tg-88nc">Skin - Smooth</th>
<th class="tg-7btt">Skin - Bumpy</th>
<th class="tg-88nc">Color - Red</th>
<th class="tg-88nc">Color - Greenish Red</th>
<th class="tg-7btt">Color - Orange</th>
<th class="tg-amwm">Label</th>
</tr>
<tr>
<td class="tg-baqh">1</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-baqh">0</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">0</td>
</tr>
<tr>
<td class="tg-baqh">1</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">1</td>
<td class="tg-baqh">0</td>
<td class="tg-baqh">1</td>
</tr>
<tr>
<td class="tg-baqh">...</td>
<td class="tg-baqh"></td>
<td class="tg-baqh">...</td>
<td class="tg-baqh">...</td>
<td class="tg-baqh"></td>
<td class="tg-baqh">...</td>
</tr>
</table>
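The mapping from the feature table to the numeric table above can be written out, for instance, like this. The dictionaries and the column order are my own illustrative choice, matching the one-hot columns shown:

```python
# Hypothetical encoding of the feature table: one-hot columns for each
# categorical feature, plus a binary label (Apple = 1, Orange = 0).
SKIN = {"Smooth": [1, 0], "Bumpy": [0, 1]}
COLOR = {"Red": [1, 0, 0], "Greenish Red": [0, 1, 0], "Orange": [0, 0, 1]}
LABEL = {"Apple": 1, "Orange": 0}

def encode(skin, color, label):
    # Concatenate the one-hot pieces into one row of the numeric table.
    return SKIN[skin] + COLOR[color] + [LABEL[label]]

rows = [
    encode("Smooth", "Red", "Apple"),
    encode("Bumpy", "Orange", "Orange"),
    encode("Smooth", "Greenish Red", "Apple"),
]
```

Each row reproduces one line of the numeric table, ready to be fed to any off-the-shelf classifier.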
<p>But there are a few problems with these hand-crafted features.</p>
<ol>
<li>
<p>For apples and oranges, we can select the features with our intuition, but what about a rocket trajectory regression? Or about gene sequencing? We need subject experts for each problem we solve to decide the vital features to be extracted.</p>
</li>
<li>
<p>The next question is, what if these intuitions go wrong? What if there are features and patterns in the data which are more important than the selected ones?</p>
</li>
<li>
<p>The mighty Homo sapiens don’t learn or predict this way. Homo sapiens can learn and make inferences from raw text, an image or a mere smell; we don’t need specific features. So this approach is a far cry from <strong>intelligence</strong>.</p>
</li>
<li>
<p>Any ML algorithm to date is as good as its input data. There is no black magic. If the features we provide are vague, then the classifier will be helpless.</p>
</li>
</ol>
<p>Now, what if we can simply learn the features too, from the raw data? Then we won’t miss out on the hidden features in the data. We don’t need experts either. Learning then starts from scratch, which is closer to <strong>intelligence</strong>. This is why end-to-end learning is important.</p>
<h1 id="problem-definition">Problem definition</h1>
<p>Let’s begin with ConvNets. We all know that ConvNets work so well with images. But why is it such a success? An image is a spatial distribution of pixel values/numbers.</p>
<p><img src="/assets/rnn_gospel/lincoln_pixel_values.png" alt="image-center" class="align-center" /></p>
<p>So every pattern in an image is spatially related. If an algorithm can represent and address those spatial patterns, it can understand a picture. Convolutions exactly do the same.</p>
<p>But what about a sentence?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I support LGBT rights.
</code></pre></div></div>
<p>Is it spatially distributed? If so, the following sentences should be meaningful too.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Support I rights LGBT.
LGBT support I rights.
Rights I LGBT support.
</code></pre></div></div>
<p>None of them are meaningful. The only meaningful sequence is “I” followed by “support”, next “LGBT” and then “rights”. So the relationships are not spatial, but temporal/sequential. Let’s elaborate on that.</p>
<ol>
<li>Who is supporting LGBT rights? Me.</li>
<li>What I’m supporting? LGBT rights.</li>
</ol>
<p>These answers come out of a meaningful sequential relationship between all those words in that sentence. If we try to represent a temporal distribution as a spatial distribution, we’ll lose these temporal relationships and thereby its meaning. At this junction, an image differs from a sentence. Thus, we need a new architecture which can capture those sequential relationships. Let’s list a bunch of everyday sequence problems before we wrap up.</p>
<ol>
<li>Time series prediction (Weather forecast, Stock prices, …)</li>
<li>Speech (Speech Generation and Recognition, Synthesis, Speech to Text, …)</li>
<li>Music (Music Generation, Synthesis, …)</li>
<li>Text (Language modelling, Named Entity Recognition, Sentiment Analysis, Translation, …)</li>
</ol>
<p>In the next part of this tutorial series, let’s discuss what RNNs, the basic building blocks of Google Translate, are; how they are used for capturing sequential relationships; and how to build a language model using RNNs.</p><h1>PyThesaurus</h1><p>By sleebapaul, 2018-04-15, <a href="https://sleebapaul.github.io/py-thesaurus" target="_blank">https://sleebapaul.github.io/py-thesaurus</a></p><h2 id="description">Description</h2>
<p>This Python package fetches the thesaurus of a given word from the best dictionary sites available online.</p>
<h2 id="why-you-need-this-package">Why do you need this package?</h2>
<p>Though Python provides lexical resources like WordNet, the variety they offer is limited. The rich content that <a href="http://www.thesaurus.com" target="_blank">Thesaurus.com</a> or <a href="http://www.dictionary.com/" target="_blank">Dictionary.com</a> provides helps users enhance their approaches when dealing with text mining, NLP techniques and much more.</p>
<h2 id="how-to-install">How to install?</h2>
<p>Use <code class="language-plaintext highlighter-rouge">pip</code> to install this library.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pip</span> <span class="n">install</span> <span class="n">py_thesaurus</span>
</code></pre></div></div>
<h2 id="how-to-use-pythesaurus">How to use PyThesaurus?</h2>
<p><strong>From python shell</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">from</span> <span class="nn">py_thesaurus</span> <span class="kn">import</span> <span class="n">Thesaurus</span>
<span class="n">input_word</span> <span class="o">=</span> <span class="s">"dream"</span>
<span class="n">new_instance</span> <span class="o">=</span> <span class="n">Thesaurus</span><span class="p">(</span><span class="n">input_word</span><span class="p">)</span>
<span class="c1"># Get the synonyms according to part of speech
</span> <span class="c1"># Default part of speech is noun
</span>
<span class="k">print</span><span class="p">(</span><span class="n">new_instance</span><span class="p">.</span><span class="n">get_synonym</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">new_instance</span><span class="p">.</span><span class="n">get_synonym</span><span class="p">(</span><span class="n">pos</span><span class="o">=</span><span class="s">'verb'</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">new_instance</span><span class="p">.</span><span class="n">get_synonym</span><span class="p">(</span><span class="n">pos</span><span class="o">=</span><span class="s">'adj'</span><span class="p">))</span>
<span class="c1"># Get the definitions
</span>
<span class="k">print</span><span class="p">(</span><span class="n">new_instance</span><span class="p">.</span><span class="n">get_definition</span><span class="p">())</span>
<span class="c1"># Get the antonyms
</span>
<span class="k">print</span><span class="p">(</span><span class="n">new_instance</span><span class="p">.</span><span class="n">get_antonym</span><span class="p">())</span>
</code></pre></div></div>
<p><strong>From command line</strong></p>
<ul>
<li>Positional arguments</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>word --> Word to get definition/synonym/antonym for.
</code></pre></div></div>
<ul>
<li>Optional arguments
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-h or --help Show this help message and exit
-d get definition
-s {noun,verb,adj} get POS specific synonyms
-a get antonyms
</code></pre></div> </div>
</li>
<li>Command
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> py_thesaurus [-h] [-d] [-s {noun,verb,adj}] [-a] word
py_thesaurus -d -s verb -a dream
</code></pre></div> </div>
</li>
</ul>
<h2 id="contact">Contact</h2>
<ol>
<li>
<p>PyPI link: https://pypi.python.org/pypi/py-thesaurus</p>
</li>
<li>
<p>Bitbucket: https://bitbucket.org/redpills01/py_thesaurus.git</p>
</li>
<li>
<p>Issue tracker: https://bitbucket.org/redpills01/py_thesaurus/issues</p>
</li>
<li>
<p>Email: redpillsworkspace@gmail.com</p>
</li>
</ol>
<p><em>Made with Love by Redpills :heart:</em></p>sleebapaulDescriptionWhy tier-X engineering institutions of India need MOOCs?2018-04-02T14:00:00+00:002018-04-02T14:00:00+00:00https://sleebapaul.github.io/why-india-needs-MOOCs<p><strong>This article is featured in <a href="https://yourstory.com/mystory/29319b1840-why-tier-x-engineering" target="_blank">YourStory</a> and <a href="https://www.manoramaonline.com/education/campus-updates/2018/07/06/mooc-study.html" target="_blank">Malayala Manorama</a>.</strong></p>
<p>I graduated from a tier-2 college in Kerala, India. My under-graduation alma mater was maybe tier-3 in the state. Well, there are legit official metrics to divide these tiers, but my view is from a student’s angle. In that perspective, aspects like the minimum entrance-examination rank for admission, placement history, and alumni are the primary judging factors in the selection of a college/university.</p>
<p>To cut a long story short, I’m not an alumnus of any ivy-league institution in India. I don’t have the prestigious tags of IITs, NITs, BITS, et al. I tried twice to get in, for graduation and under-graduation, but both times I couldn’t make it. My bad.</p>
<p>If you don’t have an ivy-league alma mater, well, you don’t have an ivy-league alma mater. It does matter at times, be it in your career or higher studies or whatever you would like to do after college. From irksome relatives to hiring managers, people do judge you on that point. But you can’t just dissolve because you didn’t get into a top institution, right? :grinning:</p>
<p>The crowd I would like to address in this article is the engineering students in India who couldn’t make it to an ivy league, got admitted to a tier-X engineering college, and are interested in an engineering career. Since I have a similar background, I know how worried this group would be about their career and their very future. In the rest of this article, I would like to explain how this group can benefit from MOOCs to fit themselves into the job market and thus build a fabulous professional life.</p>
<h3 id="what-are-moocs">What are MOOCs?</h3>
<p>MOOCs, an abbreviation of Massive Open Online Courses, were first introduced in 2006 and emerged as a favorite mode of learning in 2012. Before the digital age, we had correspondence courses for distance learning. When the internet grew, we saw the initial forms of MOOCs as e-learning. But these models were sloppy; the dropout rates were high. The 2000s saw radical changes in online presence, and MOOCs were launched with the aim of unlimited participation and open access via the web.</p>
<p><img src="/assets/moocs/mooc_growth.png" alt="image-center" class="align-center" /></p>
<p>MOOCs are not just a bunch of recorded lecture videos, like e-learning. They provide interactive user forums to support community interactions among students, professors, and teaching assistants (TAs). The faculty are pioneers in their respective fields, and the courses are focused on skill building. The assignments and evaluations are time-bound and challenging.</p>
<p>Let’s examine three potential areas of the Indian engineering education system where MOOCs can help out immensely.</p>
<h3 id="1-the-active-gap-between-academia-and-industry-skills">1. The active gap between academia and industry: Skills</h3>
<p>The Indian engineering education system has some fundamental flaws. We fumble over the fact that engineering is intrinsically experimental, not theoretical. In fact, most of the vital engineering principles were formulated empirically, not the other way around. When we experiment with theories, we understand the methods better, develop new ones, and, most importantly, learn new skills. If skill development is unprioritized in academia, students will miss the cause of the degree itself, because once the course is over, they are thrown into the job market, where the elemental interest is skills.</p>
<p>I’ll elaborate on the aforementioned with an example. Consider a mechanical engineering student who studies the lathe machine. She/he learns about its working, the father of the lathe, its year of invention, and solves equations of its design paradigm. During the examination, she/he writes about all this with a neat diagram, and full marks are assured. Neither the student nor her/his lecturer (could be someone who was unemployed after a masters and joined the lecturer post as a brief setup; again, the lecturer is helpless since the system is expected to work in this fashion) thinks about the practical aspect of it. As a matter of fact, the student is not going to design a lathe machine by hand in her/his career. (If there were such jobs, the temporary lecturer would’ve got one for sure, because she/he studied the same in her/his time.) Meanwhile, in industry, Ansys<sup><strong>1</strong></sup>, a 3D simulation and design software, is used for design applications. While modeling a lathe on Ansys, the student gets the chance to explore the design from a whole new viewpoint; she/he can extend it even to machines which are not in the curriculum. She/he becomes a practitioner of her/his knowledge and discovers much more engineering design at an industrial level. Now, a lathe machine in the textbook is the theory, and Ansys is a sought-after skill in the job market<sup><strong>2</strong></sup>; learning the lathe with Ansys will be the perfect blend of theory and practice. This combination will deliver industry-fit candidates out of college.</p>
<p>But there are labs in our engineering curriculum to address this concern, ain’t it? What if the above example is a little misleading?</p>
<p>A tier-X engineering college student might’ve seen a method executed by bulk placement-santas like TCS and other service-industry tycoons. They first send their recruits to a 6-to-12-month training program, which is envisioned to build industry-fit, skilled professionals. This particular action validates the argument of the active skill gap between academia and industry. Yes, we’ve labs in our curriculum, but unfortunately, they are no match for industry standards. Moreover, all these employers are aware of it.</p>
<p>Another anxious finding is this Ficci-Nasscom report<sup><strong>3</strong></sup>, which says that by 2022, 9% of Indians will be in jobs that don’t exist today. Again, this report manifests that the industry evolves much faster than academia; how we are going to prepare our students for such a volatile ecosystem is a simmering question.</p>
<p>Talking about new skills, we can’t ignore the unusual significance of computer science in engineering nowadays. Back in 2016, I wrote an article on the topic Why all should learn how to code<sup><strong>4</strong></sup>. In that article, I explained the importance which great universities give to computer science while setting up their departments and curricula. Even topics which are barely related to computer science explicitly are now learned and practiced more effectively with the help of computation; we’ve computational cell biology<sup><strong>5</strong></sup> to computational psychiatry<sup><strong>6</strong></sup>. But why computer science? Why not chemistry or civil engineering?</p>
<p>In the last 100 years of human history, the most volatile technological area is computer science. In an Atlantic report on the topic The 50 Greatest Breakthroughs Since the Wheel<sup><strong>7</strong></sup>, the internet and personal computers are the youngest yet the most influential technologies that changed our world forever. From Turing’s machine<sup><strong>8</strong></sup>, conceived in 1936, which cracked the Enigma code, to the quantum computers and qubits of 2018<sup><strong>9</strong></sup>, computer science not only matured itself; it revolutionized everything around us, including the way we think<sup><strong>10</strong></sup> and learn<sup><strong>11</strong></sup>. It is applicable and scalable almost everywhere. Computers are less vulnerable to mistakes and working environments. They can take logical and optimal decisions, sometimes even better than humans<sup><strong>12</strong></sup>. The progress in mathematical modeling of physical phenomena and the growth of semiconductor technologies helped computers to govern all their applied fields.</p>
<p>No matter which branch of engineering you’re enrolled in, the real-life practice of that stream will require computation. And computation requires a set of instructions to be performed by a computer; no wonder coding is becoming the rudimentary skill of the 21st century. So brace yourselves: the future engineering job market might not need someone who ignored computer science completely. No, I’m not implying that everyone should become software engineers; rather, we should accept the reality that programming and software skills are becoming basic abilities for any future engineering job.</p>
<p>Sadly, in India, as quoted by Hindu Business Line<sup><strong>13</strong></sup>, only 4.77% of candidates can write the correct pseudocode for a problem, a minimum requirement for any programming job.</p>
<p>Keeping up with the pace of industrial trends requires periodic updates to the curriculum, but that’s not something the Indian education system is well acquainted with. So, at the end of the day, it is the responsibility of an engineering student to have an idea of the current booms in business, the jobs that are disappearing, and the ones which are popping up. With that information, she/he can plan how to fit into the job market.</p>
<p>While traditional academia is sluggish in building these essential skills, a MOOC is primarily designed to build skills. Every course we see on MOOC platforms like Coursera, edX, Udacity, etc. is a reflection of the current job market. The providers study current job trends to design courses. Enrolling in such a timely course will help students be aware and ready to suit themselves to the industry. For example, as we behold the growing buzz of Artificial Intelligence (AI), all the MOOC platforms are providing high-quality courses to learn and practice AI.</p>
<p>The profits don’t stop there. MicroMasters<sup><strong>14</strong></sup>, an initiative by edX, offers series of courses on specific topics in collaboration with industrial giants. Each MicroMasters (MM) is sponsored by at least one industry partner, currently a list of 40 which includes GE, Microsoft, IBM, Hootsuite, Fidelity, Bloomberg, Boeing, Walmart, PwC, Booz Allen Hamilton, and Ford. Again, MM presents an exceptional opportunity with university credits. It is simple: you enroll in and complete the courses in an MM, then apply to a university that accepts your MicroMasters certificate for credit. If admitted to that university, your credits will be transferred to the coursework, and you only need to take courses for the remaining credits. For undergraduate students who would like to study abroad for their graduation, this is a great deal. Some of the universities which accept MM credits are the Massachusetts Institute of Technology, Columbia University, Indian Institute of Management, Australian National University, University of Michigan, and Rochester Institute of Technology.</p>
<p><img src="/assets/moocs/subj_division.png" alt="image-center" class="align-center" /></p>
<p>Udacity is providing a chance to earn a Developer Scholarship from Google<sup><strong>15</strong></sup>, open to Indian residents who are eager to master web and mobile development skills. Again, Udacity’s Nanodegrees are built with industry experts and leading technology companies from Silicon Valley like Google, Amazon, and Facebook, to ensure that students master the skills they need to meet the requirements of the industry.</p>
<p>Thus, these programs become a three-way arrangement between educator, student, and employer.</p>
<h3 id="2-the-quality-of-faculty">2. The quality of faculty</h3>
<p>A trend is mounting in our country where people who don’t have an option for employment after engineering take up teaching in a tier-X engineering college as a temporary setup. Again, this is the aftereffect of the aforementioned skill gap. Many tier-X colleges are functioning with these temporary faculty, which is economically favorable for the management. This bias itself can tamper with the quality of education.</p>
<p>Concurrently, a significant percentage of permanent faculty in tier-X engineering institutions misuse their comfort zones. I’ll give you a quick example. In tier-X colleges of India, professors have paid vacation; that is, they actually get paid during semester breaks. But to our surprise, an MIT professor doesn’t have such a privilege<sup><strong>16</strong></sup>. Now think about the quality difference between the candidates both professors are creating.</p>
<p>Again, the lack of research aspirations, industry collaboration, and credible personal projects among the faculty makes them unfit to inspire their students. Teaching becomes an exercise with little or no intellectual effort; the art of storytelling is not even worth considering at this scene. This culture is apparently horrendous, but there is no quick fix for it. Establishing the quality of faculty would be a long-term process which requires a lot of restructuring and benchmarking. A student who does a 4-year degree can’t wait for that new dawn to set up her/his career.</p>
<p><strong>Why great teachers matter?</strong></p>
<p>In his last lecture<sup><strong>17</strong></sup>, the famous MIT professor Walter Lewin spoke about how to learn physics.</p>
<blockquote>
<p>You’ve to love it. If you don’t love it, don’t touch it. And if you hate it, it is because you had a very bad teacher. I make my every student … love physics.</p>
</blockquote>
<p>Lewin knows that physics is so intriguing that no one can hate it. Still, if someone hates it, then the person most responsible for that trouble would be the teacher who introduced physics to them. It applies to every subject we learn, not just physics.</p>
<p>Teaching requires passionate intellectual effort and storytelling. People don’t just get abstract ideas without examples. Once they don’t get it, they’ll miss out on the fun of it. And once they lose the joy, they’ll start hating it. Now imagine, if you hate coding, how about making a living out of it after college?</p>
<p>In MOOCs, the scene is diametrically opposite. You can learn Python from Eric Grimson<sup><strong>18</strong></sup>, the Bernard Gordon Chair of Medical Engineering at MIT. Keith Devlin<sup><strong>19</strong></sup>, Executive Director of the H-STAR Institute at Stanford University, will teach you Introduction to Mathematical Thinking. Sebastian Thrun<sup><strong>20</strong></sup>, the founder of Google X and Google’s self-driving car team, may teach you Self-Driving Cars at Udacity.</p>
<p>A click away, you’re getting world-class teaching from the pioneers in your field of interest. You couldn’t make it to Stanford or MIT to attend their lectures, so through MOOCs they are coming to your living room to teach you at your convenience. How cool would that be, ain’t it?</p>
<h3 id="3-enforced-disciplines-in-engineering">3. Enforced disciplines in Engineering</h3>
<p>If we examine the history of science and engineering, it was always interdisciplinary. Michael Faraday<sup><strong>21</strong></sup>, who is famous for electromagnetic induction, also isolated and identified benzene. Geoffrey Hinton<sup><strong>22</strong></sup>, who is known as the Godfather of Artificial Intelligence, did his bachelors in experimental psychology.</p>
<p>However, among students in India, there is a misconception about <strong>core jobs</strong>. Say I studied electrical engineering; am I destined to vest my life in high-power generators? This attitude is futile. Having joined an electrical course doesn’t mean you should be working on the maintenance of electric motors. You can optimize brain-imaging machinery too. That is the gracious virtue of an engineering career.</p>
<p>When we tag ourselves as an Electrical Engineer or a Mechanical Engineer, we ignore the fact that engineering is inherently applied. A true engineer is someone who can solve a problem by <strong>applying</strong> his domain knowledge. At the same time, to understand a problem in another field, one should have a basic grip on that discipline too. In favor of that prospect, i.e., to create engineers with multi-discipline knowledge, optional subjects are added to our curriculum. However, in tier-X colleges, most of the time there won’t be a faculty member with domain knowledge for all optional subjects, and the students are forced to learn whichever topic a faculty member is available for.</p>
<p>MOOCs can help with this problem as well. MOOC platforms offer courses from supply chain management<sup><strong>23</strong></sup> to neuroscience<sup><strong>24</strong></sup>, so students can get acquainted with a broad spectrum of topics. This facility helps students widen the application of their domain knowledge to multiple terrains.</p>
<h3 id="closing-thoughts">Closing thoughts</h3>
<p>Coming to a halt, I would add some aspects of MOOCs not yet discussed. As the years have gone by, MOOCs have achieved many milestones. Despite their potential to support learning and education, MOOCs have a major concern related to attrition rates and course dropout. Even though the number of learners who enroll in the courses tends to be in the thousands, only a very small portion of the enrolled learners complete the course. According to the visualizations and analysis conducted by Katy Jordan (2015)<sup><strong>25</strong></sup>, the investigated MOOCs have a typical enrollment of 25,000, even though enrollment has reached values up to ~230,000. Jordan reports that the average completion rate for such MOOCs is approximately 15%. Early data from Coursera suggest a completion rate of 7%–9%. Coffrin et al.<sup><strong>26</strong></sup> report even lower completion rates (between 3% and 5%), along with a consistent and noticeable decline in the number of students who participate in the course every week. Others have also shown attrition rates similar to Coffrin’s. Yang et al.<sup><strong>27</strong></sup> (2013) suggest that even though a large proportion of students drop out early on for a variety of reasons, a significant proportion remain in the course and drop out later, thus causing attrition to happen over time.</p>
<p>Having said that, research indicates that completion rates are not the right metric to measure the success of MOOCs. Alternate metrics<sup><strong>28</strong></sup> have been proposed to measure the effectiveness of MOOCs and online learning. I personally believe that dropping out is a choice a student can always take, but before making that choice, she/he should evaluate the worth of its returns. A MOOC tests your perseverance, patience, learning capacity, adaptiveness, and aptitude; implicitly, the course is helping you understand who you truly are. And trust me, there is always room for improvement in these personal skills, else I wouldn’t have completed 15+ MOOC certifications in the last two years.</p>
<p><img src="/assets/moocs/stats.png" alt="image-center" class="align-center" /></p>
<p>Finally, the tale in the early days of the MOOC space was about the disruption of universities. Now we know that MOOCs are not going to lead to the demise of universities. However, according to the previous CEO of Coursera, Rick Levin, while MOOCs may not have disrupted the higher-education market, they are disrupting the labor market, and for a long time the Indian engineering industry has been yearning for such a disruption in talent search and supply.</p>
<p>I’m adding the <a href="https://www.quora.com/profile/Sleeba-Paul-1" target="_blank">link to my Quora profile</a>, where I write answers on FAQs of MOOCs.</p>
<p>So, Happy MOOCing Everyone :grinning:</p>
<h4 id="references">References</h4>
<ol>
<li>
<p><a href="https://www.edx.org/course/a-hands-on-introduction-to-engineering-simulations" target="_blank">Ansys Online course on edX</a></p>
</li>
<li>
<p><a href="https://www.linkedin.com/jobs/ansys-jobs/?country=in" target="_blank">List of Ansys Jobs</a></p>
</li>
<li>
<p><a href="http://www.ey.com/in/en/newsroom/news-releases/news-ey-ficci-nasscom-and-ey-future-of-jobs-report" target="_blank">Job scene on India - Ficci-Nasscom report</a></p>
</li>
<li>
<p><a href="https://medium.com/@sleebapaul/why-all-should-learn-how-to-code-36eac636df48" target="_blank">Medium article - Why everyone should learn how to code?</a></p>
</li>
<li>
<p><a href="https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/video-lectures/lecture-1-course-introduction-history-of-computational-biology-overview-of-the-course-course-policies-and-mechanics-dna-sequencing-technologies/" target="_blank">MIT online course on Computational Biology</a></p>
</li>
<li>
<p><a href="http://cocosci.mit.edu/people" target="_blank">MIT Department of Computation Psychology</a></p>
</li>
<li>
<p><a href="https://www.theatlantic.com/magazine/archive/2013/11/innovations-list/309536/#list" target="_blank">Atlantic article on Top technological advances since wheels</a></p>
</li>
<li>
<p><a href="https://www.iwm.org.uk/history/how-alan-turing-cracked-the-enigma-code" target="_blank">Turing’s machine that cracked German military code</a></p>
</li>
<li>
<p><a href="https://www.technologyreview.com/s/609451/ibm-raises-the-bar-with-a-50-qubit-quantum-computer/" target="_blank">Advancement of Quantum Computing</a></p>
</li>
<li>
<p><a href="http://web.mit.edu/sturkle/www/pdfsforstwebpage/Turkle_how_computers_change_way_we_think.pdf" target="_blank">How computers changed the way we think?</a></p>
</li>
<li>
<p><a href="http://www.bbc.com/future/story/20141022-are-we-getting-smarter" target="_blank">How computers changed the way we learn?</a></p>
</li>
<li>
<p><a href="https://www.technologyreview.com/the-download/609510/a-new-algorithm-can-spot-pneumonia-better-than-a-radiologist/" target="_blank">A New Algorithm Can Spot Pneumonia Better Than a Radiologist</a></p>
</li>
<li>
<p><a href="https://www.thehindubusinessline.com/info-tech/95-engineers-in-india-unfit-for-software-development-jobs-study/article9652211.ece" target="_blank">95% engineers in India unfit for software development jobs: study</a></p>
</li>
<li>
<p><a href="https://www.edx.org/micromasters" target="_blank">Micromasters at edX</a></p>
</li>
<li>
<p><a href="https://in.udacity.com/google-india-scholarships" target="_blank">Udacity google scholarship</a></p>
</li>
<li>
<p><a href="http://qr.ae/TU8DP8" target="_blank">Quora - MIT Professor answer to leave policy</a></p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=4a0FbQdH3dY" target="_blank">For the love of physics - Walter Lewin last lecture</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Eric_Grimson" target="_blank">Eric Grimson - Wiki</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Keith_Devlin" target="_blank">Keith Devlin - Wiki</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Sebastian_Thrun" target="_blank">Sebastian Thrun - Wiki</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Michael_Faraday" target="_blank">Micheal Faraday - Wiki</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Geoffrey_Hinton" target="_blank">Geoffrey Hinton - Wiki</a></p>
</li>
<li>
<p><a href="https://www.coursera.org/specializations/supply-chain-management" target="_blank">Supply Chain Management - Coursera</a></p>
</li>
<li>
<p><a href="https://www.coursera.org/learn/computational-neuroscience" target="_blank">Computational Neuroscience - Coursera</a></p>
</li>
<li>
<p><a href="http://www.katyjordan.com/MOOCproject.html" target="_blank">MOOC Completion Rates: The Data - Katy Jordan</a></p>
</li>
<li>
<p><a href="https://www.researchgate.net/publication/261178375_Visualizing_Patterns_of_Student_Engagement_and_Performance_in_MOOCs" target="_blank">Visualizing Patterns of Student Engagement and Performance in MOOCs - Coffrin et.al.</a></p>
</li>
<li>
<p><a href="https://www.sciencedirect.com/science/article/pii/S1877042814052707" target="_blank">Students’ Preferences and Views about Learning in a MOOC - Yang et.al.</a></p>
</li>
<li>
<p><a href="https://pdfs.semanticscholar.org/17b0/5acab2d18be484c180a1a3b68c1e04a01836.pdf" target="_blank">EVALUATING EFFECTIVENESS OF MOOCS USING EMPIRICAL TOOLS: LEARNERS PERSPECTIVE</a></p>
</li>
</ol>
<h4 id="related-readings">Related Readings</h4>
<ol>
<li>
<p><a href="https://www.forbes.com/2009/02/19/innovation-internet-health-entrepreneurs-technology_wharton.html#2d1ba0ef2b2f" target="_blank">Top 30 Innovations Of The Last 30 Years</a></p>
</li>
<li>
<p><a href="https://www.class-central.com/report/mooc-stats-2017/" target="_blank">Class Central stats MOOCs 2017</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Massive_open_online_course" target="_blank">MOOCs Wiki</a></p>
</li>
<li>
<p><a href="https://www.indiatoday.in/education-today/featurephilia/story/engineering-employment-problems-329022-2016-07-13" target="_blank">Only 7 per cent engineering graduates employable: What’s wrong with India’s engineers?</a></p>
</li>
</ol>sleebapaulResidual Networks2018-03-10T18:00:00+00:002018-03-10T18:00:00+00:00https://sleebapaul.github.io/resnets-tutorial<h4 id="disclaimer">Disclaimer</h4>
<p style="text-align: justify;">This is a tutorial on the paper <a href="https://arxiv.org/pdf/1512.03385.pdf" target="_blank">Deep Residual Learning for Image Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun at Microsoft Research</a>. The audience is expected to have basic understanding of Neural Networks, Backpropagation, Vanishing Gradients and ConvNets. Familiarization of Keras is appreciated too, as the programming session will be on it.</p>
<p style="text-align: justify;">This tutorial will focus mostly on the ResNet paper contents. If the reader would like a preamble, I’ve written a comprehensive discussion of the issues addressed by ResNets <a href="https://sleebapaul.github.io/vanishing_gradients/" target="_blank">here</a>. If you’re a beginner, this discussion is highly recommended.</p>
<p style="text-align: justify;">Using ResNets, in the ImageNet challenge 2015, the Microsoft team won first place in all three categories it entered: classification, localization and detection. Its system was better than the other entrants by a large margin. In the Microsoft Common Objects in Context challenge, also known as MS COCO, the Microsoft team won first place for image detection and segmentation.</p>
<p style="text-align: justify;">Let’s learn the magic of ResNets together, shall we? :)</p>
<h2 id="what-is-wrong-with-deep-neural-networks-">What is wrong with Deep Neural Networks?</h2>
<p style="text-align: justify;">The ResNets paper starts by asking this question. Deep neural networks can learn the most difficult tasks, but training them has always been an obstacle in deep learning research. There are mainly two issues researchers confront:</p>
<ul>
<li>Vanishing Gradients</li>
</ul>
<p style="text-align: justify;">The vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based methods (e.g., backpropagation). In particular, this problem makes it really hard to learn and tune the parameters of the earlier layers in the network, as the gradients die out gradually while propagating from the final layer to the first layer. This problem becomes worse as the number of layers in the architecture increases.</p>
<ul>
<li>Exploding Gradients</li>
</ul>
<p style="text-align: justify;">Exploding gradients are a problem where large error gradients accumulate and result in very large updates to the network weights during training, and in turn an unstable network which refuses to converge to a local optimum. At an extreme, the values of the weights can become so large as to overflow and result in NaN values.</p>
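<p style="text-align: justify;">A toy numeric sketch (my own illustration, not from the paper) shows why both problems are inherent to deep stacks: the gradient reaching the first layer is roughly a product of one factor per layer, so factors below 1 shrink it exponentially while factors above 1 blow it up.</p>

```python
def gradient_at_first_layer(per_layer_factor, num_layers):
    """Toy model: the backpropagated gradient is a product of per-layer factors."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= per_layer_factor
    return grad

# Factors below 1 (e.g., small activation derivatives) make gradients vanish
print(gradient_at_first_layer(0.25, 50))  # ~7.9e-31, effectively zero

# Factors above 1 make gradients explode
print(gradient_at_first_layer(1.5, 50))   # ~6.4e+08, destabilizing updates
```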
<p style="text-align: justify;">In the preamble I mentioned above, you can read about these issues in detail.</p>
<p style="text-align: justify;">Okay let’s keep these issues aside for a while. Let’s think about something more intriguing.</p>
<p style="text-align: justify;">What do we mean by “Deep” in Deep Neural Networks?</p>
<p style="text-align: justify;">Yeah, of course, it is the number of layers, and indeed depth is the important virtue that helps Deep Neural Networks learn complex patterns in data.</p>
<blockquote>
<p>Imagine, I want to perform a 100 label image classification problem.</p>
</blockquote>
<blockquote>
<p>Okay.</p>
</blockquote>
<blockquote>
<p>I’ve got a training error of 10% in 100 layer.</p>
</blockquote>
<blockquote>
<p>Mmm Hmm…</p>
</blockquote>
<blockquote>
<p>Well, that’s not an impressive error rate, so you are going to alter your network.</p>
</blockquote>
<blockquote>
<p>Me? It’s your thing :/</p>
</blockquote>
<blockquote>
<p>Okay. I will do that :D But the question is, given that more layers can learn complex patterns,
would stacking another 100 layers bring down the training error?</p>
</blockquote>
<blockquote>
<p>Intuitively, it should, right?</p>
</blockquote>
<p style="text-align: justify;">Unfortunately, that is not true, and it is disturbing. Just adding more layers doesn’t serve the purpose all the time. Let’s discuss two different aspects of that problem.</p>
<p style="text-align: justify;">In a basic neural network architecture, we stack layers upon layers. Implicitly, this architecture results in vanishing and exploding gradients when gradients are back propagated. This effect can be addressed by normalized initialization of weights, usage of ReLUs as activation functions, batch normalization after intermediate layers, and many more techniques, but <a href="https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b" target="_blank">none of them are perfect solutions to overcome vanishing/exploding gradients</a>. Fundamentally, the basic architecture has issues when things go deeper.</p>
<p style="text-align: justify;">Other than gradient vanishing, in many deep network applications, researchers confront a degradation problem, i.e. as network depth increases, the accuracy of the model gets saturated and then degrades rapidly. In some cases, training error shoots up when we add more layers, which is surprising and counterintuitive. You can read more about this phenomenon <a href="https://arxiv.org/pdf/1412.1710.pdf" target="_blank">here</a> and <a href="https://arxiv.org/pdf/1505.00387.pdf" target="_blank">here</a>. This degradation problem is not due to overfitting. In fact, this is the main issue ResNets are going to solve. See the figure below on the degradation problem, taken from the original paper.</p>
<p><img src="/assets/resnets/resnet_degradation.png" alt="image-center" class="align-center" /></p>
<h4 id="the-degradation-problem">The degradation problem</h4>
<p style="text-align: justify;">If overfitting is not the reason for degradation, then what is wrong? To explain that, consider the shallow network given below.</p>
<p><img src="/assets/resnets/shallow_net.png" alt="image-center" class="align-center" /></p>
<p style="text-align: justify;">Now let’s make its deeper counterpart by stacking up some layers. To make it a counterpart, the added layers should be <strong>identity mappings</strong>.</p>
<p style="text-align: justify;">An identity mapping or identity function is nothing but $f(x)\ =\ x$. What goes out is what comes in, or say, the output is the same as the input. Intuitively, a shallow network plus identity mappings should give a deeper counterpart of that shallow network.</p>
<p><img src="/assets/resnets/deep_net_with_identity_mapping.png" alt="image-center" class="align-center" /></p>
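<p style="text-align: justify;">As a sanity check of this construction, here is a minimal NumPy sketch (the weights and sizes are made up for illustration): a tiny shallow net, and a deeper counterpart built by stacking exact identity layers on top of it. By construction, both produce identical outputs.</p>

```python
import numpy as np

# Hypothetical two-layer "shallow" net with random fixed weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
relu = lambda z: np.maximum(z, 0.0)

def shallow(x):
    return W2 @ relu(W1 @ x)

def deeper(x):
    h = shallow(x)
    for _ in range(100):        # 100 stacked layers, each the identity f(h) = h
        h = np.eye(2) @ h
    return h

x = rng.normal(size=3)
assert np.allclose(shallow(x), deeper(x))   # identical outputs by construction
```

<p style="text-align: justify;">So, on paper, the deeper network can always do at least as well as the shallow one. The catch, as we are about to see, is whether training can actually find those identity weights.</p>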
<blockquote>
<p>Comparing both, what should we expect when it comes to the training error of these two networks?</p>
</blockquote>
<blockquote>
<p>Intuitively, it should be comparable, right ? Or I would say, deeper network will have less training error :/</p>
</blockquote>
<blockquote>
<p>Yes, but practically, the training error of the deeper network is higher compared to its shallow counterpart.</p>
</blockquote>
<blockquote>
<p>What? :/</p>
</blockquote>
<blockquote>
<p>Yeah :)</p>
</blockquote>
<p style="text-align: justify;">This experiment exposes the culprit behind the degradation problem. <strong>The added multiple layers fail to learn identity mappings.</strong> We thought, if Neural Networks can understand complex patterns, it would be easy for them to understand identity mappings as well. But in this messy real world, where Neural Networks train in the midst of zombies like vanishing gradients and numerical instability, theories fail. Tough life.</p>
<p style="text-align: justify;">So, how can this degradation problem be solved using ResNets? First, let’s see what residual means.</p>
<h3 id="residual">Residual</h3>
<p style="text-align: justify;">Residue has a meaning in different fields of math, especially in <a href="https://en.wikipedia.org/wiki/Residue_(complex_analysis)" target="_blank">complex analysis</a>. Don’t confuse it with the residual in numerical analysis, which is our area of interest.</p>
<p style="text-align: justify;">Consider the function, \(f(x)\ =\ x^2\)</p>
<p style="text-align: justify;">What is $f(2)$ ? It’s 4.</p>
<p style="text-align: justify;">What about $f(1.99)$ ? It is 3.9601.</p>
<p style="text-align: justify;">So let’s put it this way. I wanted to calculate $f(2)$ but I could compute only an approximation which is $f(1.99)$. So what is the error in computation here?</p>
<p style="text-align: justify;">The error in $x$ is $0.01$.</p>
<p style="text-align: justify;">The difference in $f(x)$ is $4\ -\ 3.9601\ =\ 0.0399$</p>
<p style="text-align: justify;"><strong>This difference is called residual.</strong></p>
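<p style="text-align: justify;">The same arithmetic, as a tiny Python snippet:</p>

```python
# The residual is the error in f(x), not the error in x itself.
def f(x):
    return x ** 2

true_value = f(2.0)        # 4.0
approximation = f(1.99)    # 3.9601
residual = true_value - approximation
print(residual)            # 0.0399 (up to floating-point rounding)
```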
<h3 id="a-residual-block">A Residual Block</h3>
<p style="text-align: justify;">Let’s bring this concept to Neural Nets. Say, two consecutive layers in our network have to learn the mapping $H$. This is the original mapping which is to be learned. It can be identity or complex relationships, we don’t know.</p>
<p style="text-align: justify;">So, if the input is $x$, then the output after these layers will be $H(x)$. Simple.</p>
<p style="text-align: justify;">Now, we’re going to bring in the residual concept in here. Consider the figure from original paper.</p>
<p><img src="/assets/resnets/residualunit.png" alt="image-center" class="align-center" /></p>
<p style="text-align: justify;">There is a shortcut, a hard-wired connection from input to output, which carries the input $x$ itself. The mapping learned by the layers is not $H(x)$ anymore. It is $F(x)$.</p>
\[F(x)\ =\ H(x)\ -\ x\]
<p style="text-align: justify;">Once the shortcut and the main path are joined, we get our original mapping $H(x)$. Now, say the network needs to learn an identity mapping $H(x)\ =\ x$. What the layers actually have to learn is something else, which is not the identity.</p>
\[F(x)\ =\ H(x)\ -\ x\ =\ x\ -\ x\ =\ 0\]
<p style="text-align: justify;">Since the input is added back at the output, even though the network couldn’t learn anything, the output of the residual block will be,</p>
\[H(x)\ =\ F(x)\ +\ x\ =\ 0\ +\ x\ =\ x\]
<p style="text-align: justify;">Woo-Hoo !!! We got the identity mapping ;)</p>
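<p style="text-align: justify;">The whole trick can be sketched in a few lines of Python (an illustrative toy, not the paper’s implementation): the block computes $F(x)\ +\ x$, so if the layers learn nothing, i.e. $F(x)\ =\ 0$, the block falls back to the identity for free.</p>

```python
import numpy as np

# Minimal residual-block forward pass: output = F(x) + x.
def residual_block(x, F):
    return F(x) + x

x = np.array([1.0, -2.0, 3.0])
learned_nothing = lambda x: np.zeros_like(x)   # the layers learned F(x) = 0
out = residual_block(x, learned_nothing)
assert np.allclose(out, x)                     # H(x) = x, identity recovered
```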
<h3 id="hypothesis">Hypothesis</h3>
<p style="text-align: justify;">The residual block is built on the following hypothesis: if one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions. Now, if you ask me how exactly we can prove this hypothesis, I would say it is an open question. People do have different opinions about it. If you would like, you can read about asymptotic approximation <a href="http://dlmf.nist.gov/2" target="_blank">here</a>. Well, I’m sorry it’s dry. No wait, why should I be sorry? No sorry.</p>
<p style="text-align: justify;">So, the point here is that the latter approach is easier to learn, and this is empirically proved by the authors. I’m going to bring in some equations here, which we have already walked through in a lighter mode.</p>
<p style="text-align: justify;">If $x$ being the input to residual block and $H(x)$ is the original mapping, then,</p>
\[H(x)\ =\ F(x,\ \{W_{i}\})\ +\ x\]
<p style="text-align: justify;">where $F(x)$ is the relationship learned by the layers, a function of the input $x$ and the weights of the layers embedded in that block. Say, for the above block, the two layers have respective weights $W_{1}$ and $W_{2}$, and $\sigma$ is the ReLU activation; then,</p>
\[F(x)\ =\ W_{2}*\sigma(W_{1}*x)\]
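<p style="text-align: justify;">A minimal NumPy sketch of this two-layer residual function, with hypothetical square weight matrices so that adding $x$ back is well-defined:</p>

```python
import numpy as np

# Square weights keep the dimension of F(x) equal to that of x,
# so the plain addition H(x) = F(x) + x works without any reshaping.
def relu(z):
    return np.maximum(z, 0.0)

def residual_function(x, W1, W2):
    return W2 @ relu(W1 @ x)                 # F(x) = W2 * sigma(W1 * x)

def residual_block(x, W1, W2):
    return residual_function(x, W1, W2) + x  # H(x) = F(x) + x

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 3))                 # made-up weights for illustration
W2 = rng.normal(size=(3, 3))
x = rng.normal(size=3)
H = residual_block(x, W1, W2)                # 3-dim in, 3-dim out
```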
<h3 id="advantages-of-residual-block-architecture">Advantages of Residual Block architecture</h3>
<ul style="text-align: justify;">
<li>There are no additional parameters to be learned from the shortcut connection, since it’s a hard-wired input or identity mapping.</li>
</ul>
<ul style="text-align: justify;">
<li>There is no need to alter the learning algorithm, say backpropagation, since the shortcut connection doesn’t disturb the learning procedure in any way. Learning only happens in the main path.</li>
</ul>
<ul style="text-align: justify;">
<li>These residual units can be stacked on one another to build a really deep network without any hassle, keeping the above two leverages alive. In the figure given below, the left one is our normal, <code class="language-plaintext highlighter-rouge">plain</code> network and the right one is a Residual Network (ResNet). Notice the skip connections. With the help of these skip connections, we can train extremely deep networks which exploit the power of depth to capture complex patterns in data.</li>
</ul>
<p><img src="/assets/resnets/skip_connection_kiank.png" alt="image-center" class="align-center" /></p>
<ul style="text-align: justify;">
<li>Skip connections keep the input from dying out, since it is hard-wired to the next layers. This helps suppress the effect of vanishing gradients to a remarkable extent compared to other techniques.</li>
</ul>
<p style="text-align: justify;">The following figure is another example of a residual block, with convolution, batch normalization and activations.</p>
<p><img src="/assets/resnets/idblock2_kiank.png" alt="image-center" class="align-center" /></p>
<h3 id="the-convolutional-block">The convolutional block</h3>
<p style="text-align: justify;">So far we have discussed the identity residual block. There is one more type of residual block used in a ResNet; the choice between them depends mainly on whether the input/output dimensions are the same or different. What does that mean?</p>
<p style="text-align: justify;">Say we are building a block from the 2nd layer to the 5th layer, with the shortcut summed up at the output of the 5th layer. Summing up requires equal-dimension vectors, so the number of activations from the 2nd layer and the 5th layer should be the same. If it is not, then we need an additional step in the shortcut connection which can settle this dimension issue. That type of residual block is the convolutional block.</p>
<p style="text-align: justify;">The CONV2D layer in the shortcut path is used to resize the input $x$ to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path.</p>
<p><img src="/assets/resnets/convblock_kiank.png" alt="image-center" class="align-center" /></p>
<p style="text-align: justify;">For example, to reduce the activation’s height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that resizes the input, so that the dimensions match up for the later addition step.</p>
<p style="text-align: justify;">Then our equations will change a bit,</p>
\[H(x)\ =\ F(x,\ \{W_{i}\})\ +\ W_{s}x\]
<p style="text-align: justify;">Where $W_{s}$ is called a linear projection.</p>
<blockquote>
<p>Linear Projection. What a nice piece of jargon :D</p>
</blockquote>
<blockquote>
<p>True :D</p>
</blockquote>
<p style="text-align: justify;">A linear projection is solely used for matching dimensions, since it is empirically proved that an identity mapping is sufficient to solve the degradation problem.</p>
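<p style="text-align: justify;">Here is a toy NumPy sketch of the projection shortcut (the shapes are made up for illustration): the main path maps a 6-dim input to a 3-dim output, so the shortcut applies a learned $W_{s}$ of shape 3x6 before the addition.</p>

```python
import numpy as np

# H(x) = F(x) + Ws @ x: when the main path changes the dimension,
# the shortcut must project x to the same shape before adding.
rng = np.random.default_rng(2)
x = rng.normal(size=6)

W = rng.normal(size=(3, 6))           # main-path weights; F here is just one
F_x = np.maximum(W @ x, 0.0)          # linear layer + ReLU, for brevity

Ws = rng.normal(size=(3, 6))          # the linear projection on the shortcut
H = F_x + Ws @ x                      # dimensions now match: both are 3-dim
assert H.shape == (3,)
```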
<p style="text-align: justify;">Stacking these blocks can help us build Residual Networks. Authors have proposed various models based on them. In the programming session, we’ll build and train a ResNet model named ResNet50, where 50 means 50 layers.</p>
<p style="text-align: justify;">So, that’s it. We’ve learned ResNets neatly. In <a href="https://github.com/sleebapaul/res_nets_tutorial/blob/master/Residual%20Networks%20-%20Coding%20Session.ipynb" target="_blank">programming session</a> we’ll convert knowledge to code. See you there :)</p>
<p style="text-align: justify;">I strongly recommend you read the paper once you complete the tutorial. The authors explain their experiment setups on various datasets and competitions. I would like you to read about <a href="https://arxiv.org/pdf/1505.00387.pdf" target="_blank">Highway Networks</a> too, since Residual Networks draw inspiration from that work, though Highway Networks have many drawbacks when compared to ResNets.</p>
<p style="text-align: justify;">If you think you would like to explore more, <a href="https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035" target="_blank">this link</a> covers the current research innovations on ResNets.</p>
<p><a href="https://github.com/sleebapaul/res_nets_tutorial/blob/master/Residual%20Networks%20-%20Coding%20Session.ipynb" target="_blank">See you in programming session</a>
Happy learning :)</p>