Intro. to CNN
https://www.geeksforgeeks.org/introduction-convolution-neural-network/
In a regular Neural Network there are three types of layers:
- 1. Input Layers: It's the layer in which we give input to our model. The number of neurons in this layer is equal to the total number of features in our data (number of pixels in the case of an image).
- 2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers, depending on our model and data size. Each hidden layer can have a different number of neurons, generally greater than the number of features. The output of each layer is computed by matrix multiplication of the previous layer's output with that layer's learnable weights, followed by the addition of learnable biases and then an activation function, which makes the network nonlinear.
- 3. Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid or softmax which converts the output of each class into the probability score of each class.
CNN consists of multiple layers like the input layer, Convolutional layer, Pooling layer, and fully connected layers.
- Conv layer: applies filters to extract features
- Pooling layer: downsamples to reduce computation
- Fully connected layer: makes the final prediction
The network learns the optimal filters through backpropagation and gradient descent.
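To make the layer roles concrete, here is a minimal sketch in Keras; the 28x28 grayscale input shape, the layer sizes, and the 10-class softmax output are illustrative assumptions, not something fixed by the article.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # input layer (28x28 grayscale assumed)
    layers.Conv2D(32, (3, 3), activation="relu"),  # conv layer: apply filters
    layers.MaxPooling2D((2, 2)),                   # pooling layer: downsample
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),        # fully connected: final prediction
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```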
Advantages of Convolutional Neural Networks (CNNs):
- 1. Good at detecting patterns and features in images, videos, and audio signals.
- 2. Robust to translation, rotation, and scaling of the input.
- 3. End-to-end training, no need for manual feature extraction.
- 4. Can handle large amounts of data and achieve high accuracy.
Disadvantages of Convolutional Neural Networks (CNNs):
- 1. Computationally expensive to train and require a lot of memory.
- 2. Can be prone to overfitting if not enough data or proper regularization is used.
- 3. Requires large amounts of labeled data.
- 4. Interpretability is limited, it's hard to understand what the network has learned.
Backpropagation
https://www.geeksforgeeks.org/backpropagation-in-data-mining/
Backpropagation is an algorithm that backpropagates the errors from the output nodes to the input nodes.
Backpropagation is a widely used algorithm for training feedforward neural networks.
The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight via the chain rule, computing the gradient layer by layer, and iterating backward from the last layer to avoid redundant computation of intermediate terms in the chain rule.
Features of Backpropagation:
- 1. It is a gradient descent method, as used in the case of a simple perceptron network with differentiable units.
- 2. It differs from other networks in the process by which the weights are calculated during the learning period of the network.
- 3. Training is done in three stages (sketched below):
- the feed-forward of the input training pattern
- the calculation and backpropagation of the error
- the updating of the weights
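A minimal NumPy sketch of the three training stages on a tiny 2-3-1 network; the sigmoid activations, squared-error loss, layer sizes, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-3-1 network; sizes and hyperparameters are illustrative choices.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
x = np.array([[0.5, -1.2]])   # one input training pattern
t = np.array([[1.0]])         # its target output
lr = 0.1

# Stage 1: feed-forward of the input training pattern.
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)

# Stage 2: calculation and backpropagation of the error (chain rule).
delta2 = (y - t) * y * (1 - y)          # output-layer error term
delta1 = (delta2 @ W2.T) * h * (1 - h)  # hidden-layer error term

# Stage 3: updating of the weights (gradient descent step).
W2 -= lr * (h.T @ delta2); b2 -= lr * delta2.sum(axis=0)
W1 -= lr * (x.T @ delta1); b1 -= lr * delta1.sum(axis=0)
```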
Working of Backpropagation:
It uses supervised learning: the network is trained on input patterns paired with known target outputs.
Need for Backpropagation:
Backpropagation is "backpropagation of errors" and is very useful for training neural networks. It's fast, easy to implement, and simple. Backpropagation does not require any parameters to be set, except the number of inputs. Backpropagation is a flexible method because no prior knowledge of the network is required.
Types of Backpropagation
There are two types of backpropagation networks.
- 1. Static backpropagation: Static backpropagation is a network designed to map static inputs to static outputs. These networks can solve static classification problems such as OCR (Optical Character Recognition).
- 2. Recurrent backpropagation: Recurrent backpropagation is another network, used for fixed-point learning. Activation in recurrent backpropagation is fed forward until a fixed value is reached. Static backpropagation provides an instant mapping, while recurrent backpropagation does not.
Advantages:
- 1. It is simple, fast, and easy to program.
- 2. Only the number of inputs is tuned, not any other parameter.
- 3. It is flexible and efficient.
- 4. No need for users to learn any special functions.
Disadvantages:
- 1. It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
- 2. Performance is highly dependent on the input data.
- 3. Training can take a long time.
- 4. The matrix-based approach is preferred over a mini-batch approach.
Activation Functions
https://www.geeksforgeeks.org/activation-functions-neural-networks/
In the process of building a neural network, one of the choices you get to make is what Activation Function to use in the hidden layer as well as at the output layer of the network.
Elements of a Neural Network
- Input Layer: This layer accepts input features. It provides information from the outside world to the network; no computation is performed at this layer. Nodes here just pass on the information (features) to the hidden layer.
- Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the abstraction provided by any neural network. The hidden layer performs all sorts of computation on the features entered through the input layer and transfers the result to the output layer.
- Output Layer: This layer brings the information learned by the network out to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by calculating the weighted sum and further adding bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The activation function does the non-linear transformation to the input making it capable to learn and perform more complex tasks.
Variants of Activation Function
- Linear Function
Equation: A linear function has the equation of a straight line, i.e. y = x.
No matter how many layers we have, if all are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
Range: -inf to +inf
Uses: The linear activation function is used in just one place: the output layer.
Issues: The derivative of a linear function is a constant that no longer depends on the input x, so gradient-based learning gains nothing from it; stacking linear layers introduces no new behavior into our algorithm.
- Sigmoid Function
Equation: A = 1/(1 + e^-x)
Nature: Non-linear. Notice that for x values between -2 and 2, the y values are very steep: small changes in x bring about large changes in the value of y.
Value Range: 0 to 1
Uses: Usually used in the output layer of a binary classifier, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1 only, the result is easily predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
- Tanh Function
The activation that almost always works better than the sigmoid function is the tanh function, also known as the hyperbolic tangent function. It is actually a mathematically shifted version of the sigmoid function, tanh(x) = 2*sigmoid(2x) - 1, with value range -1 to 1. Both are similar and can be derived from each other.
- ReLU Function
It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network.
Equation: A(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.
Value Range: [0, inf)
Nature: Non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons activated by the ReLU function.
Uses: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and tanh functions.
- Softmax Function
The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.
Nature: Non-linear
Uses: Usually used when handling multiple classes. The softmax function is commonly found in the output layer of image classification problems. It squeezes the output for each class between 0 and 1 and divides by the sum of the outputs.
Output: The softmax function is ideally used in the output layer of the classifier, where we are actually trying to obtain the probabilities that define the class of each input.
The basic rule of thumb: if you really don't know what activation function to use, simply use ReLU, as it is a general-purpose activation function for hidden layers and is used in most cases these days.
If your output is for binary classification, the sigmoid function is a very natural choice for the output layer.
If your output is for multi-class classification, softmax is very useful for predicting the probabilities of each class.
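A small NumPy sketch of the variants discussed above; the test vector is arbitrary.

```python
import numpy as np

def sigmoid(x):                  # value range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # value range (-1, 1); a shifted, scaled sigmoid
    return np.tanh(x)

def relu(x):                     # value range [0, inf)
    return np.maximum(0.0, x)

def softmax(x):                  # outputs are positive and sum to 1
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])   # arbitrary test values
print(sigmoid(x), tanh(x), relu(x), softmax(x))
```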
Intro. to CNN Pooling Layer
https://www.geeksforgeeks.org/cnn-introduction-to-pooling-layer/
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarising the features lying within the region covered by the filter.
Why to use Pooling Layers?
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
The pooling layer summarises the features present in a region of the feature map generated by a convolution layer. So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the position of the features in the input image.
- Max Pooling: keeps the maximum value in each region covered by the filter.
- Avg Pooling: keeps the average of the values in each region.
- Global Pooling: reduces each feature map to a single value (global max or average); all three are sketched below.
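A NumPy sketch of these three operations on a single 4x4 channel; the 2x2 window and stride of 2 are illustrative choices.

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)

def pool2x2(a, op):
    # Slide a 2x2 window with stride 2 over one channel and summarize it.
    h, w = a.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = op(a[i:i+2, j:j+2])
    return out

print(pool2x2(fmap, np.max))    # max pooling
print(pool2x2(fmap, np.mean))   # average pooling
print(fmap.max(), fmap.mean())  # global max / global average pooling
```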
Advantages of Pooling Layer:
- 1. Dimensionality reduction: The main advantage of pooling layers is that they help in reducing the spatial dimensions of the feature maps. This reduces the computational cost and also helps in avoiding overfitting by reducing the number of parameters in the model.
- 2. Translation invariance: Pooling layers are also useful in achieving translation invariance in the feature maps. This means that the position of an object in the image does not affect the classification result, as the same features are detected regardless of the position of the object.
- 3. Feature selection: Pooling layers can also help in selecting the most important features from the input, as max pooling selects the most salient features and average pooling preserves more information.
Disadvantages of Pooling Layer:
- 1. Information loss: One of the main disadvantages of pooling layers is that they discard some information from the input feature maps, which can be important for the final classification or regression task.
- 2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps, which can result in the loss of some fine-grained details that are important for the final classification or regression task.
- 3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the size of the pooling regions and the stride, which need to be tuned in order to achieve optimal performance. This can be time-consuming and requires some expertise in model building.
NN Beginners Guide
https://www.geeksforgeeks.org/neural-networks-a-beginners-guide/
NNs learn to perform tasks by being exposed to various datasets and examples, without any task-specific rules. A NN generates identifying characteristics from the data it is passed, without any pre-programmed understanding of these datasets. NNs are based on computational models for threshold logic, which is a combination of algorithms and mathematics. The components of a typical neural network are neurons, connections (known as synapses), weights, biases, a propagation function, and a learning rule.
There are basically three stages in the learning process.
- 1. The neural network is stimulated by a new environment.
- 2. The free parameters of the neural network are changed as a result of this stimulation.
- 3. The neural network then responds in a new way to the environment because of the changes in its free parameters.
Types of Neural Networks
- MLP: A type of feedforward neural network with three or more layers
- CNN: A neural network that is designed to process input data that has a grid-like structure
- RNN: A neural network that can operate on input sequences of variable length; connections between the neurons form a directed cycle.
- LSTM: A type of RNN
- Seq2Seq: A type of neural network that uses two RNNs to map input sequences to output sequences
- Shallow Neural Network: A neural network with only one hidden layer, often used for simpler tasks or as a building block for larger networks.
For example, a neural network may work with three vectors: a vector of attributes X, a vector of classes Y, and a vector of weights W.
The dimensions must be compatible: the network cannot work with matrices whose numbers of rows and columns for X do not match the corresponding dimensions of Y and W.
Generalization in Neural Networks
https://www.kdnuggets.com/2019/11/generalization-neural-networks.html
Key words: generalisation, overfitting, balance, dropout, ensemble, regularisation
When training a neural network in deep learning, its performance on processing new data is key. Improving the model's ability to generalize relies on preventing overfitting using these important methods.
Generalisation means how good our model is at learning from the given data and applying the learnt information elsewhere.
Example:
Show the network 10 breeds of dogs, telling it they are dogs.
Provide 2 new breeds of dogs it has not seen before.
If the network can tell those 2 new breeds are also dogs, the network has generalised well.
Overfitting prevents generalisation: the model memorizes specifics rather than forming a general understanding.
Need to balance performance with generalization. More parameters may make the network perform better but it also might overfit.
We need to match the neural network's complexity with our data complexity.
Usually, deep learning practitioners first train a neural network with a sufficiently high number of parameters that the model will overfit. So initially, we try to get a model that fits the training data extremely well. Next, we iteratively reduce the number of parameters until the model stops overfitting; this can be considered an optimal neural network. Another technique we can use to prevent overfitting is dropout neurons.
Dropout Neurons
At a particular instant, there will be certain neurons that will not be connected to any other neuron in the network. Each neuron in the network has a certain probability of completely dropping out from the network.
At every instant during training, a different set of neurons will be dropped out in a random fashion. Hence we can say that at each instant we are effectively training a certain subset neural network that has fewer neurons than the original neural network. This subset neural network will change each and every time because of the random nature of the dropout neurons. (Some sets of neurons are trained on some examples or features but other sets are not trained on those specific examples)
Training multiple neural networks and aggregating their learnings is called "ensembling" and usually boosts performance. Using dropout essentially does this while having only one neural network.
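A sketch of dropout in Keras; the layer sizes and the 0.5 drop probability are illustrative assumptions. Note that Keras applies dropout only during training, not at inference.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),   # each neuron is dropped with probability 0.5, training only
    layers.Dense(10, activation="softmax"),
])
```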
Weight Regularization
While training neural networks, there is a possibility that the value of certain weights can become very large. This happens because these weights are focusing on certain features in the training data, which is causing them to increase in value continuously throughout the training process. Because of this, the network overfits on the training data.
One of the other reasons why huge weights are bad for a neural network is the increased input-output variance. When there is a huge weight in the network, the output is very susceptible to small changes in the input, but the neural network should essentially output the same thing for similar inputs. (Analogy: a higher-precision floating-point comparison is more sensitive to tiny differences in the values it compares than a lower-precision one.)
By adding L1 or L2 regularization, the network will limit the growth of its weights since the magnitude of the weights are a part of the loss function, and the network always tries to minimize the loss function.
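A sketch of L2 weight regularization in Keras; the layer size and the 1e-4 coefficient are illustrative assumptions.

```python
from tensorflow.keras import layers, regularizers

# Adds 1e-4 * sum(w^2) for this layer's weights to the training loss,
# penalizing the growth of large weights.
dense = layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))
```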
In deep learning, we usually use a mix of these methods to improve the performance of our neural networks and to improve the generalization of our models.
Understanding the Design of a Convolutional Neural Network
https://machinelearningmastery.com/understanding-the-design-of-a-convolutional-neural-network/
The most important hyperparameter of a convolutional layer is the size of the filter. Usually it is square, and you can consider it as a window or receptive field looking at the input image. Therefore, the higher the resolution of the image, the larger the filter you can expect to use.
On the other hand, a filter that is too large will blur detailed features, because all pixels in the receptive field are combined through the filter into one pixel of the output feature map. Therefore, there is a trade-off in choosing the appropriate filter size.
Stacking two convolutional layers (without any other layers in between) covers the same receptive field as a single convolutional layer with a larger filter. But the typical design nowadays is two layers with small filters stacked together rather than one layer with a larger filter, as there are fewer parameters to train (see the sketch below).
The exception would be a convolutional layer with a 1x1 filter. This is usually found as the beginning layer of a network. The purpose of such a convolutional layer is to combine the input channels into one rather than transforming the pixels. Conceptually, this can convert a color image into grayscale, but usually, you can use multiple ways of conversion to create more input channels than merely RGB for the network.
Also, note that in the above network, you are using Conv2D for a 2D filter. There is also a Conv3D layer for a 3D filter. The difference is whether you apply the filter separately for each channel or feature map or consider the input feature maps stacked up as a 3D array and apply a single filter to transform it altogether. Usually, the former is used as it is more reasonable to consider no particular order in which the feature maps should be stacked.
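A Keras sketch of the trade-off described above; the 32x32 input and 64 channels are illustrative assumptions. Two stacked 3x3 layers cover the same 5x5 receptive field as one 5x5 layer but with fewer parameters (roughly 18c^2 vs 25c^2 weights for c channels).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

c = 64  # channel count; the input size and c are illustrative assumptions

stacked = models.Sequential([
    layers.Input(shape=(32, 32, c)),
    layers.Conv2D(c, (3, 3), padding="same"),
    layers.Conv2D(c, (3, 3), padding="same"),  # two 3x3 layers: 5x5 receptive field
])
single = models.Sequential([
    layers.Input(shape=(32, 32, c)),
    layers.Conv2D(c, (5, 5), padding="same"),  # one 5x5 layer: same receptive field
])
pointwise = layers.Conv2D(1, (1, 1))           # 1x1 filter: mixes channels per pixel

print(stacked.count_params(), "vs", single.count_params())  # stacking uses fewer
```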
Neural Architecture Search: Everything You Need to Know
https://deci.ai/neural-architecture-search/
Neural Architecture Search is a subfield of automated machine learning (AutoML), so let's first understand what that is. Also referred to as automated ML, AutoML is an overarching term that relates to the process of automating the different tasks involved in applying machine learning to real-world problems.
A subfield of automated ML, NAS is a technique that can help discover the best neural networks for a given problem. It automates the designing of DNNs, ensuring higher performance and lower losses than manually designed architectures. It is also much faster than the traditional manual processes.
The general idea behind NAS is to select the optimal architecture from a space of allowable architectures. The selection algorithm relies on a search strategy, which in turn depends on an objective evaluation scheme. The popular convolutional neural network EfficientNet is an example of an architecture generated by NAS.
Neural Architecture Search has three main building blocks that can be categorized in terms of search space, search strategy/algorithm, and evaluation strategy (Elsken et al., 2019). Each of these components can utilize different methods.
- Search space
Defining the operations used to design DNNs.
The NAS search space determines what type of architecture can be discovered by the NAS algorithm. It is defined by a set of operations that specify the overall structure (skeleton) of the network, the type of units or blocks that define the layers, as well as the allowable connectivity between layers to create architectures.
The more elements the search space has, the more complex and versatile it gets. But naturally, as the search space expands, the costs of finding the best architecture also increase. Types of operations used in defining the search space include sequential layer-wise operations, cell-based representation, hierarchical structure, and more.
- Search strategy
The approach used to explore the search space, and the metrics being optimized, are essential to the search strategy and to performance estimation.
The search strategy determines how the NAS algorithm experiments with different neural networks.
There are various methods that optimize search strategies to make the process deliver better results faster and with consistency. Types of search algorithms include random search, neuro-evolutionary methods (Elsken et al., 2019), Bayesian approaches (Mendoza et al., 2016), and reinforcement learning (Zoph and Le, 2016).
Some recent evidence suggests that evolutionary techniques perform just as well as reinforcement learning (Real et al., 2019). Moreover, the evolutionary methods tend to have better "anytime performance" and settle on smaller models. While earlier NAS techniques were based on discrete search spaces, a continuous formulation of the architecture search space has introduced differentiable search methods, which opened the way for gradient based optimization (Liu et al., 2019).
- Random Search
Random search is a NAS algorithm that randomly selects a neural network architecture from the search space. It is an expensive process because, instead of using a more efficient approach, it brute-forces its way through the search space. Depending on the complexity of the search space, a single search can take a large number of GPU days (up to hundreds or thousands).
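A toy sketch of random search over a tiny hand-made search space; the space itself, the MLP builder, the MNIST dataset, and the 2-epoch, subsampled training used as a low-fidelity evaluation are all illustrative assumptions (real NAS systems search over far richer spaces).

```python
import random
import tensorflow as tf
from tensorflow.keras import layers, models

SPACE = {"n_layers": [1, 2, 3], "width": [32, 64, 128], "activation": ["relu", "tanh"]}

def sample():
    # Random search: draw an architecture uniformly from the search space.
    return {k: random.choice(v) for k, v in SPACE.items()}

def build(cfg):
    net = models.Sequential([layers.Input(shape=(784,))])
    for _ in range(cfg["n_layers"]):
        net.add(layers.Dense(cfg["width"], activation=cfg["activation"]))
    net.add(layers.Dense(10, activation="softmax"))
    net.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
    return net

def evaluate(cfg, x_tr, y_tr, x_val, y_val):
    net = build(cfg)
    net.fit(x_tr, y_tr, epochs=2, verbose=0)  # low-fidelity estimate: early exit
    return net.evaluate(x_val, y_val, verbose=0)[1]

(x_tr, y_tr), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_tr, x_val = x_tr.reshape(-1, 784) / 255.0, x_val.reshape(-1, 784) / 255.0
trials = [sample() for _ in range(5)]
best = max(trials, key=lambda c: evaluate(c, x_tr[:5000], y_tr[:5000], x_val, y_val))
print("best architecture found:", best)
```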
- Reinforcement Learning
The model seeks a solution that maximizes rewards. It does this through trial and error, receiving rewards and penalties. This enables making decisions in a complex environment and, for NAS, learning to choose configurations that produce better neural networks.
An example is Multi-Objective Neural Architecture Search (MONAS) with reinforcement learning, which optimizes for scalability while ensuring accuracy and minimizing power consumption.
- Bayesian Optimization
Bayesian optimization is a suite of techniques that are often used to speed up the search process. It is a popular approach for hyperparameter optimization. At a high level, it starts with choosing and evaluating architectures at random, then gradually tuning its search direction based on the evaluation results and information gathered about the performance of different architectures.
Bayesian optimization is expensive for NAS, but in 2020, White et al. achieved success with Bayesian Optimization with Neural Architectures for Neural Architecture Search (BANANAS) by coupling it to a neural predictor. Other accomplishments include Neural Architecture Search with Bayesian Optimisation and Optimal Transport (NASBOT) by Kandasamy et al. (2018) for multi-layer perceptrons and convolutional networks, and A Bayesian Approach for Neural Architecture Search (BayesNAS) by Zhou et al. (2019) for one-shot architecture search.
- Evaluation strategy
Evaluating the performance of the DNN, prior to construction and training.
During a NAS search, the algorithm trains, evaluates, validates, and compares performance before choosing the optimal neural network. Full training on each neural network typically requires a long time and high compute demand - thousands of GPU days.
To reduce the costs of evaluating deep learning models, several strategies can be used (Jin et al., 2019), including:
- Proxy task performance
- Low-fidelity performance estimation - early exit after a few epochs, training over a subset of the data, downscaled models or data
- Weight inheritance
- Weight sharing
- Learning-curve extrapolation
- Network morphism
Combining Search and Evaluation: One-Shot Approach
It's expensive to run search and evaluation independently for a large population of child models. To address this, another group of NAS methods, under the umbrella of one-shot architecture search, uses a search space of sub-architectures belonging to a single super-architecture whose trained weights are shared among all sub-models (Xie et al., 2019). One-shot methods differ according to how the one-shot model is trained.
Efficient Neural Architecture Search (ENAS) by Pham et al. (2018) is a prominent example of a one-shot algorithm; it achieves a 1000X speedup of the search relative to previous techniques. Another prominent result in this vein is the "once for all" technique (Cai et al., 2019).
Here are some examples of models generated with NAS and their uses on architectures:
- Image Classification - Real et al. (2019) evolved an image classifier, AmoebaNet-A, which exceeded hand-designed models for the first time, as well as other SoTA models. They achieved this by modifying the tournament selection evolutionary algorithm, introducing an age property to favor the younger genotypes.
- Meta-learning - Vanschoren (2018) discussed meta-learning and meta-data for learning tasks to automate ML pipelines or neural architectures.
Transfer Learning and NAS
Transfer learning is another AutoML approach that reuses a pre-trained model that was developed for one task as a starting point for working on a new problem. The intuition behind it is that neural architectures trained on a large enough dataset can serve as generic models that generalize for similar types of problems. Transfer learning is popular in deep learning as the learned feature maps can be used to train DNNs with a small amount of data.
In contrast, the underlying idea of NAS is that every dataset, with its accompanying hardware and production environment, has a specific and unique architecture that performs best with it. Unlike transfer learning, NAS offers flexibility and customization, but it requires learning and training the weights for the new architecture.
The Benefits: Why Use NAS?
NAS allows you to:
- Automate the architecture selection and testing process
- Easily achieve better accuracy and inference speed results
- Shorten the iterative and cumbersome process of getting deep learning models to production
Selected Resources for Getting Started with NAS
Using NAS involves a lot of moving parts. Here is a list of selected work from AutoML.org:
- Open-Source Libraries
- NAS Benchmarks to Improve Efficiency
- Best Practices for Using NAS
NAS Limitations
AutoML
https://www.automl.org/automl/
What is AutoML?
Traditional machine learning relies on human experts to perform the following tasks, which AutoML aims to automate:
- Preprocess and clean the data.
- Select and construct appropriate features.
- Select an appropriate model family.
- Optimize model hyperparameters.
- Design the topology of neural networks (if deep learning is used).
- Postprocess machine learning models.
- Critically analyze the results obtained.
AutoML to Advance and Improve Research
Hyperparameter optimization and algorithm configuration provide methods to automate the tedious, time-consuming and error-prone process of tuning hyperparameters to new tasks at hand. We for example provide packages for hyperparameter optimization:
- SMAC3 - a Python re-implementation of the SMAC algorithm
- DEHB: Differential Evolution combined with HyperBand
Architecture Search and Automated Deep Learning
Packages for architecture search and hyperparameter optimization for deep learning include:
- Auto-PyTorch
- AutoKeras
- talos
Genetic Algorithm from MathWorks
https://www.mathworks.com/help/gads/genetic-algorithm.html
The genetic algorithm solves smooth or nonsmooth optimization problems with any type of constraints, including integer constraints. It is a stochastic, population-based algorithm that searches randomly by mutation and crossover among population members.
GA Terminology
https://www.mathworks.com/help/gads/some-genetic-algorithm-terminology.html
- Fitness Functions
The fitness function is the function you want to optimize (=objective function for standard optimization algorithms).
- Individuals
The value of the fitness function for an individual is its score,
e.g. f(x1,x2,x3)=(2x1+1)^2+(3x2+4)^2+(x3-2)^2.
The vector (2, -3, 1), whose length is the number of variables in the problem, is an individual. The score of the individual (2, -3, 1) is f(2, -3, 1) = 51.
An individual is sometimes referred to as a genome and the entries of the vector of an individual as genes.
- Populations and Generations
A population is an array of individuals. For example, if the size of the population is 100 and the number of variables in the fitness function is 3 (each individual has 3 genes), you represent the population by a 100-by-3 matrix. The same individual can appear more than once in the population.
At each iteration, the genetic algorithm performs a series of computations on the current population to produce a new population. Each successive population is called a new generation.
- Diversity
Diversity refers to the average distance between individuals in a population. A population has high diversity if the average distance is large.
Diversity is essential to the genetic algorithm because it enables the algorithm to search a larger region of the space.
A plot of the population is usually more scattered when diversity is high.
- Best Fitness Values
The fitness value of an individual is the value of the fitness function for that individual.
The best fitness value for a population is the smallest fitness value for any individual in the population.
- Parents and Children
The genetic algorithm selects certain individuals in the current population, called parents, and uses them to create individuals in the next generation, called children. Typically, the algorithm is more likely to select parents that have better fitness values.
- Elite
Some of the individuals in the current population that have lower fitness are chosen as elite. These elite individuals are passed to the next population.
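A small Python illustration of this terminology, reusing the fitness function from the Individuals example above (MathWorks minimizes, so the best fitness is the smallest score):

```python
import numpy as np

def f(x1, x2, x3):
    # The fitness function from the Individuals example above.
    return (2*x1 + 1)**2 + (3*x2 + 4)**2 + (x3 - 2)**2

individual = np.array([2, -3, 1])          # a genome; its entries are genes
print(f(*individual))                      # the individual's score: 51

population = np.array([[2, -3, 1],         # a population: one individual per row
                       [0, 0, 0],
                       [1, -1, 2]])
scores = np.array([f(*ind) for ind in population])
print(scores.min())                        # best fitness value = smallest score
```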
Fitness Scaling
https://www.mathworks.com/help/gads/fitness-scaling.html
Scaling the Fitness Scores
Scaled values = expectation values.
Fitness scaling converts the raw fitness scores that are returned by the fitness function to values in a range that is suitable for the selection function. The selection function uses the scaled fitness values to select the parents of the next generation. The selection function assigns a higher probability of selection to individuals with higher scaled values.
The range of the scaled values affects the performance of the genetic algorithm. If the scaled values vary too widely, the individuals with the highest scaled values reproduce too rapidly, taking over the population gene pool too quickly, and preventing the genetic algorithm from searching other areas of the solution space. On the other hand, if the scaled values vary only a little, all individuals have approximately the same chance of reproduction and the search will progress very slowly.
Scaling options: Rank, Top, etc.(?)
Refer to the webpage to see more.
GA
https://www.mathworks.com/help/gads/what-is-the-genetic-algorithm.html
What Is the Genetic Algorithm?
The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions. At each step, the genetic algorithm selects individuals from the current population to be parents and uses them to produce the children for the next generation. Over successive generations, the population "evolves" toward an optimal solution. You can apply the genetic algorithm to solve a variety of optimization problems that are not well suited for standard optimization algorithms, including problems in which the objective function is discontinuous, nondifferentiable, stochastic, or highly nonlinear. The genetic algorithm can address problems of mixed integer programming, where some components are restricted to be integer-valued.
The genetic algorithm differs from a classical, derivative-based, optimization algorithm in two main ways, as summarized in the following:
Classical Algorithm
- Generates a single point at each iteration. The sequence of points approaches an optimal solution.
- Selects the next point in the sequence by a deterministic computation.
- Typically converges quickly to a local solution.
Genetic Algorithm
- Generates a population of points at each iteration. The best point in the population approaches an optimal solution.
- Selects the next population by computation which uses random number generators.
- Typically takes many function evaluations to converge. May or may not converge to a local or global minimum.
How the Genetic Algorithm Works
https://www.mathworks.com/help/gads/how-the-genetic-algorithm-works.html
Outline of the Algorithm
... Produces children from the parents. Children are produced either by making random changes to a single parent (mutation) or by combining the vector entries of a pair of parents (crossover). ...
Initial Population
If you know approximately where the minimal point for a function lies, you should set InitialPopulationRange so that the point lies near the middle of that range.
Creating the Next Generation
The algorithm selects a group of individuals in the current population, called parents, who contribute their genes (the entries of their vectors) to their children. The algorithm usually selects individuals that have better fitness values as parents. You can specify the function that the algorithm uses to select the parents in the SelectionFcn option.
The genetic algorithm creates three types of children for the next generation:
- Elite children are the individuals in the current generation with the best fitness values. These individuals automatically survive to the next generation.
- Crossover children are created by combining the vectors of a pair of parents.
- Mutation children are created by introducing random changes, or mutations, to a single parent.
Crossover Children
At each coordinate of the child vector, the default crossover function randomly selects an entry, or gene, at the same coordinate from one of the two parents and assigns it to the child. For problems with linear constraints, the default crossover function creates the child as a random weighted average of the parents.
Mutation Children
By default, for unconstrained problems the algorithm adds a random vector from a Gaussian distribution to the parent. For bounded or linearly constrained problems, the child remains feasible.
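A NumPy sketch of these two defaults as described above; the parent vectors and the mutation standard deviation of 0.5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p1 = np.array([1.0, 4.0, -2.0])
p2 = np.array([3.0, 0.0, 5.0])

# Crossover child: at each coordinate, take the gene at that coordinate
# from a randomly chosen parent (the default behavior described above).
mask = rng.random(p1.size) < 0.5
crossover_child = np.where(mask, p1, p2)

# Mutation child: add a random Gaussian vector to a single parent
# (the default for unconstrained problems; sigma = 0.5 is an assumption).
mutation_child = p1 + rng.normal(0.0, 0.5, size=p1.size)
```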
As the number of generations increases, the individuals in the population get closer together and approach the minimum point [0 0].
What Is Global Optimization?
https://www.mathworks.com/help/gads/what-is-global-optimization.html
Local vs. Global Optima
A local minimum of a function is a point where the function value is smaller than or equal to the value at nearby points, but possibly greater than at a distant point.
A global minimum is a point where the function value is smaller than or equal to the value at all other feasible points.
Generally, Optimization Toolbox solvers find a local optimum. (This local optimum can be a global optimum.) They find the optimum in the basin of attraction of the starting point.
In contrast, Global Optimization Toolbox solvers are designed to search through more than one basin of attraction. They search in various ways:
- GlobalSearch and MultiStart generate a number of starting points. They then use a local solver to find the optima in the basins of attraction of the starting points.
- ga uses a set of starting points (called the population) and iteratively generates better points from the population. As long as the initial population covers several basins, ga can examine several basins.
- ...
Basins of Attraction
Springer: The basin of attraction of an attracting set is the set of all the initial conditions in the phase space whose trajectories go to that attracting set.
Scholarpedia: For each such attractor, its basin of attraction is the set of initial conditions leading to long-time behavior that approaches that attractor.
GA from GeeksforGeeks
https://www.geeksforgeeks.org/genetic-algorithms
Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data, to direct the search into the region of better performance in the solution space.
Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to a chromosome.
The population of individuals is maintained within the search space. Each individual represents a solution in the search space for the given problem. Each individual is coded as a finite-length vector (analogous to a chromosome) of components. These variable components are analogous to genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score
The population size is static, so room has to be created for new arrivals. So, some individuals die and get replaced by new arrivals, eventually creating a new generation when all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive while the least fit die.
Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged. The algorithm is then said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms
- ...
- Mutation Operator: The key idea is to insert random genes in offspring to maintain the diversity in the population to avoid premature convergence.
The whole algorithm can be summarized as follows (a Python sketch follows the list):
- 1) Randomly initialize population p
- 2) Determine fitness of population
- 3) Until convergence repeat:
- a) Select parents from population
- b) Crossover and generate new population
- c) Perform mutation on new population
- d) Calculate fitness for new population
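A Python sketch of this loop, aimed at the same target string as the C++ example below; the population size, mutation rate, truncation selection, and elitism settings are illustrative assumptions rather than the article's exact implementation.

```python
import random
import string

TARGET = "I love GeeksforGeeks"
GENES = string.ascii_letters + string.digits + string.punctuation + " "
POP_SIZE, MUT_RATE = 100, 0.1   # illustrative parameter choices

def fitness(ind):
    # Number of characters differing from the target (0 means solved);
    # lower is better, matching the Fitness column in the output below.
    return sum(a != b for a, b in zip(ind, TARGET))

def random_individual():
    return "".join(random.choice(GENES) for _ in TARGET)

def mate(p1, p2):
    # Uniform crossover with per-gene mutation.
    child = []
    for g1, g2 in zip(p1, p2):
        r = random.random()
        child.append(random.choice(GENES) if r < MUT_RATE else (g1 if r < 0.55 else g2))
    return "".join(child)

population = [random_individual() for _ in range(POP_SIZE)]
generation = 1
while fitness(min(population, key=fitness)) > 0:
    population.sort(key=fitness)
    elite = population[:10]        # elitism: the best survive unchanged
    parents = population[:50]      # truncation selection: the fitter half mates
    population = elite + [mate(random.choice(parents), random.choice(parents))
                          for _ in range(POP_SIZE - len(elite))]
    generation += 1
print("Solved in generation", generation, "->", min(population, key=fitness))
```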
Example problem and solution using Genetic Algorithms
const string TARGET = "I love GeeksforGeeks";
Generation: 1 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
Generation: 2 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
Generation: 3 String: .#lRWf9k_Ifslw #O$k_ Fitness: 17
Generation: 4 String: .-1Rq?9mHqk3Wo]3rek_ Fitness: 16
Generation: 5 String: .-1Rq?9mHqk3Wo]3rek_ Fitness: 16
Generation: 6 String: A#ldW) #lIkslw cVek) Fitness: 14
Generation: 7 String: A#ldW) #lIkslw cVek) Fitness: 14
Generation: 8 String: (, o x _x%Rs=, 6Peek Fitness: 13
...
Generation: 29 String: I lope Geeks#o,Geeks Fitness: 3
Generation: 30 String: I loMe GeeksfoBGeeks Fitness: 2
Generation: 31 String: I love Geeksfo0Geeks Fitness: 1
Generation: 32 String: I love Geeksfo0Geeks Fitness: 1
Generation: 33 String: I love Geeksfo0Geeks Fitness: 1
Generation: 34 String: I love GeeksforGeeks Fitness: 0
As we can see from the output, our algorithm sometimes gets stuck at a local optimum; this can be improved further by updating the fitness-score calculation or by tweaking the mutation and crossover operators.
Encoding Methods in Genetic Algorithm
https://www.geeksforgeeks.org/encoding-methods-in-genetic-algorithm
Chromosome: All living organisms consist of cells. In each cell, there is the same set of Chromosomes. Chromosomes are strings of DNA and consist of genes, blocks of DNA. Each gene encodes a trait, for example, the color of the eye.
Operation of Genetic Algorithms:
Two important elements required for any problem before a genetic algorithm can be used for a solution are:
- A method for representing a solution, e.g. as a string of bits, numbers, or characters.
- A method for measuring the quality of any proposed solution, using a fitness function, e.g. determining the total weight.
Encoding using string:
Encoding of chromosomes is the first step in solving the problem, and it depends heavily on the problem. It is the process of representing the solution in the form of a string of bits that conveys the necessary information. Just as in a chromosome each gene controls particular characteristics of the individual, each bit in the string represents characteristics of the solution.
Encoding Methods (illustrative code examples follow the list):
- Binary Encoding: The most common method of encoding. Chromosomes are strings of 1s and 0s, and each position in the chromosome represents a particular characteristic of the solution.
- Permutation Encoding: Useful in ordering problems such as the Travelling Salesman Problem (TSP). In TSP, every chromosome is a string of numbers, each of which represents a city to be visited.
- Value Encoding: Used in problems where complicated values, such as real numbers, are needed and where binary encoding would not suffice. Good for some problems, but it is often necessary to develop specific crossover and mutation techniques for these chromosomes.
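Illustrative (made-up) chromosome instances for the three encodings:

```python
# Binary encoding: bit i says whether item i is in the knapsack.
binary_chromosome = [1, 0, 1, 1, 0]
# Permutation encoding: the order in which TSP cities are visited.
permutation_chromosome = [3, 0, 2, 4, 1]
# Value encoding: e.g. real-valued weights of a neural network.
value_chromosome = [0.12, -1.4, 2.73, 0.05]
```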
Introduction to Genetic Algorithms
https://courses.cs.washington.edu/courses/cse473/06sp/GeneticAlgDemo
Genetic algorithms are a part of evolutionary computing, which is a rapidly growing area of artificial intelligence.
History
Evolutionary computing was introduced in the 1960s by I. Rechenberg in his work "Evolution strategies" (Evolutionsstrategie in the original). His idea was then developed by other researchers. Genetic Algorithms (GAs) were invented by John Holland and developed by him and his students and colleagues. This led to Holland's book "Adaptation in Natural and Artificial Systems", published in 1975.
In 1992 John Koza used genetic algorithms to evolve programs to perform certain tasks. He called his method "genetic programming" (GP). LISP programs were used because programs in this language can be expressed in the form of a "parse tree", which is the object the GA works on.
Chromosome
Chromosomes are strings of DNA and serve as a model for the whole organism.
Each gene encodes a particular protein. Basically, it can be said that each gene encodes a trait, for example color of eyes. Possible settings for a trait (e.g. blue, brown) are called alleles (uh-lee-l). Each gene has its own position in the chromosome. This position is called locus (loh-kus, a particular position, point, or place).
A complete set of genetic material (all chromosomes) is called a genome. A particular set of genes in a genome is called a genotype. The genotype, with later development after birth, is the base for the organism's phenotype: its physical and mental characteristics, such as eye color, intelligence, etc.
Reproduction
During reproduction, recombination (or crossover) first occurs. Genes from parents combine to form a whole new chromosome. The newly created offspring can then be mutated.
Search Space
If we are solving a problem, we are usually looking for some solution which will be the best among others. The space of all feasible solutions (the set of solutions among which the desired solution resides) is called search space (also state space). Each point in the search space represents one possible solution. Each possible solution can be "marked" by its value (or fitness) for the problem. With GA we look for the best solution among a number of possible solutions - represented by one point in the search space.
Looking for a solution is then equal to looking for some extreme value (minimum or maximum) in the search space. At times the search space may be well defined, but usually we know only a few points in it. In the process of using a GA, finding solutions generates other points (possible solutions) as evolution proceeds.
The problem is that the search can be very complicated. One may not know where to look for a solution or where to start. There are many methods one can use for finding a suitable solution, but these methods do not necessarily provide the best solution. Some of these methods are hill climbing, tabu search, simulated annealing and the genetic algorithm. The solutions found by these methods are often considered as good solutions, because it is not often possible to prove what the optimum is.
NP-hard Problems
One example of a class of problems which cannot be solved in the "traditional" way is NP problems.
There are many tasks for which we may apply fast (polynomial) algorithms. There are also some problems that cannot be solved algorithmically.
There are many important problems in which it is very difficult to find a solution, but once we have it, it is easy to check the solution. This fact led to NP-complete problems. NP stands for nondeterministic polynomial and it means that it is possible to "guess" the solution (by some nondeterministic algorithm) and then check it.
If we had a guessing machine, we might be able to find a solution in some reasonable time.
The study of NP-complete problems is, for simplicity, restricted to problems where the answer can be yes or no. Because there are tasks with complicated outputs, a class of problems called NP-hard problems has been introduced. This class is not as limited as the class of NP-complete problems.
A characteristic of NP problems is that a simple algorithm, perhaps obvious at first sight, can be used to find usable solutions. But this approach generally provides many possible solutions, and just trying all of them is a very slow process (e.g. O(2^n)). For even slightly bigger instances of these types of problems, this approach is not usable at all.
Today nobody knows if some faster algorithm exists to provide exact answers to NP-problems. The discovery of such algorithms remains a big task for researchers (maybe you! :-)). Today many people think that such algorithms do not exist and so they are looking for an alternative method. An example of an alternative method is the genetic algorithm.
Examples of NP problems are the satisfiability problem, the travelling salesman problem, and the knapsack problem.
Genetic Algorithm
Outline of the Basic Genetic Algorithm
- 1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem)
- 2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population
- 3. [New population] Create a new population by repeating the following steps until the new population is complete:
- [Selection] Select two parent chromosomes from a population according to their fitness (the better the fitness, the bigger the chance to be selected)
- [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover was performed, the offspring is an exact copy of the parents.
- [Mutation] With a mutation probability, mutate the new offspring at each locus (position in chromosome).
- [Accepting] Place the new offspring in the new population
- 4. [Replace] Use the newly generated population for a further run of the algorithm
- 5. [Test] If the end condition is satisfied, stop, and return the best solution in the current population
- 6. [Loop] Go to step 2
Some Comments
As you can see, the outline of the Basic GA is very general. There are many parameters and settings that can be implemented differently in various problems.
The first question to ask is how to create chromosomes and what type of encoding to choose. The two basic operators of GA, crossover and mutation, are then addressed. Encoding, crossover, and mutation are introduced in the next chapter.
The next question is how to select parents for crossover. This can be done in many ways, but the main idea is to select the better parents (best survivors) in the hope that the better parents will produce better offspring.
You may think that generating populations from only two parents may cause you to lose the best chromosome from the last population. This is true, and so elitism is often used. This means that at least one of a generation's best solutions is copied without changes to the new population, so the best solution can survive to the succeeding generation.
You might be wondering why genetic algorithms work. It can be partially explained by the Schema Theorem (Holland); however, this theorem has been criticized in recent times. If you want to know more, check other resources.
Parameters of GA
There are two basic parameters of GA - crossover probability and mutation probability.
...
Mutation generally prevents the GA from falling into local extremes. Mutation should not occur very often, because then the GA would in fact turn into a random search.
There are also some other parameters of GA. Another particularly important parameter is population size.
Population size: how many chromosomes are in a population (in one generation). If there are too few chromosomes, the GA has few possibilities to perform crossover and only a small part of the search space is explored. On the other hand, if there are too many chromosomes, the GA slows down. Research shows that beyond some limit (which depends mainly on the encoding and the problem) it is not useful to use very large populations, because they do not solve the problem faster than moderately sized populations.
Selection
There are many methods in selecting the best chromosomes. Examples are roulette wheel selection, Boltzman selection, tournament selection, rank selection, steady state selection and some others.
Roulette Wheel Selection
A pie proportional to fitness values.
Rank Selection
Problem: if the best chromosome's fitness is 90% of the sum of all fitnesses, then the other chromosomes will have very few chances to be selected.
Rank selection ranks the population first and then every chromosome receives fitness value determined by this ranking. The worst will have the fitness 1, the second worst 2 etc. and the best will have fitness N (number of chromosomes in population).
A pie proportional to ranks. E.g., in Roulette Wheel the pie is proportional to fitness values (e.g. 5:10:35:60:...); in Rank, the individuals are sorted by fitness and assigned ranks from 1 to N, so the pie is proportional to ranks (1:2:3:4:...).
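A Python sketch of the two selection schemes; the fitness values are the illustrative ones above, and higher fitness is assumed to be better here.

```python
import random

fitnesses = [5, 10, 35, 60]   # illustrative raw fitness values; higher = better here

def roulette(values):
    # Spin a pie whose slices are proportional to the given values;
    # return the index of the chosen individual.
    total = sum(values)
    r, acc = random.uniform(0, total), 0.0
    for i, v in enumerate(values):
        acc += v
        if r <= acc:
            return i

def rank_select(fits):
    # Rank selection: worst gets rank 1, best gets rank N, then spin
    # the pie on the ranks instead of the raw fitness values.
    order = sorted(range(len(fits)), key=lambda i: fits[i])
    ranks = [0] * len(fits)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return roulette(ranks)

print(roulette(fitnesses), rank_select(fitnesses))
```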
Steady-State Selection
This is not a particular method of selecting parents. The main idea of this type of selection is that a big part of the chromosomes can survive to the next generation.
The steady-state selection GA works in the following way. In every generation a few good (higher-fitness) chromosomes are selected for creating new offspring. Then some bad (lower-fitness) chromosomes are removed and the new offspring are placed in their place. The rest of the population survives to the new generation.
Elitism
Encoding
Binary Encoding
Example of Encoding: Each bit says whether the corresponding thing is in the knapsack.
Permutation Encoding
Example of Encoding: The chromosome describes the order in which the salesman will visit the cities.
Value Encoding
Example of Encoding: Real values in chromosomes represent weights in the neural network.
Tree Encoding
Example of Encoding: Chromosomes are program functions represented as trees.
Crossover and Mutation
Binary Encoding
Crossover
- Single point crossover
- Two point crossover
- Uniform crossover
- Arithmetic crossover
Mutation
Permutation Encoding
Crossover
Mutation
- Order changing - two numbers are selected and exchanged
Value Encoding
Crossover
- All crossovers from binary encoding can be used
Mutation
- Adding a small number (for real value encoding) - a small number is added to (or subtracted from) selected values
Tree Encoding
Crossover
- Tree crossover - one crossover point is selected in both parents, the parents are divided at that point, and the parts below the crossover points are exchanged to produce new offspring
Mutation
- Changing operator, number - selected nodes are changed
Travelling Salesman Problem
There are cities and given distances between them. The travelling salesman has to visit all of them, but he does not want to travel very much. The task is to find a sequence of cities that minimizes the travelled distance; in other words, find a minimal Hamiltonian tour in a complete graph of N nodes.
You can select crossover and mutation type. The description of their meaning follows:
...
Mutation (the first two types are sketched below)
- Normal random - a few cities are chosen and exchanged
- Random, only improving - a few cities are randomly chosen and exchanged only if doing so improves the solution (increases fitness)
- Systematic, only improving - cities are systematically chosen and exchanged only if doing so improves the solution (increases fitness)
- Random improving - the same as "random, only improving", but the "normal random" mutation is performed first
- Systematic improving - the same as "systematic, only improving", but the "normal random" mutation is performed first
- None - no mutation
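A Python sketch of the first two mutation types; the distance matrix is made up, and since fitness in TSP is higher for shorter tours, "improving" here means reducing tour length.

```python
import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def normal_random(tour):
    # "Normal random": choose two cities at random and exchange them.
    t = tour[:]
    i, j = random.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def random_only_improving(tour, dist, tries=10):
    # "Random, only improving": keep an exchange only if it shortens the
    # tour (a shorter tour means higher fitness).
    best = tour[:]
    for _ in range(tries):
        cand = normal_random(best)
        if tour_length(cand, dist) < tour_length(best, dist):
            best = cand
    return best

dist = [[0, 2, 9, 10],   # made-up symmetric distance matrix for 4 cities
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(tour_length([0, 1, 2, 3], dist))                        # 2+6+3+10 = 21
print(tour_length(random_only_improving([0, 1, 2, 3], dist), dist))
```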
Recommendations: Parameters of GA
Recommendations are often the results of empirical studies of GAs, which were often performed on binary encoding only.
- Crossover rate
The crossover rate should generally be high, about 80%-95%. (However, some results show that for some problems a crossover rate of about 60% is best.)
- Mutation rate
On the other hand, the mutation rate should be very low. The best rates seem to be about 0.5%-1%.
- Population size
It may be surprising that a very big population size usually does not improve the performance of the GA (in the sense of the speed of finding a solution). A good population size is about 20-30, although sometimes sizes of 50-100 are reported as best. Some research also shows that the best population size depends on the size of the encoded string (chromosomes). This means that if you have chromosomes of 32 bits, the population should be larger than for chromosomes of 16 bits.
- Selection
Basic roulette wheel selection can be used, but sometimes rank selection can be better. Check the chapter about selection for advantages and disadvantages. There are also some more sophisticated methods that change the parameters of selection during the run of the GA; basically, these behave similarly to simulated annealing. Elitism should definitely be used if you do not use another method for preserving the best found solution. You can also try steady-state selection.
- Encoding
Encoding depends on the problem and also on the size of the problem instance. Check the chapter about encoding for some suggestions or look to other resources.
- Crossover and mutation type
Operators depend on the chosen encoding and on the problem. Check the chapter about operators for some suggestions. You can also check other sites.
Applications of GA
The advantage of GAs is their parallelism. A GA travels through the search space using more individuals (and with genotype rather than phenotype), so it is less likely to get stuck in a local extreme than other methods.
They are also easy to implement. Once you have the basic GA algorithm implemented, you just have to write a new chromosome (just one object) to solve another problem. With the same encoding, you just change the fitness function and you are done. However, for some problems, choosing and implementing the encoding and fitness function can be difficult.
The disadvantage of GAs is their computational time. GAs can be slower than other methods. But since we can terminate the computation at any time, the longer run is acceptable (especially with faster and faster computers).