Machine Learning
Machine learning systems learn how to combine input to produce useful predictions on never-before-seen data.
Fundamental concepts in machine learning
Feature (x) - an input variable
Label (y) - the thing we are predicting
Example - a particular instance of data x
- labeled example - an instance of features (x) with a label (y)
- unlabeled example - an instance of features (x) without a label (y)
Training - creating or learning the model; show the model labeled examples so it gradually learns the relationships between features and label.
Inference - applying the trained model to unlabeled examples to make predictions.
Model types
- Regression - predict continuous value
- Classification - predict discrete value
Linear Regression
y = mx + c
y = b + wx
y - label
x - feature
b - bias
w - weight of the feature
To infer (predict) y, substitute the feature value for x.
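A minimal Python sketch of this inference step; the bias and weight values below are made up for illustration, not learned from data.
# Linear model: y = b + w * x
def predict(x, b=0.5, w=2.0):
    # Substitute the feature value for x to get the predicted label y
    return b + w * x

print(predict(3.0))  # 0.5 + 2.0 * 3.0 = 6.5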
Empirical risk minimization - in supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss.
Loss - a number showing how bad the model's prediction was on a single example. If the prediction is perfect, the loss is zero; otherwise, the loss is greater.
Squared loss - the square of the difference between the label and the prediction: (label - prediction)^2
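A small Python sketch of squared loss on one example and its mean over a set of examples (mean squared error); the numbers are illustrative only.
def squared_loss(label, prediction):
    # Square of the difference between the label and the prediction
    return (label - prediction) ** 2

def mean_squared_error(labels, predictions):
    # Average squared loss over all examples
    return sum(squared_loss(y, p) for y, p in zip(labels, predictions)) / len(labels)

print(squared_loss(3.0, 2.5))                       # 0.25
print(mean_squared_error([3.0, 1.0], [2.5, 1.5]))   # 0.25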
Interpretation
http://www.leeds.ac.uk/educol/documents/00003759.htm
https://www.khanacademy.org/math/calculus-home/taking-derivatives-calc
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives
Reducing Loss
An iterative approach to training a model
A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.
Iterate until the overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged. (This iterative approach avoids examining every possible value of w1.)
Convex problems have only one minimum; that is, only one place where the slope is exactly 0.
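A rough sketch of this iterative loop for the one-feature linear model, assuming squared loss; update_weights is a stand-in for whatever adjustment rule is used (for example, the gradient descent step described next). Training stops when the overall loss barely changes between iterations, i.e. the model has converged.
def train(xs, ys, update_weights, w1=0.0, b=0.0, tolerance=1e-6, max_steps=10000):
    previous_loss = float("inf")
    for _ in range(max_steps):
        predictions = [b + w1 * x for x in xs]
        loss = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        if abs(previous_loss - loss) < tolerance:
            break  # loss stopped changing (or changes extremely slowly): converged
        previous_loss = loss
        w1, b = update_weights(w1, b, xs, ys)  # adjust the guesses and iterate
    return w1, b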
Using gradient descent to train a model
The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. The gradient of the loss curve is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
A gradient is a vector with two characteristics:
- a direction
- a magnitude
Hyperparameters are the knobs that programmers tweak in machine learning algorithms.
If the learning rate is too small, learning will take too long.
If the learning rate is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong.
The Goldilocks learning rate is related to how flat the loss function is. If you know the gradient of the loss function is small, you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
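A minimal gradient descent sketch for the one-feature linear model with squared loss; the learning rate of 0.05 and the toy data are illustrative choices, not recommendations.
def gradient_step(w, b, xs, ys, learning_rate=0.05):
    n = len(xs)
    # Partial derivatives of mean squared error with respect to w and b
    dw = sum(-2 * x * (y - (b + w * x)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (b + w * x)) for x, y in zip(xs, ys)) / n
    # Step in the direction opposite the gradient, scaled by the learning rate
    return w - learning_rate * dw, b - learning_rate * db

xs, ys = [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]  # toy data generated from y = 1 + 2x
w, b = 0.0, 0.0                            # starting point: weight and bias set to 0
for _ in range(1000):
    w, b = gradient_step(w, b, xs, ys)
print(round(w, 2), round(b, 2))            # approaches w = 2.0, b = 1.0
This gradient_step function is exactly the kind of update_weights rule the earlier training loop expects.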
Stochastic gradient descent (SGD)
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. A large data set with randomly sampled examples probably contains redundant data; in fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
Stochastic gradient descent (SGD) gets the right gradient on average for much less computation by choosing examples at random from the data set: we can estimate (albeit noisily) a big average from a much smaller one. SGD uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
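A sketch of mini-batch SGD using the same kind of update but computing the gradient on a small random sample per iteration; the batch size of 32 is an arbitrary illustrative choice (a batch size of 1 would be plain SGD).
import random

def minibatch_sgd_step(w, b, xs, ys, batch_size=32, learning_rate=0.05):
    # Choose a small random batch instead of the full data set
    indices = random.sample(range(len(xs)), min(batch_size, len(xs)))
    bx = [xs[i] for i in indices]
    by = [ys[i] for i in indices]
    n = len(bx)
    dw = sum(-2 * x * (y - (b + w * x)) for x, y in zip(bx, by)) / n
    db = sum(-2 * (y - (b + w * x)) for x, y in zip(bx, by)) / n
    return w - learning_rate * dw, b - learning_rate * db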
Tensorflow
Big Data
Big data is not about the size of the data; it is about the value within the data.
Big Data Analytics
A collection of frameworks for generating valuable models (e.g., regression equations) from data; a toy MapReduce-style sketch follows the list below.
- MapReduce Framework
- Hadoop Distributed File System (HDFS)
- Cluster
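A toy, single-machine Python illustration of the MapReduce idea (a map phase emits key/value pairs, a reduce phase aggregates values per key); real MapReduce/Hadoop jobs distribute these phases across a cluster and store data on HDFS.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1  # emit a (key, value) pair for each word

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count  # aggregate all values that share a key
    return dict(counts)

print(reduce_phase(map_phase(["big data", "big value"])))
# {'big': 2, 'data': 1, 'value': 1}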
Data in Big Data
- Structured data
- Semi-structured data
- Unstructured data
Big Data 4V
Justification for deploying big data along four dimensions
- Volume - Data Quantity
- Velocity - Data Speed
- Variety - Data Types
- Veracity - Data trustworthiness (analytics)
Justification of the 4 Vs via two levels
- Distributed Computation
- Distributed Storage
Data Characteristics
- Activity Data
- Conversation Data
- Photo and Video image data
- Sensor Data
- The Internet of Things Data
AI - Artificial Intelligence
Machine Learning
Creating algorithms that learn from data
Deep Learning
Using deep neural networks (NN) to automatically learn hierarchical representations
Machine Learning Tasks (a short scikit-learn sketch of a few of these follows the list)
- Classification - Predict a class of an object
- Regression - Predict a continuous value for an object
- Clustering - group similar object together
- Dimensionality reduction - "compress" data from a high-dimensional representation into a lower-dimensional one
- Ranking
- Recommendations - filter a small subset of objects from a large collection and recommend them to a user
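A quick sketch of a few of these tasks using scikit-learn (assumed installed); the random toy data and parameter choices are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.random.rand(100, 5)                                  # 100 objects, 5 features
y_continuous = X @ np.array([1.0, 2.0, 0.5, 0.0, 3.0])      # continuous target
y_class = (y_continuous > y_continuous.mean()).astype(int)  # discrete target

reg = LinearRegression().fit(X, y_continuous)               # regression
clf = LogisticRegression().fit(X, y_class)                  # classification
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # clustering
X_2d = PCA(n_components=2).fit_transform(X)                 # dimensionality reduction: 5 -> 2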
Deep Learning
Commonly used deep learning toolkits
- Caffe
- CNTK
- Tensorflow
- Theano
- Torch
Commonly used networks (a minimal ConvNet sketch follows this list)
- ConvNets: AlexNet, OxfordNet, GoogleNet
- RecurrentNets: plain RNN, LSTM/GRU, bidirectional RNN
- Sequential modeling with attention.
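A minimal ConvNet sketch using the Keras API in TensorFlow (assumed installed); the 28x28 grayscale input, layer sizes, and 10-class output are illustrative and not tied to any particular dataset or to the named architectures above.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 output classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()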
https://github.com/zer0n/deepframeworks/blob/master/README.md?utm_source=tuicool&utm_medium=referral
Reference
Convolutional Neural Networks for Visual Recognition
http://derekwaikl.blogspot.com/2018/06/convolutional-neural-networks-for.html
Deep Learning Tutorials
http://deeplearning.net/tutorial/
Stanford Deep Learning
http://deeplearning.stanford.edu/
Books to read:
Bengio Y. Learning Deep Architectures for AI[J]. Foundations & Trends® in Machine Learning, 2009, 2(1):1-127.
Hinton G E, Salakhutdinov R R. Reducing the Dimensionality of Data with Neural Networks[J]. Science, 2006, 313(5786):504-507.
He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.
Srivastava R K, Greff K, Schmidhuber J. Highway networks. arXiv:1505.00387, 2015.