Machine Learning
Machine learning systems learn how to combine input to produce useful predictions on never-before-seen data.
Fundamental concepts in machine learning
Feature (x) - an input variable
Label (y) - the thing we are predicting
Example - a particular instance of data x
- labeled example - an instance of features (x) with a label (y)
- unlabeled example - an instance of features (x) without a label (y)
Training - creating or learning the model; show the model labeled examples so it gradually learns the relationships between features and label.
Inference - applying the trained model to unlabeled examples to make predictions.
Model types
- Regression - predict continuous value
- Classification - predict discrete value
Linear Regression
y = mx + c
y = b + wx
y - label
x - feature
b - bias
w - weight of the feature
To infer (predict) y, substitute the feature value for x.
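A minimal Python sketch of this inference step; the bias and weight values below are made up for illustration, not learned from data.
# Linear model: y = b + w * x
def predict(x, b=0.5, w=2.0):
    # Substitute the feature value for x to get the predicted label y
    return b + w * x

print(predict(3.0))  # 0.5 + 2.0 * 3.0 = 6.5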
Empirical risk minimization - in supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss.
Loss - a number showing how bad the model's prediction was on a single example. If the prediction is perfect, the loss is zero; otherwise, the loss is greater.
Squared loss - the square of the difference between the label and the prediction: (label - prediction)^2
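A small Python sketch of squared loss on one example and its mean over a set of examples (mean squared error); the numbers are illustrative only.
def squared_loss(label, prediction):
    # Square of the difference between the label and the prediction
    return (label - prediction) ** 2

def mean_squared_error(labels, predictions):
    # Average squared loss over all examples
    return sum(squared_loss(y, p) for y, p in zip(labels, predictions)) / len(labels)

print(squared_loss(3.0, 2.5))                       # 0.25
print(mean_squared_error([3.0, 1.0], [2.5, 1.5]))   # 0.25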
Interpretation
http://www.leeds.ac.uk/educol/documents/00003759.htm
https://www.khanacademy.org/math/calculus-home/taking-derivatives-calc
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives
Reducing Loss
An iterative approach to training a model
A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.
Iterate until the overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged. (This iterative approach avoids examining every possible value of w1.)
Convex problems have only one minimum; that is, only one place where the slope is exactly 0.
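A rough sketch of this iterative loop for the one-feature linear model, assuming squared loss; update_weights is a stand-in for whatever adjustment rule is used (for example, the gradient descent step described next). Training stops when the overall loss barely changes between iterations, i.e. the model has converged.
def train(xs, ys, update_weights, w1=0.0, b=0.0, tolerance=1e-6, max_steps=10000):
    previous_loss = float("inf")
    for _ in range(max_steps):
        predictions = [b + w1 * x for x in xs]
        loss = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        if abs(previous_loss - loss) < tolerance:
            break  # loss stopped changing (or changes extremely slowly): converged
        previous_loss = loss
        w1, b = update_weights(w1, b, xs, ys)  # adjust the guesses and iterate
    return w1, b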
Using gradient descent to train a model
The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. The gradient of the loss curve is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
A gradient is a vector with two characteristics:
- a direction
- a magnitude
Hyperparameters are the knobs that programmers tweak in machine learning algorithms.
If the learning rate is too small, learning will take too long.
If the learning rate is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong.
The Goldilocks learning rate is related to how flat the loss function is. If you know the gradient of the loss function is small, you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
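A minimal gradient descent sketch for the one-feature linear model with squared loss; the learning rate of 0.05 and the toy data are illustrative choices, not recommendations.
def gradient_step(w, b, xs, ys, learning_rate=0.05):
    n = len(xs)
    # Partial derivatives of mean squared error with respect to w and b
    dw = sum(-2 * x * (y - (b + w * x)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (b + w * x)) for x, y in zip(xs, ys)) / n
    # Step in the direction opposite the gradient, scaled by the learning rate
    return w - learning_rate * dw, b - learning_rate * db

xs, ys = [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]  # toy data generated from y = 1 + 2x
w, b = 0.0, 0.0                            # starting point: weight and bias set to 0
for _ in range(1000):
    w, b = gradient_step(w, b, xs, ys)
print(round(w, 2), round(b, 2))            # approaches w = 2.0, b = 1.0
This gradient_step function is exactly the kind of update_weights rule the earlier training loop expects.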
Stochastic gradient descent (SGD)
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. A large data set with randomly sampled examples probably contains redundant data; in fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
Stochastic gradient descent (SGD) gets the right gradient on average for much less computation by choosing examples at random from the data set: we can estimate (albeit noisily) a big average from a much smaller one. SGD uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
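A sketch of mini-batch SGD using the same kind of update but computing the gradient on a small random sample per iteration; the batch size of 32 is an arbitrary illustrative choice (a batch size of 1 would be plain SGD).
import random

def minibatch_sgd_step(w, b, xs, ys, batch_size=32, learning_rate=0.05):
    # Choose a small random batch instead of the full data set
    indices = random.sample(range(len(xs)), min(batch_size, len(xs)))
    bx = [xs[i] for i in indices]
    by = [ys[i] for i in indices]
    n = len(bx)
    dw = sum(-2 * x * (y - (b + w * x)) for x, y in zip(bx, by)) / n
    db = sum(-2 * (y - (b + w * x)) for x, y in zip(bx, by)) / n
    return w - learning_rate * dw, b - learning_rate * db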
Tensorflow
Big Data
Big data is not about the size of the data; it is about the value within the data.
Big Data Analytics
A collection of frameworks for generating valuable models (e.g., regression equations) from data; a toy MapReduce-style sketch follows the list below.
- MapReduce Framework
- Hadoop Distributed File System (HDFS)
- Cluster
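A toy, single-machine Python illustration of the MapReduce idea (a map phase emits key/value pairs, a reduce phase aggregates values per key); real MapReduce/Hadoop jobs distribute these phases across a cluster and store data on HDFS.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1  # emit a (key, value) pair for each word

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count  # aggregate all values that share a key
    return dict(counts)

print(reduce_phase(map_phase(["big data", "big value"])))
# {'big': 2, 'data': 1, 'value': 1}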
Data in Big Data
- Structured data
- Semi-structured data
- Unstructured data
Big Data 4V
Justification for deploying big data along four dimensions
- Volume - Data Quantity
- Velocity - Data Speed
- Variety - Data Types
- Veracity - Data trustworthiness (analytics)
Justification of the 4 Vs via two levels
- Distributed Computation
- Distributed Storage
Data Characteristics
- Activity Data
- Conversation Data
- Photo and Video image data
- Sensor Data
- The Internet of Things Data
AI - Artificial Intelligence
Machine Learning
Creating algorithms that learn from data
Deep Learning
Using deep neural networks (NN) to automatically learn hierarchical representations
Machine Learning Tasks (a short scikit-learn sketch of a few of these follows the list)
- Classification - Predict a class of an object
- Regression - Predict a continuous value for an object
- Clustering - group similar object together
- Dimensionality reduction - "compress" data from a high-dimensional representation into a lower-dimensional one
- Ranking
- Recommendations - filter a small subset of objects from a large collection and recommend them to a user
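A quick sketch of a few of these tasks using scikit-learn (assumed installed); the random toy data and parameter choices are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.random.rand(100, 5)                                  # 100 objects, 5 features
y_continuous = X @ np.array([1.0, 2.0, 0.5, 0.0, 3.0])      # continuous target
y_class = (y_continuous > y_continuous.mean()).astype(int)  # discrete target

reg = LinearRegression().fit(X, y_continuous)               # regression
clf = LogisticRegression().fit(X, y_class)                  # classification
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # clustering
X_2d = PCA(n_components=2).fit_transform(X)                 # dimensionality reduction: 5 -> 2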
Deep Learning
Commonly used deep learning toolkits
- Caffe
- CNTK
- Tensorflow
- Theano
- Torch
Commonly used networks (a minimal ConvNet sketch follows this list)
- ConvNets: AlexNet, OxfordNet, GoogleNet
- RecurrentNets: plain RNN, LSTM/GRU, bidirectional RNN
- Sequential modeling with attention.
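A minimal ConvNet sketch using the Keras API in TensorFlow (assumed installed); the 28x28 grayscale input, layer sizes, and 10-class output are illustrative and not tied to any particular dataset or to the named architectures above.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 output classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()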
https://github.com/zer0n/deepframeworks/blob/master/README.md?utm_source=tuicool&utm_medium=referral
Reference
Convolutional Neural Networks for Visual Recognition
http://derekwaikl.blogspot.com/2018/06/convolutional-neural-networks-for.html
Deep Learning Tutorials
http://deeplearning.net/tutorial/
Stanford Deep Learning
http://deeplearning.stanford.edu/
Books to read:
Bengio Y. Learning Deep Architectures for AI[J]. Foundations & Trends® in Machine Learning, 2009, 2(1):1-127.
Hinton G E, Salakhutdinov R R. Reducing the Dimensionality of Data with Neural Networks[J]. Science, 2006, 313(5786):504-507.
He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.
Srivastava R K, Greff K, Schmidhuber J. Highway networks. arXiv:1505.00387, 2015.