The purpose of this repo is twofold:
The focus is on knowledge breadth, so this is more of a quick reference than in-depth study material. If you want to learn a specific topic in detail, please refer to other content, or reach out and I'd love to point you to materials I found useful.
I might add some topics from time to time but hey, this should also be a community effort, right? Any pull request is welcome!
Here are the categories:
The only advice I can give about resumes is to describe your past data science / machine learning projects in a specific, quantifiable way. Consider the following two statements:
Trained a machine learning system
and
Designed and deployed a deep learning model to recognize objects using Keras, TensorFlow, and Node.js. The model has 1/30 the model size, 1/3 the training time, 1/5 the inference time, and 2x faster convergence compared with traditional neural networks (e.g., ResNet)
The second is much better because it quantifies your contribution and also highlights specific technologies you used (and therefore have expertise in). This would require you to log what you've done during experiments. But don't exaggerate.
Spend some time going over your resume / past projects to make sure you explain them well.
The resources here are only meant to help you brush up on the topics rather than make you an expert.
Using PySpark API.
Given a data science / machine learning project, what steps should we follow? Here's how I would tackle it:
Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a validation set to evaluate it. For example, k-fold cross-validation divides the data into k folds (or partitions), trains on k-1 folds, and evaluates on the remaining fold. This results in k models/evaluations, which can be averaged to get an overall measure of model performance.
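A minimal sketch of k-fold cross-validation with scikit-learn (the estimator and dataset here are just placeholders for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, evaluate on the remaining fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```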
Underfitting happens when a model is not complex enough to learn well from the data. It is a problem with the model rather than the data size. A potential way to address underfitting is to increase the model complexity (e.g., add higher-order terms for a linear model, increase the depth for tree-based methods, add more layers / neurons for neural networks, etc.).
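For instance, for a linear model we might add polynomial features; a small sketch assuming scikit-learn (the toy data and the degree are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

linear = LinearRegression().fit(X, y)                 # too simple: underfits
cubic = make_pipeline(PolynomialFeatures(degree=3),   # higher-order terms
                      LinearRegression()).fit(X, y)
print(linear.score(X, y), cubic.score(X, y))
```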
For neural networks, we can also use leaky ReLU (`y = 0.01x` when `x < 0`) to address the dead ReLU issue.

To address overfitting, we can use an ensemble method called bagging (bootstrap aggregating), which reduces the variance of the meta learning algorithm. Bagging can be applied to decision trees or other algorithms.
Here is a great illustration of a single estimator vs. bagging.
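Below is a minimal code sketch of the same comparison, assuming scikit-learn (dataset and tree settings are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# many trees, each trained on a bootstrap sample; predictions are aggregated
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)

print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```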
Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line—that is, a decision boundary—that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.
Here’s a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.
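A small sketch contrasting the two approaches, assuming scikit-learn: logistic regression as the discriminative model and Gaussian naive Bayes as the generative one, with toy blobs standing in for the elephants and dogs:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=300, centers=2, random_state=0)  # "elephants" vs "dogs"

discriminative = LogisticRegression().fit(X, y)  # learns the decision boundary p(y|x)
generative = GaussianNB().fit(X, y)              # models p(x|y) per class, then applies Bayes' rule

print(discriminative.predict(X[:5]), generative.predict(X[:5]))
```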
Random forest improves bagging further by adding some randomness. In a random forest, only a subset of features is selected at random to construct each tree (while instances are often not subsampled). The benefit is that random forest decorrelates the trees.
For example, suppose we have a dataset with one very predictive feature and a couple of moderately predictive features. In bagged trees, most of the trees will use the very predictive feature at the top split, making most of the trees look similar and highly correlated. Averaging many highly correlated results won't lead to a large reduction in variance compared with averaging uncorrelated results. In a random forest, for each split we only consider a subset of the features, which reduces the variance further by introducing more uncorrelated trees.
I wrote a notebook to illustrate this point.
In practice, tuning a random forest entails using a large number of trees (the more the better, but always consider the computation constraint) and setting `min_samples_leaf` (the minimum number of samples at a leaf node) to control the tree size and overfitting. Always cross-validate the parameters.
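A minimal tuning sketch assuming scikit-learn (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# fix a large number of trees, cross-validate min_samples_leaf
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 20]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```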
How it works
Boosting builds on weak learners in an iterative fashion. In each iteration, a new learner is added, while all existing learners are kept unchanged. All learners are weighted based on their performance (e.g., accuracy), and after a weak learner is added, the data are re-weighted: examples that are misclassified gain more weight, while examples that are correctly classified lose weight. Thus, future weak learners focus more on examples that previous weak learners misclassified.
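The re-weighting scheme described above is essentially what AdaBoost does; a minimal sketch assuming scikit-learn, with decision stumps as the weak learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

weak_learner = DecisionTreeClassifier(max_depth=1)   # a "stump"
# stumps are added one at a time; misclassified examples are re-weighted
boosted = AdaBoostClassifier(weak_learner, n_estimators=100, random_state=0)

print(cross_val_score(boosted, X, y, cv=5).mean())
```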
Difference from random forest (RF)
XGBoost (Extreme Gradient Boosting)
XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance.
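A minimal sketch assuming the xgboost Python package; the parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth, learning_rate and reg_lambda are the knobs that help control overfitting
model = XGBClassifier(n_estimators=200, max_depth=4,
                      learning_rate=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```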
A feedforward neural network with multiple layers. In each layer we can have multiple neurons, and each neuron in the next layer is a linear/nonlinear combination of all the neurons in the previous layer. To train the network, we backpropagate the errors layer by layer. In theory an MLP can approximate any function.
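A small MLP sketch assuming Keras (via TensorFlow); the layer sizes and toy data are illustrative:

```python
import numpy as np
from tensorflow import keras

# toy data: 1000 examples with 20 features, binary target
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),    # linear combination + nonlinearity
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)  # errors backpropagated layer by layer
```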
The Conv layer is the building block of a convolutional network. The Conv layer consists of a set of learnable filters (such as 5 * 5 * 3, width * height * depth). During the forward pass, we slide (or more precisely, convolve) each filter across the input and compute the dot product. Learning again happens when the network backpropagates the error layer by layer.
Initial layers capture low-level features such as angles and edges, while later layers learn combinations of the low-level features from the previous layers and can therefore represent higher-level features, such as shapes and object parts.
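A small convolutional network sketch assuming Keras; the input shape and filter counts are illustrative:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),                              # width * height * depth
    keras.layers.Conv2D(16, kernel_size=5, activation="relu"),   # 16 learnable 5x5x3 filters
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),   # later layer: higher-level features
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```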
RNN is another paradigm of neural network where we have different layers of cells, and each cell takes as input not only the cell from the previous layer, but also the previous cell within the same layer. This gives RNN the power to model sequences.
This seems great, but in practice RNN barely works due to the exploding/vanishing gradient problem, which is caused by a series of multiplications of the same matrix. To solve this, we can use a variation of RNN, called long short-term memory (LSTM), which is capable of learning long-term dependencies.
The math behind LSTM can be pretty complicated, but intuitively LSTM introduces gates that control how much of the old state to keep and how much of the new input to take in.

LSTM resembles human memory: it forgets old stuff (old internal state * forget gate) and learns from new input (input node * input gate)
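A minimal LSTM sketch assuming Keras; the sequence length and feature count are illustrative:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(100, 8)),          # sequences of 100 steps, 8 features per step
    keras.layers.LSTM(32),                # gates control what to forget / what to take from the input
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```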
scikit-learn implements many clustering algorithms. Below is a comparison adapted from its page.
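In addition, here is a minimal clustering sketch using scikit-learn's KMeans (one of the algorithms in that comparison); the number of clusters is illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster assignment for the first few points
```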
Here is a visual explanation of PCA
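And a minimal PCA sketch assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)              # project onto the top 2 principal components
print(pca.explained_variance_ratio_)     # fraction of variance each component explains
```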
[TODO]
- Tokenization: chopping text into tokens. For example, `The quick brown fox jumped over the lazy dog`. In this case each word (separated by space) would be a token. Tokenization is not always clear-cut: `O'Neill` can be tokenized to `o` and `neill`, `oneill`, or `o'neill`, and `aren't` can be split into `aren` and `t`.
- Stemming vs. lemmatization: given the token `saw`, stemming might return just `s`, whereas lemmatization would attempt to return either `see` or `saw` depending on whether the use of the token was as a verb or a noun.
- N-grams: for the sentence `The quick brown fox jumped over the lazy dog.`, the bigrams are `the quick`, `quick brown`, `brown fox`, ..., i.e., every two consecutive words (or tokens), and the trigrams are `the quick brown`, `quick brown fox`, `brown fox jumped`, ..., i.e., every three consecutive words (or tokens).
- Bag of words: given two sentences, (1) `John likes to watch movies, especially horror movies.` and (2) `Mary likes movies too.`, we would first build a vocabulary of unique words (all lower case and ignoring punctuation): `[john, likes, to, watch, movies, especially, horror, mary, too]`. Then we can represent each sentence using term frequency, i.e., the number of times a term appears. So (1) would be `[1, 1, 1, 1, 2, 1, 1, 0, 0]` and (2) would be `[0, 1, 0, 0, 1, 0, 0, 1, 1]`.
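The same bag-of-words representation can be reproduced with scikit-learn's CountVectorizer (note that its vocabulary is sorted alphabetically, so the column order differs from the hand-built example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "John likes to watch movies, especially horror movies.",
    "Mary likes movies too.",
]
vectorizer = CountVectorizer()             # lowercases and strips punctuation by default
counts = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # vocabulary (alphabetical order)
print(counts.toarray())                    # term frequencies per sentence
```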
The software utility cron is a time-based job scheduler in Unix-like computer operating systems. People who set up and maintain software environments use cron to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals. It typically automates system maintenance or administration -- though its general-purpose nature makes it useful for things like downloading files from the Internet and downloading email at regular intervals.
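For example, a (hypothetical) crontab entry that runs a backup script every day at 2:30 AM could look like this:

```
# minute hour day-of-month month day-of-week  command (script path is hypothetical)
30 2 * * * /home/user/backup.sh
```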
Tools:
Using Ubuntu as an example:
- Switch to the root user: `sudo su`
- Install a package: `sudo apt-get install <package>`
Confession: some images are adapted from the internet without proper credit. If you are the author and this is an issue for you, please let me know.