ML Interviews Glossary
Need help with some machine learning concepts? Here's a list of common vocabulary used by ML engineers, organized into four categories:
- Data handling
- Model selection and optimization
- Evaluation methods and metrics
- ML in production
Data handling
0-1 scaling: A preprocessing technique (also called min-max scaling) that transforms feature values so they all fall in the range [0, 1]. As with standardization, keeping features on a common scale can help many machine learning algorithms learn effectively.
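For instance, here's a minimal sketch using scikit-learn's MinMaxScaler (the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix; columns might be, e.g., age and income.
X = np.array([[25, 40_000], [35, 60_000], [55, 120_000]], dtype=float)

scaler = MinMaxScaler()            # maps each column to [0, 1]
X_scaled = scaler.fit_transform(X)
# Each column now has min 0.0 and max 1.0.
print(X_scaled)
```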
Akaike information criterion (AIC): A way of evaluating how well a model fits a set of data while also balancing the complexity of the model. Lower AIC values indicate a “better” model: the score is lower when the model has fewer parameters and a higher likelihood of having produced the data. This score can be used for model selection when there are multiple candidate models to choose from.
Bayes information criterion (BIC): A way of evaluating and selecting models, similar to the Akaike information criterion. In practice, the main difference from the AIC is that it applies a stronger penalty to more complex models, especially as sample size increases.
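As a rough sketch, both criteria can be computed from a fitted model's log-likelihood using the common formulas AIC = 2k - 2*ln(L) and BIC = k*ln(n) - 2*ln(L), where k is the number of parameters, n is the sample size, and L is the maximized likelihood (the values below are hypothetical):

```python
import numpy as np

def aic(k, log_likelihood):
    # AIC = 2k - 2*ln(L); lower is better.
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    # BIC = k*ln(n) - 2*ln(L); penalizes parameters more as n grows.
    return k * np.log(n) - 2 * log_likelihood

# Hypothetical fitted model: 3 parameters, log-likelihood -120.5, n = 100.
print(aic(3, -120.5), bic(3, 100, -120.5))
```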
Bucketing (pd.cut): A type of feature engineering that bins continuous values into discrete buckets. For example, if you have a feature representing someone’s age, you could bucket these into age ranges. These buckets could be useful because they could represent important information about patterns in the data.
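For example, a minimal pandas sketch of age bucketing (the bin edges and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 68, 80])

# Bin continuous ages into discrete ranges.
buckets = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                 labels=["child", "young_adult", "adult", "senior"])
print(buckets)
```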
Count vectorizer: A way of feature engineering unstructured text data. It considers all possible tokens that can occur and constructs a vocabulary, typically by iterating through a corpus of text. Every token in the vocabulary is assigned an index. Then, a piece of text will be converted to a vector, where the index of the vector will represent the number of occurrences of the corresponding token in that text.
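A quick sketch with scikit-learn's CountVectorizer (the corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # each row counts tokens per document
```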
Missing data: This refers to values not present for certain features for some number of examples in your data set. There could be various reasons for missing data, which could occur in your training set or at inference time. It’s often necessary to handle these because many machine learning algorithms require the input data to have values defined for all their features. Standard techniques to handle missing data are described below.
- Remove: Removing examples that have missing feature values from your dataset. You can also remove features entirely if there aren’t enough examples with values for that feature.
- Fill-in methods: Filling in the missing values with a different value. There are a variety of methods to fill in data, each with its own pros and cons; a short pandas sketch of two of them follows this list.
- Means: You can fill in missing values with the average of all the non-missing values for a particular feature.
- Linear interpolation: When your data follows a certain order (e.g. a time series), you can fill in missing values by drawing a line between the two data points surrounding one or more missing values and filling in the missing points with the values that lie along that line.
- Clustering and similarities: Groups, also known as “clusters,” can be formed based on the similarity of certain examples in the data. Examples with missing data can be assigned to clusters based on their similarity, and the missing values can be filled in with statistics from that particular cluster (e.g. the mean).
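A minimal pandas sketch of the mean-fill and linear-interpolation methods described above (the series values are made up):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

# Mean fill: replace missing values with the mean of the non-missing values.
mean_filled = s.fillna(s.mean())

# Linear interpolation: draw a line between surrounding points
# (appropriate for ordered data such as time series).
interpolated = s.interpolate(method="linear")

print(mean_filled.tolist())   # [1.0, 3.0, 3.0, 3.0, 5.0]
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```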
One-hot encoding: A way of handling categorical data for machine learning models that require numerical feature values. It considers all the possible values a feature can take on and creates separate features (or columns) for all of those categories. Then, for a particular example, it will assign a value of “0” or “1” to the new columns corresponding to each category based on whether the data point’s feature value was equal to that category.
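For example, with pandas (the column and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One new column per category; 1 where the example's value matches, else 0.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)
```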
Percentile interpolation: Filling in missing values using percentile values of the feature’s distribution (e.g. the 25th, 50th, or 75th percentile).
Remove outliers: In machine learning, outliers could be examples that are not representative of the general patterns in the data. It can be useful to remove these to help models learn and predict outcomes in alignment with what’s seen in the training data. However, this should be done with care as outliers may contain important information, depending on why the example is an outlier.
Standardization: A preprocessing technique that transforms feature values such that the distribution of these values has a mean of 0 and a standard deviation of 1. Keeping feature values on the same scale can help machine learning algorithms learn patterns in the data.
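A matching sketch with scikit-learn's StandardScaler (same made-up values as the 0-1 scaling example above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 40_000], [35, 60_000], [55, 120_000]], dtype=float)

scaler = StandardScaler()       # per-column mean 0, standard deviation 1
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0).round(6))  # ~[0, 0]
print(X_std.std(axis=0).round(6))   # ~[1, 1]
```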
TF-IDF vectorizers: Term frequency-inverse document frequency is a modification of count vectorization that considers both the frequency of a token within a document and the number of documents in which it occurs, to determine the term's relative importance. Instead of taking the raw count for a token, TF-IDF multiplies the term frequency (the relative frequency of the token within the piece of text) by the inverse document frequency (the logarithm of the total number of documents divided by the number of documents containing the token). Different variants of the formula are used, but the general idea is the same: terms that appear frequently within a document but rarely across the corpus receive higher TF-IDF scores.
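A quick scikit-learn sketch on a made-up corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Tokens frequent in one document but rare across the corpus get
# higher weights than ubiquitous tokens like "the".
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```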
Transformations: Operations applied to input data to put it into a form a machine learning model can consume. You can apply transformations to numerical, categorical, and text data.
Word embeddings: A way of feature engineering unstructured text to incorporate semantic meaning. In practice, a word will be assigned a vector (an embedding), and the vector will be learned to represent information about the word itself. An embedding will typically be a component of a machine learning model trained on a meaningful task, such as predicting the next word in a sequence.
Model selection and optimization
Activation functions: Functions applied within neural networks, specifically after a layer of the network. These are used to introduce nonlinearity into the function that the network computes.
Autoregressive integrated moving average (ARIMA): A common statistical analysis model used for forecasting time series data.
Bias-variance tradeoff: The expected generalization error of a machine learning model can be decomposed into its bias (the difference between the predicted outcome and actual outcome) and the variance (the spread and instability of a model’s predictions). There is typically a tradeoff between the two. For example, methods that reduce a model's bias can also increase its variance.
Classification method: A way of classifying examples into different categories given input data.
Coefficient interpretation: One of the main benefits of linear regression. A learned coefficient tells you how much the predicted outcome changes for a one-unit change in the corresponding feature, holding the other features constant. When features are on comparable scales, coefficients can also be used to gauge relative feature importance.
Cosine similarity: A similarity measure for two vectors. It takes their dot product and divides this by their magnitudes, which is mathematically equivalent to the cosine of the angle between the two vectors. The result ranges from -1 (meaning that they are in opposite directions) to 1 (meaning that they are in the same direction). A similarity of 0 means that the two vectors are orthogonal to each other.
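A minimal numpy sketch of the formula:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1, 0]), np.array([0, 1])))  # 0.0 (orthogonal)
print(cosine_similarity(np.array([1, 2]), np.array([2, 4])))  # 1.0 (same direction)
```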
Crop: An image preprocessing technique that takes a part of the image and uses it as a data point to train on. This is commonly used because it can help the model focus on certain parts of the image.
Deep learning: A subfield of machine learning focused on neural networks.
Dropout: A technique in training neural networks where nodes in the network are dropped at random on each iteration. This helps reduce the capacity of the network, which helps reduce overfitting in practice.
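A brief PyTorch sketch of the behavior:

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)   # each element zeroed with probability 0.5
x = torch.ones(8)

layer.train()
print(layer(x))  # roughly half the values zeroed, survivors scaled by 1/(1-p)

layer.eval()
print(layer(x))  # dropout disabled at inference time
```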
Entropy: A measure of impurity or disorder in a set of data. With decision trees, this will be calculated based on the proportions of elements in each respective class.
Epochs: The number of training iterations through the entire dataset.
Euclidean distance: The distance between two points in a Euclidean space, which is the same as the distance formula from 2D points in algebra. However, it can also be calculated with vectors of any dimension.
Exp(): Used commonly in machine learning activation functions, particularly for the sigmoid and softmax. The exponential function has a few useful properties, including creating a smooth curve and being convenient for gradient-based optimization.
Feature importance: With tree-based methods, a relative importance for each feature can be computed by aggregating the split criterion improvements attributable to that feature across the ensemble. For example, feature importance could be computed as the average decrease in Gini impurity for a feature across all trees.
Gini index: A method of determining the best split point for a node in a decision tree. It will calculate the probability of incorrectly classifying an example based on all the examples in a node.
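A minimal sketch of Gini impurity for a node, assuming `labels` holds the class labels of the examples in that node:

```python
from collections import Counter

def gini_impurity(labels):
    # Probability of misclassifying a random example if it is labeled
    # according to the class distribution in the node: 1 - sum(p_i^2).
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (maximally mixed, two classes)
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure node)
```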
Gradient boosting: Another ensemble method that combines predictions from multiple “weak learners.” In practice, these are typically shallow decision trees (in the simplest case, single-split trees called stumps) trained in a stagewise fashion to learn the pattern over time. Each new tree is fit to the gradient of a chosen loss function with respect to the current predictions (for squared error, simply the residuals), moving the ensemble's predictions closer to the actual values.
Gradient method: An optimization method that uses the gradient of a function with respect to its independent variables to take gradual steps toward a local or global minimum of the function.
Hierarchical clustering: A type of clustering that puts clusters into hierarchies. This could be used to create larger groups from smaller subgroups or vice versa.
Information gain: Another method of determining the best split point for a node in a decision tree. Information gain will measure how much entropy was removed after a split in the data. It is calculated by considering the entropy of the labels and subtracting the entropy of the data subsets after the data is split.
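A short sketch of entropy and information gain for a binary split (labels are made up):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy removed by splitting `parent` into `left` and `right`.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # 1.0 (perfect split)
```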
Inverse optimization: An advanced concept that, unlike standard optimization, does not assume the objective function and constraints are known. Instead, it observes the data and works backward to infer the optimal objective function.
K-means clustering: An unsupervised machine learning algorithm that assigns data points to centroids. The centroids are initially set to random values (often random data points). During the learning process, the algorithm assigns each data point to the closest centroid, then recalculates each centroid as the mean of all the data points assigned to it. It repeats these two steps until the centroids stop moving, at which point the algorithm has converged.
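A short scikit-learn sketch on two obvious blobs of made-up points:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroid positions
```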
Lasso regression: A modification of linear regression that introduces a weight penalty to the loss function. More specifically, the L1 norm of the weights is added, which results in certain learned coefficients being reduced to zero. This is a type of regularization used to prevent overfitting to training data. It can also be used for variable selection.
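A quick scikit-learn sketch on synthetic data where only two features matter, illustrating how lasso drives irrelevant coefficients toward zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually influence the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))  # irrelevant features' coefficients driven to ~0
```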
Layers: A subunit of a neural network composed of interconnected nodes. In practice, layers are implemented as matrix multiplications, and different types of layers perform these multiplications in different ways.
Linear regression: A machine learning algorithm that models the relationship between inputs and the output as the line of best fit, i.e. the linear equation that minimizes the mean squared error.
Log odds: Defined by dividing the probability by 1 minus the probability, and taking the natural logarithm. This is most relevant to logistic regression, where rearranging the equation of the model shows that a linear combination of the input features is equal to the log odds. This can be used to interpret logistic regression. Increasing a feature value by 1 will increase the log odds by the value of the coefficient associated with that feature.
Logistic function: The generalized version of the sigmoid function. It squashes real numbers into the range (0, 1) and is typically used to get classification models to output probabilities.
Logistic regression: A parametric model used for classification. It takes a linear combination of the input features and passes it through an activation function to output a predicted probability. The choice of activation depends on whether logistic regression is being used for binary classification (sigmoid) or multi-class classification (softmax).
Loss functions: Functions used to compute the error between the predicted values and actual values. These are crucial to training neural networks, since training aims to choose the weights that minimize the loss on the training data.
Manhattan distance: Measures the distance between two points as the sum of the absolute differences between their respective coordinate values. As with Euclidean distance, this can be calculated with vectors of any dimension.
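A quick numpy sketch of both Manhattan and Euclidean distance (see the Euclidean distance entry above), using made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.linalg.norm(a - b)   # sqrt of the sum of squared differences
manhattan = np.abs(a - b).sum()     # sum of absolute differences

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```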
Nodes: A single unit in a neural network, typically connected to many incoming and outgoing nodes. The values from a node's incoming connections are combined in a weighted sum; after an activation function is applied, the resulting value is sent along the outgoing connections, where it contributes to the weighted sums of the next nodes.
Optimizers: Methods that control how a neural network's weights are updated and can speed up the training process. One common example is momentum, where the update direction is computed as a rolling average of the current gradient and past gradients.
P-value interpretation: The probability of observing a particular statistic (or a value that is more extreme) under the assumption that the null hypothesis is true.
Principal component analysis: An unsupervised dimensionality reduction technique that transforms the original features into a set of components. The components form an orthonormal basis, ordered so that each successive component captures the maximum remaining variance in the data; keeping only the first few components retains most of the variation in the original data.
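A brief scikit-learn sketch on random synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)             # keep the top two components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```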
Prophet: A framework written by Meta that makes it easier to model time series data. It provides functionality for forecasting, seasonality, and evaluation.
PyTorch: A Python framework used to design and train deep neural networks. Specifically, it implements a tensor library that works with GPUs. These can accelerate computation significantly. It also has built-in functionality to calculate gradients, perform backpropagation, and use optimizers. It has layers of abstraction for various types of neural layers and has more recently begun to support distributed training.
Random forests: An ensemble algorithm that combines the predictions of multiple decision trees to improve performance on unseen data. Since decision trees tend to overfit the training data, random forests take a random subset of features and examples. This ensures that the algorithm doesn’t rely too heavily on a particular feature or example to learn its decision boundaries.
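A brief scikit-learn sketch on synthetic data, which also illustrates the impurity-based importances described in the “Feature importance” entry above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based feature importances, aggregated across all trees.
print(forest.feature_importances_.round(3))
```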
Regularization: Techniques whose purpose is to reduce overfitting on training data and improve performance on unseen data. Regularization is used when a model fits its training data too closely and performs poorly on unseen data. Types of regularization techniques include lasso and ridge.
Resize: Another image preprocessing technique that changes the size of the images in a dataset. This is often necessary because neural network models are configured for a fixed input size: the dimensions of the input image must match what the network's input layer expects.
Ridge regression: Another variant of linear regression that introduces a weight penalty to the loss function. In this case, the L2 norm of the weights is added, which shrinks the learned coefficients toward smaller values but not to zero. Like lasso regression, it is used for regularization; unlike lasso, it does not perform variable selection, since coefficients are never set exactly to zero.
Seasonality: In time series data, this refers to predictable patterns that repeat over a fixed period, such as daily, weekly, or yearly cycles.
SHapley Additive exPlanations (SHAP) values: A concept, based in game theory, that can be used to compute feature importance. It models features as “players” and predictions as “payout.” It provides instance-level understanding of a model’s behavior, and can be used for interpreting individual predictions and features in a model.
Single decision tree: A model that uses a series of decisions to iteratively split the training data and eventually predict the outcome.
Stepwise regression: Refers to selecting variables for a regression model in a series of steps. On each iteration, certain variables will be selected based on statistical significance. Then, a new model will be trained on those variables. This can be done multiple times to develop a more accurate model.
Supervised learning: A type of machine learning that predicts an outcome based on known input variables. Supervised learning algorithms are trained on known outcome data, which is how they learn to detect relationships between the known inputs and outcomes.
The 5 assumptions: Refers to the five assumptions made when using linear regression.
- There’s a linear relationship between the input variables and the outcome. Various methods, including scatter plots and statistical methods such as the correlation coefficient and analysis of variance (ANOVA), can test this.
- Observations are independent of each other. One way to test this is to plot the residuals in the order the data was collected and look for patterns; formal tests such as Durbin-Watson can also be used.
- Residuals are normally distributed. This can be tested by plotting a histogram or Q-Q plot of the residuals.
- Residuals have a constant variance across all values of the input variables. This can be tested by plotting residuals against each independent variable and seeing if the spread looks relatively similar.
- Residuals are independent of each other. This can be tested by plotting residuals against different examples and seeing if there are any patterns.
When these assumptions are violated, the following can happen:
- Predicted outcomes from the model may be incorrect.
- Any interpretation of the coefficients of the model may be misleading. Moreover, results from hypothesis testing on the coefficients of these independent variables may be incorrect.
- Confidence intervals constructed from the model may be incorrect.
Tree-based models: Machine learning models that are based on decision trees. Decision trees use a series of decisions to iteratively split the training data and eventually predict the outcome. These algorithms can split the data in different ways, but it’s common to use criteria based on information theory, such as the Gini index or entropy.
Unsupervised learning: A type of machine learning that only finds patterns based on the features of the input. It does not learn based on associated outcomes. A common example is clustering.
Variable selection: Choosing variables for a model based on certain criteria. Typically, these are chosen based on whether those variables are more important in predicting the outcome.
Variable usefulness: Supervised machine learning algorithms often have ways of determining which input variables are the most important in predicting outcomes. Assessing which variables are more important can be helpful in many cases, including variable selection and general statistical analysis.
Visit our Model & Algorithm Fundamentals module to learn all you need to know about the most commonly tested models and algorithms in interviews.
Evaluation methods and metrics
Accuracy: A classification metric that measures the total number of correctly predicted examples divided by the total number of examples in the dataset. Accuracy is not a suitable metric for datasets with class imbalance. For example, a spam classifier could achieve a very high accuracy on a dataset by predicting all emails as non-spam (since most emails in a given dataset will be non-spam). This would be misleading, as it does not accurately gauge performance on how well it predicts spam emails.
Area under the ROC Curve (AUC): A classification metric that measures the total area under the ROC curve for a classification model. A larger area means that the slopes of the curve trend upwards, which means that changes in threshold lead to a larger change in true positive rate compared to false positive rate. Thus, a larger area indicates better performance.
Class imbalance: In classification problems, this refers to a significant skew in the proportions of each class in the training data. Training on imbalanced data can result in models that only perform well on the classes that occur frequently. Strategies to handle class imbalance include oversampling, undersampling, and SMOTE, although these do not always work well because they may not actually introduce new information to the training set. SMOTE attempts to add diversity by synthesizing new minority-class points from existing ones, but it empirically shows mixed results depending on the dataset.
F1 score: A classification metric that measures the harmonic mean of the precision and recall. This is used to combine both precision and recall into one metric.
Mean absolute error (MAE): A regression metric that measures the mean of the absolute values of the differences between the predicted and actual values. It’s appropriate to use MAE over MSE when your data contains outliers, because squaring the errors in MSE amplifies the impact of the large errors that outliers typically produce.
Mean squared error (MSE): A regression metric measuring the average of the squared differences between the predicted and actual values.
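For example, computing both regression metrics with scikit-learn on made-up values:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print(mean_absolute_error(y_true, y_pred))  # 0.75
print(mean_squared_error(y_true, y_pred))   # 0.875
```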
Precision: A classification metric that measures how many positive examples the model predicted correctly out of all the examples it predicted as positive. It is calculated by dividing the true positives by the sum of the true positives and false positives.
Receiver operating characteristic curve (ROC): A curve that plots the true positive rate (TPR) and false positive rate (FPR) at various classification thresholds. TPR is on the y-axis and FPR is on the x-axis. TPR is equal to the recall, and FPR is defined by the false positives divided by the total number of negative examples. The curve is used to gauge the classifier's behavior at different thresholds and overall performance.
Recall: A classification metric that measures how many positive examples the model predicted correctly out of all the positive examples in the dataset. It is calculated by dividing the true positives by the sum of the true positives and false negatives.
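A quick scikit-learn sketch computing the classification metrics above on a toy set of labels (the values are made up):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0]               # hard class predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]   # predicted probabilities

print(accuracy_score(y_true, y_pred))   # 4/6 correct
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))   # area under the ROC curve
```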
Ready to dive deeper? Check out our lesson on Supervised Model Evaluation.
ML in production
A/B testing: An experimentation process used to compare two model versions. There is typically a control group and a test group, and a metric of interest is measured for each group. Hypothesis testing can be used to gauge whether or not there is a significant difference in the metric for the two groups. This is a very common method of evaluating machine learning models in production and is typically used to decide which model to deploy.
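As a rough sketch (not a definitive recipe), here's a two-proportion z-test on hypothetical conversion counts using statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical conversions and sample sizes for control and test groups.
conversions = [120, 145]
samples = [1000, 1000]

stat, p_value = proportions_ztest(conversions, samples)
print(p_value)  # a small p-value suggests a significant difference between groups
```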
Bandits: Algorithms that help balance between exploration (looking at different paths with different potential rewards) and exploitation (going further down a path to maximize reward). In machine learning, bandits can be used to evaluate and determine the best model in production.
Batch prediction: A type of model inference that happens asynchronously, rather than in response to a request. These can be triggered as part of a pipeline or scheduled as jobs at a specific time and frequency. For example, a company may have a batch job that precomputes embeddings for each user on its platform. These embeddings would be expensive to compute in real time and may not depend on features supplied at request time (e.g. a user’s age and gender are known beforehand).
Canary release: A deployment technique used to evaluate a new model without heavily impacting users. The new model gets deployed in parallel with the existing model, and only a portion of the user traffic will be routed to the new model. The amount of traffic served can be adjusted over time depending on the new model’s performance.
Concept drift: A type of distribution shift that occurs when the distribution of the dependent variable given the input features changes. In other words, the conditional probability distribution that the machine learning model approximates has changed, even if the distribution of the inputs stays the same.
Containers: A unit of software that encapsulates the dependencies, source code, and configuration of a service. Containers are used to manage the environment more cleanly and isolate it from other pieces of software in the system.
Covariate shift: A type of distribution shift that refers to when there is a change in the distribution of a model’s input features.
Edge computing: Refers to moving the necessary computations to the “edge.” Computing resources are used on the consumer device that uses the machine learning model. This could be a mobile device, laptop, or embedded device.
Feature flags: Used to enable features or models in production. They are useful with A/B testing and canary releases.
Kubernetes: An open-source system for deploying and scaling containers. It can be used to efficiently manage resources, build complex applications, and monitor existing services.
Label shift: A type of distribution shift that occurs when there is a change in the distribution of the model’s dependent variable.
Model compression: Making a machine learning model smaller to improve the speed at which a model makes predictions in production. Common methods include low-rank factorization, quantization, and knowledge distillation.
Model drift: Refers to model performance degrading in production over time, due to the change in the relationship between the input features and outcome.
Model retraining: Training a model further to rectify degraded performance in production. This could be done offline or in an automated fashion. Usually, the model will be trained further on new data that has been collected.
Monitoring: Watching metrics and behavior of a machine learning model while it is in production. This could involve looking at distributions of input features and predictions and monitoring inference speed.
Prediction service: A component of a pipeline or system that serves predictions from a machine learning model.
Real-time prediction: A type of model inference served quickly in response to a request. For example, serving search ranking results in response to a customer query.
Shadow deployment: A deployment technique used to evaluate a new model without impacting the users. The new model gets deployed in parallel with the existing model, and predictions are logged but not served to users.
Telemetry data: Collection of data, logs, and metrics in a system that is used for monitoring the system.
See something missing from this list? Email us at [email protected], and we'll add the word to this page!