Predict Harmful Text
Dataset
The dataset comprises a collection of tweets, each annotated to indicate whether it includes harmful content. The label '1' signifies harmful content, while '0' denotes content that is not harmful. To proceed, download the dataset and employ it within your .ipynb (Jupyter Notebook) environment to train and refine your model.
Here is a preview of the dataset structure:
- The 'tweet' column contains the tweet text.
- The 'harmful' column contains the label, where '1' corresponds to "harmful" and '0' corresponds to "not harmful".
Notice the text comes with corresponding labels, making it a supervised machine learning (ML) task.
As this involves textual data, text preprocessing is essential—this includes steps like tokenization and encoding to convert text into a format suitable for ML algorithms.
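For instance, here is a minimal sketch of tokenization and encoding using the HuggingFace BertTokenizer (the same tokenizer used in the solution below; the sample sentence is purely illustrative):
Python
# Minimal sketch: tokenization and encoding convert raw text into integer IDs
# that an ML model can consume.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoded = tokenizer("This tweet is fine", padding=True, return_tensors="np")
print(encoded["input_ids"])  # integer token IDs, e.g. [[101, ..., 102]]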
Consider the nature of the problem. Since the target variable has two possible outcomes, we're dealing with a binary classification task.
When it comes to text-based tasks, what models come to mind? Deep learning models like BERT and GPT are prominent choices due to their powerful language understanding capabilities.
Prepare your deep learning model. Ensure to split your data into training, validation, and testing subsets, choose a fitting loss function such as binary cross-entropy for binary outcomes, select an optimizer like Adam for efficient learning, and construct your neural network layers thoughtfully.
Decide on evaluation metrics that will effectively measure your model's performance. Precision, recall, and accuracy are standard metrics, but considering the specifics of the task, such as the cost of false positives versus false negatives, can guide you in emphasizing the most relevant metric for your particular problem.
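As a quick illustration of these metrics, here is a hedged sketch on a toy batch using the Keras metric objects that appear in the solution below (note that BinaryAccuracy, unlike plain Accuracy, thresholds probabilities at 0.5):
Python
# Toy sketch: scoring thresholded probabilities with Keras metrics.
import numpy as np
from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy

y_true = np.array([1, 0, 1, 0])          # ground-truth labels
y_prob = np.array([0.9, 0.4, 0.2, 0.8])  # predicted probabilities

for metric in (Precision(), Recall(), BinaryAccuracy()):
    metric.update_state(y_true, y_prob)  # probabilities are thresholded at 0.5
    print(metric.name, metric.result().numpy())
# precision 0.5, recall 0.5, binary_accuracy 0.5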
Please note that this is just one way to perform this ML task. By following the steps outlined in the hints above, you could derive a solution that looks quite different from the one below; e.g., you may use other encoding/tokenization methods for text preprocessing. You may also use a different model with different layer configurations (if applicable).
Python
import pandas as pd
import numpy as np

# Load dataset and remove an unused index column
data_df = pd.read_csv("tweets_flagged_v2.csv")
data_df = data_df.drop(columns=["Unnamed: 0"])

# Display the first 5 rows to check the data
data_df.head(5)

# Display the distribution of the 'harmful' column values
data_df["harmful"].value_counts()

# Randomly sample 10 tweets and their 'harmful' status for inspection
for _ in range(10):
    random_ind = np.random.randint(0, len(data_df))
    random_data = data_df.iloc[random_ind]
    print(random_data["tweet"], random_data["harmful"])

# Preprocess text data
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Convert tweets and labels to NumPy arrays for processing
X = data_df["tweet"].values
y = data_df["harmful"].values

# Tokenize tweets, pad to a uniform sequence length, and return TensorFlow tensors
sequences = [sequence for sequence in X]
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

import tensorflow as tf

# Create a TensorFlow dataset from the tokenized inputs and labels
dataset = tf.data.Dataset.from_tensor_slices((model_inputs["input_ids"], y))

# Optimize the dataset by caching, shuffling, batching, and prefetching
dataset = dataset.cache()
dataset = dataset.shuffle(160000)  # buffer should cover the dataset for a full shuffle
dataset = dataset.batch(16)
dataset = dataset.prefetch(8)

# Split the dataset into training, validation, and testing sets in a 70:20:10 ratio
train = dataset.take(int(len(dataset) * .7))
val = dataset.skip(int(len(dataset) * .7)).take(int(len(dataset) * .2))
test = dataset.skip(int(len(dataset) * .9)).take(int(len(dataset) * .1))

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

# Define a sequential model architecture
model = Sequential(name="text-classifier")
model.add(Embedding(len(tokenizer.get_vocab()), 32))   # Embedding layer
model.add(Bidirectional(LSTM(32, activation='tanh')))  # Bidirectional LSTM layer
model.add(Dense(128, activation='relu'))               # Dense layer with ReLU activation
model.add(Dense(256, activation='relu'))               # Another Dense layer with ReLU
model.add(Dense(128, activation='relu'))               # Additional Dense layer with ReLU
model.add(Dense(1, activation='sigmoid'))              # Output layer with sigmoid for binary classification

# Display model summary
model.summary()

# Compile the model with binary cross-entropy loss and the Adam optimizer
model.compile(loss="binary_crossentropy", optimizer="Adam")

# Train the model on GPU; the dataset is already batched, so batch_size is set there
with tf.device("/device:GPU:0"):
    history = model.fit(train, epochs=1, validation_data=val)

from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy

# Initialize precision, recall, and accuracy metrics
# (BinaryAccuracy thresholds predicted probabilities at 0.5; plain Accuracy would not)
pre = Precision()
rec = Recall()
acc = BinaryAccuracy()

# Evaluate the model on the test dataset and update the metrics
for batch in test.as_numpy_iterator():
    x_true, y_true = batch
    y_hat = model.predict(x_true)  # Predict on the test batch
    # Update precision, recall, and accuracy metrics based on predictions
    pre.update_state(y_true, y_hat)
    rec.update_state(y_true, y_hat)
    acc.update_state(y_true, y_hat)

# Print precision, recall, and accuracy values
print("precision", pre.result().numpy())
print("recall", rec.result().numpy())
print("accuracy", acc.result().numpy())
Below is a supplemental written solution that utilizes our interview framework and has been checked for completeness and accuracy. Please note that the filmed mock interview video mimics a real-world interview as closely as possible and may not represent a complete solution. Remember, every candidate’s technical skills and approach are unique. In this session, the interviewer provides additional feedback about monitoring the model, detecting overfitting, and describing the BERT model.
Introduction
In this mock interview, Jayaram (MLE @ Delivery Hero) is given a dataset of Twitter tweets and prompted to build a system that classifies harmful text.
Step 1: Understand the problem
To start, we’ll ask some clarifying questions:
- Does the dataset have any features?
- Does the dataset contain labels? This clarifies whether the ML task is supervised or unsupervised.
- Does the dataset contain more than 2 labels? This clarifies whether this is a binary or multi-class classification problem.
- How often should the predictions be computed?
- Do we want one model per geography? This helps scope the training overhead, data storage needs, and applicable data privacy laws.
- Will we use predictions for downstream tasks? If so, we may want to generate prediction probabilities.
- Are we okay if not-harmful text is marked as harmful? The cost of false positives informs how we'll select evaluation metrics (e.g., accuracy vs. precision).
- Is there a model already in production? This informs how we'll set up A/B testing, a rollout strategy, and key business metrics.
Step 2: Discuss the approach
We’ll build a text classifier through multiple components:
- Text vectorization
- Deep learning model
- Inference component
Text vectorization
We’ll convert raw text to a sequence of integers/floats, which the ML algorithm will use to derive patterns and make predictions. To do so, we’ll complete the following steps:
- Pre-process the text by removing punctuation and stop words, lemmatizing, and standardizing.
- Build a vocabulary, or map, that stores each word or token against an integer.
- Select a feature size for every token in the vocabulary, which captures contextual understanding of every token present in the corpus.
- Pad sequences with zeros so the training data is of equal shape.
The tokenization, vocabulary mapping, and padding steps can all be handled by a HuggingFace tokenizer. We’ll use the BERT tokenizer (cased), which is based on subword tokenization.
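For instance, the cased BERT tokenizer splits rare words into subword pieces (marked with '##') and pads shorter sequences to a common length; a small illustrative sketch:
Python
# Sketch: BERT subword tokenization and padding on two toy tweets.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# Rare words are broken into subword pieces, e.g. ['un', '##bel', ...]
print(tokenizer.tokenize("unbelievably rude"))
inputs = tokenizer(["short tweet", "a slightly longer tweet here"],
                   padding=True, return_tensors="tf")
print(inputs["input_ids"].shape)  # both rows padded to the same length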
Deep learning model
We’ll build a deep learning model using the TensorFlow Keras Sequential API. The embedding layer will learn token representations during training; this layer can also be exported and used standalone in future tasks. We’ll add a bidirectional long short-term memory (LSTM) layer: TensorFlow’s LSTM uses the GPU-accelerated cuDNN kernel by default when its arguments (such as the tanh activation) meet cuDNN’s requirements, and the bidirectional wrapper lets the layer process the sequence in both directions. The dense layers with ReLU activations will capture non-linearities. The final layer of size 1 will use a sigmoid activation to squash values into [0, 1]. We’ll use the binary cross-entropy loss function and the Adam optimizer.
Inference component
We’ll train on the training dataset over 10 epochs with a batch size of 64. The data will be split into train, validation, and test sets in a 70:20:10 ratio. Then, we’ll observe the training vs. validation loss.
Classifier Class Signature
Python
class TextClassifier:
    def __init__(self, data: pd.DataFrame):
        self.data = data
        self.model = Sequential()
        self.vectoriser = TextVectorization()
        ...

    def preprocess(self, target_column: str) -> pd.DataFrame:
        pass

    def train(self, num_epochs: int, batch_size: int):
        pass

    def infer(self, text: str) -> str:
        pass
Step 3: Implement the algorithm
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional
from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy
from transformers import BertTokenizer


class TextClassifier:
    def __init__(self, data: pd.DataFrame):
        self.data = data
        self.model = Sequential()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
        self.vectorized_text = None

    def preprocess(self, target_column: str):
        self.X = self.data['tweet']
        self.y = self.data[target_column].values
        sequences = [sequence for sequence in self.X.values]
        # Tokenize, pad to a uniform length, and return TensorFlow tensors
        model_inputs = self.tokenizer(sequences, padding=True, return_tensors="tf")
        self.vectorized_text = model_inputs["input_ids"]

    def _prepare_training_dataset(self, vectorized_text: tf.Tensor, batch_size: int):
        dataset = tf.data.Dataset.from_tensor_slices((vectorized_text, self.y))
        dataset = dataset.cache()
        dataset = dataset.shuffle(160000)  # buffer should cover the full dataset
        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(8)  # helps avoid input-pipeline bottlenecks
        # 70:20:10 train/validation/test split
        train = dataset.take(int(len(dataset) * .7))
        val = dataset.skip(int(len(dataset) * .7)).take(int(len(dataset) * .2))
        test = dataset.skip(int(len(dataset) * .9)).take(int(len(dataset) * .1))
        return train, val, test

    def _build_model(self):
        # Embedding layer sized to the tokenizer vocabulary
        self.model.add(Embedding(len(self.tokenizer.get_vocab()) + 1, 32))
        # Bidirectional LSTM layer
        self.model.add(Bidirectional(LSTM(32, activation='tanh')))
        # Feature-extractor fully connected layers
        self.model.add(Dense(128, activation='relu'))
        self.model.add(Dense(256, activation='relu'))
        self.model.add(Dense(128, activation='relu'))
        # Final layer
        self.model.add(Dense(1, activation='sigmoid'))
        self.model.compile(loss='binary_crossentropy',
                           optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                           metrics=["accuracy"])

    @property
    def get_model_summary(self):
        return self.model.summary()

    @property
    def get_training_history(self):
        return self.history

    def train(self, epochs: int = 10, batch: int = 16):
        # The dataset is batched during preparation, so batch_size is not passed to fit
        train, val, test = self._prepare_training_dataset(
            vectorized_text=self.vectorized_text, batch_size=batch)
        self._build_model()
        # uncomment when using a GPU
        # with tf.device('/device:GPU:0'):
        self.history = self.model.fit(train, epochs=epochs, validation_data=val)
        precision, recall, accuracy = self.evaluate(test=test)
        print(f"Precision: {precision.result().numpy()}, "
              f"Recall: {recall.result().numpy()}, "
              f"Accuracy: {accuracy.result().numpy()}")

    def evaluate(self, test):
        pre = Precision()
        rec = Recall()
        # BinaryAccuracy thresholds probabilities at 0.5; plain Accuracy would not
        acc = BinaryAccuracy()
        for batch in test.as_numpy_iterator():
            # Unpack the batch
            X_true, y_true = batch
            # Make a prediction
            yhat = self.model.predict(X_true)
            # Flatten the labels and predictions
            y_true = y_true.flatten()
            yhat = yhat.flatten()
            pre.update_state(y_true, yhat)
            rec.update_state(y_true, yhat)
            acc.update_state(y_true, yhat)
        return pre, rec, acc


# load the dataset
df = pd.read_csv('tweets_flagged.csv')
clf = TextClassifier(data=df)
clf.preprocess(target_column='harmful')
clf.train(epochs=10, batch=64)
clf.model.save('harmful_text_classifier.h5')
Step 4: Test code & discuss results
Python
# load the model
model = tf.keras.models.load_model('harmful_text_classifier.h5')

# `text` is a raw tweet string to classify, e.g.:
text = "some tweet to classify"
vectorised_text = clf.tokenizer([text], padding=True, return_tensors="tf")['input_ids']
res = model.predict(vectorised_text)
if res[0][0] >= 0.5:
    print(res, "Harmful")
else:
    print(res, "Not Harmful")
Now that we have the results, we’ll review the following:
- Training vs. validation loss. We’ll check whether the values converge over epochs, and whether both the loss and val_loss improve on every epoch (a plotting sketch follows this list).
- Signs of overfitting and underfitting. In cases of overfitting, we can utilize techniques like regularization and batch normalization. In cases of underfitting, we might check whether the features are sufficient, consider better text-encoding techniques, or use pre-trained embeddings (e.g., word2vec, GloVe).
- Test runs to verify that predictions are accurate.
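To inspect the training vs. validation loss mentioned above, here is a minimal sketch using the matplotlib import from the implementation; it reads the Keras History object stored on the classifier:
Python
# Hedged sketch: plot training vs. validation loss from the Keras History
# object that model.fit returns (stored as clf.history above).
import matplotlib.pyplot as plt

history = clf.get_training_history  # property defined on TextClassifier
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy loss")
plt.legend()
plt.show()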
To improve the model, we can consider the following:
- Enable the model to capture contextual information across positive and negative words.
- Train the model on more epochs.
- Improve the text vectorizer and pre-processing data pipeline.
- Use transfer learning techniques (see the sketch after this list).
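As one hedged example of the transfer-learning idea, we could fine-tune a pretrained BERT encoder with a classification head instead of training the Embedding + LSTM model from scratch. This is only a sketch, not part of the filmed solution; it assumes X (tweet texts) and y (labels) as defined in the first code block:
Python
# Sketch: transfer learning by fine-tuning a pretrained BERT classifier.
# Assumes X = data_df["tweet"].values and y = data_df["harmful"].values.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert_clf = TFBertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)

inputs = tokenizer(list(X), padding=True, truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(inputs), y)).batch(16)

bert_clf.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # small LR for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
bert_clf.fit(dataset, epochs=1)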
Interview improvements
Some opportunities for improvement in the interview include:
- Suggesting different approaches for the data science algorithm instead of jumping directly to neural networks and LSTMs. We could discuss each candidate algorithm’s drawbacks for this particular problem.
- Justifying the chosen evaluation metrics with more clarity. We could have started with why accuracy is a good fit for this particular case, and what the downsides of accuracy are in cases of imbalanced data.
- Explaining the attention mechanism more effectively. We could have started with what LSTMs lack, especially when learning from long sentences: LSTMs tend to focus on words near the end of a sequence, so information from earlier in the sentence can be lost. A better way to assess a sentence is to focus on its most important parts by assigning attention weights (see the sketch below).
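To make that last point concrete, here is a minimal illustrative sketch (NumPy, not from the interview) of scaled dot-product attention, the weighting operation described above:
Python
# Minimal sketch of scaled dot-product attention: each output position is a
# weighted average of all value vectors, with weights from query-key similarity.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V, weights

# Toy example: 3 tokens with 4-dimensional representations (self-attention)
tokens = np.random.rand(3, 4)
output, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(attn)  # each row sums to 1: how much each token attends to every token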