Generative AI - Practical Coding: Implementing Transformers

Practical Coding: Implementing Transformers

Using TensorFlow and Keras, we will take a simple approach to building a Transformer model from scratch. The Transformer does not rely on recurrent structures to capture relationships between inputs; it uses attention mechanisms instead. This makes it well suited to sequence-to-sequence tasks such as translation and text generation.

1. Import Libraries

import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np

We import TensorFlow and Keras for building the model, and NumPy for numerical operations.

2. Define the Scaled Dot-Product Attention

class ScaledDotProductAttention(layers.Layer):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def call(self, query, key, value, mask=None):
        matmul_qk = tf.matmul(query, key, transpose_b=True)
        dk = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, value)

        return output, attention_weights

This class implements the scaled dot-product attention mechanism. It computes the attention scores, scales them by the square root of the key dimension, and applies a softmax to obtain the attention weights. The output is the weighted sum of the value vectors.
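
As a quick sanity check, here is a small usage sketch of the layer above; the tensor shapes (batch of 2, sequence length 5, key dimension 8) are arbitrary values chosen only for illustration.

# Usage sketch of ScaledDotProductAttention (arbitrary example shapes).
q = tf.random.uniform((2, 5, 8))   # (batch_size, seq_len, depth)
k = tf.random.uniform((2, 5, 8))
v = tf.random.uniform((2, 5, 8))
out, weights = ScaledDotProductAttention()(q, k, v)
print(out.shape)      # (2, 5, 8) -> weighted sum of the value vectors
print(weights.shape)  # (2, 5, 5) -> one weight per query-key pair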

3. Define the Multi-Head Attention

class MultiHeadAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        scaled_attention, attention_weights = ScaledDotProductAttention()(q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        output = self.dense(concat_attention)

        return output, attention_weights

This class implements the multi-head attention mechanism. It splits the projected input into several heads, applies scaled dot-product attention to each head in parallel, and then concatenates the results. This allows the model to attend to information from different representation subspaces at the same time.
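
A short usage sketch of this layer is shown below; the configuration (d_model=128, 8 heads, batch of 4, sequence length 10) is chosen purely for illustration.

# Usage sketch of MultiHeadAttention (illustrative configuration).
mha = MultiHeadAttention(d_model=128, num_heads=8)
x = tf.random.uniform((4, 10, 128))      # (batch_size, seq_len, d_model)
out, attn = mha(x, x, x, mask=None)      # self-attention: v, k, and q all equal x
print(out.shape)   # (4, 10, 128)
print(attn.shape)  # (4, 8, 10, 10) -> one attention map per head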

4. Define the Encoder Layer

class EncoderLayer(layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = tf.keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])

        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2

This class defines a single encoder layer, consisting of multi-head self-attention followed by a position-wise feed-forward network. Layer normalization and dropout are applied around each sub-layer for regularization and training stability.
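
A minimal usage sketch of the encoder layer, with an illustrative configuration, shows that the input and output shapes match:

# Usage sketch of EncoderLayer (illustrative configuration).
enc_layer = EncoderLayer(d_model=128, num_heads=8, dff=512)
x = tf.random.uniform((4, 10, 128))             # (batch_size, seq_len, d_model)
out = enc_layer(x, training=False, mask=None)
print(out.shape)  # (4, 10, 128) -> the layer preserves the input shape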

5. Define the Transformer Encoder

class Encoder(layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = self.positional_encoding(maximum_position_encoding, d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = layers.Dropout(rate)

    def positional_encoding(self, position, d_model):
        angle_rads = self.get_angles(np.arange(position)[:, np.newaxis], np.arange(d_model)[np.newaxis, :], d_model)
        sines = np.sin(angle_rads[:, 0::2])
        cosines = np.cos(angle_rads[:, 1::2])
        pos_encoding = np.concatenate([sines, cosines], axis=-1)
        pos_encoding = pos_encoding[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
        return pos * angle_rates

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x

This class defines the Transformer encoder. It includes the embedding and positional encoding layers and stacks several encoder layers, each with its own attention and feed-forward sub-layers.
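
As a small check of the positional encoding, the sketch below builds an encoder with an illustrative configuration and prints the shape of the precomputed encoding table:

# Sketch: inspect the precomputed positional encoding (illustrative settings).
enc = Encoder(num_layers=2, d_model=128, num_heads=8, dff=512,
              input_vocab_size=8500, maximum_position_encoding=10000)
print(enc.pos_encoding.shape)  # (1, 10000, 128) -> one d_model-sized vector per position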

6. Instantiate and Test the Encoder

# Parameters
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
input_vocab_size = 8500
maximum_position_encoding = 10000

encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding)

# Dummy input for demonstration
sample_input = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)

# Model Call
sample_output = encoder(sample_input, training=False, mask=None)
print(sample_output.shape)  # (batch_size, input_seq_len, d_model)

Explanation:

Parameters: Set the number of layers, model dimension, feed-forward dimension, number of heads, input vocabulary size, and maximum position encoding.

Encoder: Create an instance of the encoder with the given settings.

Dummy Input: Create a random batch of token IDs to demonstrate the encoder.

Model Call: Run the sample input through the encoder and print the output shape, which should be (batch_size, input_seq_len, d_model).
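
Since mask=None is passed above, no positions are masked out. Below is a minimal sketch, assuming token ID 0 represents padding, of how a padding mask could be built and supplied to the encoder; the helper name create_padding_mask is illustrative and not part of the original code.

# Sketch: build a padding mask (assumption: token ID 0 marks padding;
# create_padding_mask is a hypothetical helper, not part of the code above).
def create_padding_mask(seq):
    # 1.0 where the token is padding (ID 0), 0.0 elsewhere.
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # Reshape to (batch_size, 1, 1, seq_len) so the mask broadcasts over
    # the heads and query positions inside the attention layers.
    return mask[:, tf.newaxis, tf.newaxis, :]

padding_mask = create_padding_mask(sample_input)
masked_output = encoder(sample_input, training=False, mask=padding_mask)
print(masked_output.shape)  # (64, 62, 128)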

Working through these steps gives you a basic Transformer encoder built from scratch and a clear picture of its main components and how they fit together. This hands-on exercise is a solid foundation for understanding Transformers and their powerful attention mechanisms.

