GenAI End-to-End
The intent of this blog is to present a very high-level explanation of the end-to-end processing involved in a typical interaction with a GenAI model:
- User: The process begins with the user, who initiates the AI interaction. This usually happens through a chat interface such as ChatGPT.
- Generate Prompt: The user generates a prompt: the input they want the AI to process. A prompt is generally a text statement written in the user's natural language; handling such language is the domain of natural language processing (NLP) in the AI/ML context.
- Pre-Processing: Before tokenisation, there may be a pre-processing step where the input is cleaned or formatted. Sometimes this also involves checks for sensitive content or bias in the context.
- Tokenisation: The input is tokenised into manageable pieces, such as words or sub-words.
- Generate Embedding: The tokens are converted into embeddings, vector representations that capture semantic meaning.
- Positional Encoding: Positional encodings are added to the embeddings to give the model information about the order of the tokens. These steps are part of the AI service processing; a toy sketch of tokenisation, embedding, and positional encoding follows this list.
- Core Transformer Architecture: The embeddings are passed into the core Transformer model, which in the original architecture consists of an encoder and a decoder.
- Attention Layer (Encoder): In the encoder, the attention mechanism helps the model
to focus on different parts of the input sequence.
- Feed-Forward Networks: Inside the encoder and decoder are feed-forward networks in
addition to the attention layers.
- Multi-Head Attention: The attention mechanism is often multi-headed, meaning it runs several attention processes in parallel; a minimal attention sketch appears after this list.
- Attention Layer (Decoder): The decoder's attention mechanisms help generate the output by attending to the partially generated output so far (self-attention) and to the relevant parts of the encoded input (cross-attention).
- Output Embedding: The decoder's output is transformed into an output embedding.
- Softmax Layer: A softmax layer converts the output embeddings into a probability distribution over the vocabulary for the next token; see the softmax sketch after this list.
- Post-Processing: The output may be post-processed to refine it, such as correcting
grammar or style.
- Output: The processed output is delivered back to the user.
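
To make the first few steps concrete, here is a minimal sketch of tokenisation, embedding lookup, and positional encoding. The tiny vocabulary, the whitespace tokeniser, and the embedding size `d_model = 8` are all invented for illustration; real systems use learned sub-word tokenisers (such as BPE) and learned embedding tables.

```python
import numpy as np

# Toy vocabulary and whitespace tokeniser (hypothetical; real models use
# sub-word schemes such as BPE or SentencePiece).
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenise(text):
    # Map each whitespace-separated word to its vocabulary id.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

d_model = 8  # embedding size, chosen arbitrarily for this sketch
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

def positional_encoding(seq_len, d_model):
    # Fixed sinusoidal encoding, as in the original Transformer paper.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

ids = tokenise("The cat sat on the mat")
x = embedding_table[ids]                          # token embeddings
x = x + positional_encoding(len(ids), d_model)    # inject token-order info
print(ids)      # [1, 2, 3, 4, 1, 5]
print(x.shape)  # (6, 8)
```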
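
The attention mechanism itself is compact. The sketch below implements scaled dot-product attention and a simple multi-head wrapper; the random projection matrices stand in for weights that a real model learns during training, and the head count is arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(x, num_heads, rng):
    # Split d_model across heads, attend in parallel, then concatenate.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head Q/K/V projections (learned weights in a real model).
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv))
    Wo = rng.normal(size=(d_model, d_model))  # output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))   # 6 tokens with d_model = 8, e.g. from above
out = multi_head_attention(x, num_heads=2, rng=rng)
print(out.shape)              # (6, 8)
```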
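
And the softmax step that turns the decoder's raw scores (logits) into next-token probabilities; the logits below are invented for the six-word toy vocabulary:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax: probabilities that sum to 1.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits the decoder might produce over the vocabulary.
logits = np.array([0.2, 1.5, -0.3, 3.1, 0.0, 0.7])
probs = softmax(logits)
next_token = int(np.argmax(probs))  # greedy choice of the next token
print(probs.round(3), next_token)
```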
Additional Considerations:
- Layer Normalization and Residual Connections: These are applied around each attention and feed-forward sub-layer to stabilise the training of deeper models.
- Training Process: The model adjusts its parameters based on a large dataset during
the training phase.
- Parameter Optimization: The model parameters are adjusted through optimisation
techniques to minimise the loss function.
- Loss Calculation: The model calculates the loss to measure the difference between its predictions and the actual outputs during training; a toy training loop combining loss calculation and parameter optimisation follows this list.
- Beam Search or Sampling: The model may use techniques like beam search or sampling to generate the most likely next tokens during decoding; a minimal beam-search sketch also appears below.
- Hyperparameters and Model Configuration: The model's configuration, such as the
number of layers, embedding sizes, and hyperparameters, is crucial for performance but not shown in
the flowchart.
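
To show how loss calculation and parameter optimisation fit together, here is a toy training loop for a linear next-token classifier with cross-entropy loss and plain gradient descent. The data, shapes, and learning rate are all invented; real models train on vast datasets with optimisers such as Adam.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))     # 32 training examples, 8 features each
y = rng.integers(0, 5, size=32)  # target token ids from a 5-token vocabulary
W = np.zeros((8, 5))             # model parameters
lr = 0.1                         # learning rate (a hyperparameter)

for step in range(100):
    # Forward pass: logits -> softmax probabilities.
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Loss calculation: cross-entropy between predictions and targets.
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    # Parameter optimisation: gradient descent on W.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1
    W -= lr * (X.T @ grad_logits) / len(y)

print(f"final loss: {loss:.3f}")  # decreases as training progresses
```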
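
And a minimal beam-search sketch: at each step it extends every partial sequence with candidate next tokens and keeps only the highest-scoring beams. The `toy_step` function is a hypothetical stand-in for the decoder's softmax output.

```python
import math

def beam_search(step_fn, start, beam_width=3, max_len=5):
    # Keep the beam_width best partial sequences by cumulative log-probability.
    beams = [(start, 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # step_fn(seq) yields (token, log_prob) pairs for next tokens.
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

# Hypothetical next-token distribution standing in for the model's softmax.
def toy_step(seq):
    return [(0, math.log(0.5)), (1, math.log(0.3)), (2, math.log(0.2))]

for seq, score in beam_search(toy_step, start=[]):
    print(seq, round(score, 2))
```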