Understanding Natural Language Processing (NLP): The Bridge Between Human Language and Artificial Intelligence


📗 Chapter 3: Language Modeling and Vector Representations

How Machines Learn and Represent Human Language


🧠 Introduction

At the core of every NLP system lies the ability to understand and predict language. This understanding is powered by two pillars:

  • Language Modeling: Predicting sequences of words.
  • Vector Representations: Converting words into numerical form that machines can process.

Without these, NLP tasks like translation, text classification, chatbots, and question answering wouldn’t be possible. This chapter dives deep into the concepts, mathematics, and practical implementations of language models and word embeddings.


📘 Section 1: What is a Language Model?

A language model (LM) assigns a probability to a sequence of words. It helps answer questions like:

  • What word is likely to come next?
  • Is this sentence grammatically correct?
  • What is the probability of this sequence?

📌 Formal Definition

Given a sequence of words:
w₁, w₂, ..., wₙ
A language model estimates:
P(w₁, w₂, ..., wₙ) = P(w₁) * P(w₂|w₁) * ... * P(wₙ|w₁, ..., wₙ₋₁)
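
For example, for the three-word sentence "I love NLP", this factorization (the chain rule of probability) gives:

P("I love NLP") = P(I) * P(love | I) * P(NLP | I, love)

Each factor is a conditional probability that the model must estimate, either from counts (n-gram models, Section 2) or with a neural network (Section 3).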


📊 Example:

| Sentence                    | Probability Estimate   |
|-----------------------------|------------------------|
| "I love NLP"                | High                   |
| "Dog purple run fast apple" | Very low (nonsensical) |


📘 Section 2: N-gram Language Models

The simplest LMs are n-gram models, which predict the next word using the previous (n-1) words.

🔹 Types:

  • Unigram (n=1): Assumes all words are independent.
  • Bigram (n=2): Depends on the previous word.
  • Trigram (n=3): Depends on the previous two words.

🧪 Code: Bigram Model in Python

python

from collections import defaultdict

corpus = "I love NLP and NLP loves me".lower().split()
bigrams = list(zip(corpus[:-1], corpus[1:]))

# Count how often each word follows another
model = defaultdict(lambda: defaultdict(int))
for w1, w2 in bigrams:
    model[w1][w2] += 1

# Normalize counts into conditional probabilities P(w2 | w1)
for w1 in model:
    total = float(sum(model[w1].values()))
    for w2 in model[w1]:
        model[w1][w2] /= total

print(dict(model["nlp"]))  # {'and': 0.5, 'loves': 0.5}
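
Tying this back to Section 1, the same model can score a whole sentence by multiplying its bigram probabilities. Here is a minimal sketch (the helper function is added here for illustration, and it assumes every bigram was seen during training; real n-gram models use smoothing to handle unseen pairs):

python

# Score a sentence as a product of bigram probabilities P(w2 | w1).
# Unseen bigrams get probability 0 here, which smoothing would avoid.
def sentence_probability(sentence):
    words = sentence.lower().split()
    prob = 1.0
    for w1, w2 in zip(words[:-1], words[1:]):
        prob *= model[w1][w2]
    return prob

print(sentence_probability("NLP loves me"))  # 0.5 * 1.0 = 0.5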


📘 Section 3: Neural Language Models

N-gram models struggle with long-range dependencies. Neural language models address this by using embeddings and hidden layers.

🔹 Key Architectures:

| Model Type  | Feature                                           |
|-------------|---------------------------------------------------|
| RNN/LSTM    | Handles sequences but struggles with long context |
| Transformer | Uses attention for global context                 |
| BERT/GPT    | Pretrained on massive corpora                     |


🧪 Code: Language Modeling with GPT2 (Hugging Face)

python

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("The future of NLP is", return_tensors="pt").input_ids

# max_length counts the prompt tokens too, so this adds only a few new words
output = model.generate(input_ids, max_length=10)
print(tokenizer.decode(output[0]))


📘 Section 4: What are Vector Representations?

Before a machine can work with words, they must be converted into numbers. Vector representations encode each word as a vector of real numbers, and in a good representation similar words end up close together in the vector space.


🧠 Why Not One-Hot Encoding?

| Word | One-Hot Encoding (Example) |
|------|----------------------------|
| Dog  | [0, 1, 0, 0, 0]            |
| Cat  | [1, 0, 0, 0, 0]            |

Problems:

  • High dimensionality
  • No similarity encoding
  • Sparse representation
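
A quick illustration of the missing similarity (the toy vocabulary below is made up): the dot product between any two distinct one-hot vectors is always zero, so the representation says nothing about which words are related.

python

import numpy as np

# Toy vocabulary with one-hot vectors (illustrative only)
vocab = ["cat", "dog", "banana", "runs", "fast"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Any two different words have similarity 0 -- "dog" is no closer to "cat" than to "banana"
print(np.dot(one_hot["dog"], one_hot["cat"]))     # 0.0
print(np.dot(one_hot["dog"], one_hot["banana"]))  # 0.0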

📘 Section 5: Word Embeddings

Word embeddings address these problems by mapping words to dense, low-dimensional vectors.


🔹 Popular Embedding Methods:

| Method   | Description                                 |
|----------|---------------------------------------------|
| Word2Vec | Predicts context or word from neighbors     |
| GloVe    | Learns word co-occurrence statistics        |
| FastText | Considers subword information (handles OOV) |
| ELMo     | Deep contextual embeddings from RNNs        |
| BERT     | Contextual embeddings using transformers    |


🧪 Code: Word2Vec with Gensim

python

from gensim.models import Word2Vec

sentences = [["I", "love", "natural", "language", "processing"],
             ["NLP", "is", "fun"]]

# Train a small Word2Vec model (CBOW by default) on the toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=2)
print(model.wv.most_similar("NLP"))
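
Each word is now a dense 50-dimensional vector (the vector_size chosen above), which can be retrieved directly:

python

vector = model.wv["NLP"]  # dense embedding for a single word
print(vector.shape)       # (50,)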


📘 Section 6: Contextual vs Non-Contextual Embeddings

| Type           | Example Model   | Embedding Changes with Context?     |
|----------------|-----------------|-------------------------------------|
| Non-contextual | Word2Vec, GloVe | No                                  |
| Contextual     | BERT, GPT       | Yes (same word, different meanings) |


🧪 Code: BERT Embedding with Hugging Face

python

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("The bank was flooded after the storm", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [batch_size, sequence_length, hidden_size]
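
To see the contextual effect directly, the sketch below (sentences chosen only for illustration) extracts BERT's vector for the word "bank" in two different contexts. A non-contextual embedding would return the identical vector both times; BERT's two vectors differ, which shows up as a cosine similarity below 1.0.

python

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def bank_vector(sentence):
    # Return BERT's contextual embedding for the token 'bank' in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")
    return outputs.last_hidden_state[0, idx]

v_river = bank_vector("He sat on the bank of the river")
v_money = bank_vector("She deposited cash at the bank")

# Less than 1.0: the same word gets different vectors in different contexts
print(torch.cosine_similarity(v_river, v_money, dim=0).item())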


📘 Section 7: Visualizing Embeddings

To understand embeddings, visualize them using dimensionality reduction:

  • PCA (Principal Component Analysis)
  • t-SNE (t-distributed stochastic neighbor embedding)

🧪 Code: t-SNE for Word Embeddings

python

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Take up to 100 words from the Word2Vec model trained earlier
words = list(model.wv.key_to_index.keys())[:100]
vectors = np.array([model.wv[word] for word in words])

# Perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=min(30, len(words) - 1), random_state=42)
reduced = tsne.fit_transform(vectors)

plt.figure(figsize=(14, 10))
for i, word in enumerate(words):
    plt.scatter(reduced[i, 0], reduced[i, 1])
    plt.annotate(word, (reduced[i, 0], reduced[i, 1]))
plt.show()
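
PCA can be used the same way; it is faster and deterministic, but only captures linear structure. A minimal sketch reusing the words and vectors arrays from above:

python

from sklearn.decomposition import PCA

# Project the same word vectors onto their first two principal components
pca = PCA(n_components=2)
reduced_pca = pca.fit_transform(vectors)
print(reduced_pca.shape)  # (number_of_words, 2)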


📘 Section 8: Embeddings for Sentences and Documents

You can embed entire sentences or documents using models like:

  • Sentence-BERT
  • Universal Sentence Encoder
  • Doc2Vec

🧪 Code: Sentence Embedding with Sentence-BERT

python

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["This is a good book", "I enjoyed reading it"]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)
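
Because every sentence is now a fixed-length vector, sentences can be compared directly. A short sketch using the embeddings computed above and the cos_sim utility from sentence-transformers:

python

from sentence_transformers import util

# Cosine similarity between the two sentence embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())  # values closer to 1.0 indicate closer meaning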


Chapter Summary Table

| Concept              | Description                               | Example Tool       |
|----------------------|-------------------------------------------|--------------------|
| Language Model       | Predicts text sequences                   | GPT, BERT, LSTMs   |
| Word Embeddings      | Dense word vectors                        | Word2Vec, GloVe    |
| Contextual Embedding | Varies with sentence context              | BERT, GPT          |
| Sentence Embedding   | Fixed vector for entire sentence/document | Sentence-BERT, USE |


FAQs


1. What is Natural Language Processing (NLP)?

Answer: NLP is a field of artificial intelligence that enables computers to understand, interpret, generate, and respond to human language in a meaningful way.

2. How is NLP different from traditional programming?

Answer: Traditional programming relies on explicit rules and structured inputs, while NLP must handle unstructured, ambiguous, and context-rich human language, which requires probabilistic models and machine learning.

3. What are some everyday applications of NLP?

Answer: NLP is used in chatbots, voice assistants (like Siri, Alexa), machine translation (Google Translate), spam detection, sentiment analysis, and auto-correct features.

4. What is the difference between NLU and NLG?

Answer:

  • NLU (Natural Language Understanding): Interprets and extracts meaning from language.
  • NLG (Natural Language Generation): Generates human-like language from data or code.

5. Which programming languages are best for working with NLP?

Answer: Python is the most popular due to its vast libraries like NLTK, spaCy, Hugging Face Transformers, TextBlob, and TensorFlow.

6. What are some challenges in NLP?

Answer: Key challenges include understanding sarcasm, ambiguity, handling different languages or dialects, recognizing context, and avoiding model bias.

7. What is a language model?

Answer: A language model is an AI system trained to predict and generate human-like language; examples include GPT, BERT, and T5. It forms the core of many NLP applications.

8. How does NLP handle multiple languages?

Answer: Multilingual models like mBERT and XLM-RoBERTa are trained on multiple languages and can perform tasks like translation, classification, and question-answering across them.

9. Is NLP only for text-based applications?

Answer: No. NLP also works with speech through technologies like speech-to-text (ASR) and text-to-speech (TTS), enabling audio-based applications like virtual assistants.

10. Can I use NLP without being a data scientist?

Answer: Yes! Many low-code/no-code tools (like MonkeyLearn, Google Cloud NLP API, and Hugging Face AutoNLP) let non-experts build NLP solutions using pre-trained models and easy interfaces.