Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. Python has become a dominant language in NLP due to libraries like NLTK and SpaCy, which provide robust tools for text processing, statistical analysis, and linguistic annotation.
In this document, we'll cover the theory and usage of both NLTK and SpaCy, walk through extensive code examples, and highlight best practices. By the end, you'll be equipped with practical NLP skills using two of the most popular Python libraries.
NLTK is one of the oldest and most widely used libraries in NLP. It provides a suite of tools for tokenization, tagging, parsing, semantic reasoning, and more. Its educational and experimental focus makes it ideal for learning and research.
SpaCy is a modern, high-performance NLP library designed for production. It excels in speed and usability, offering efficient pre-trained pipelines for tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and word vectors.
pip install nltk
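After installation, download the required data packages. Recent NLTK releases (3.9 and later) may ask for differently named resources such as punkt_tab; the LookupError message names the one to download: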
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
pip install spacy
python -m spacy download en_core_web_sm
import nltk
text = "Hello world! This is an example text for NLP."
tokens = nltk.word_tokenize(text)
print(tokens)
With SpaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
sentences = nltk.sent_tokenize(text)
print(sentences)
Using SpaCy:
sentences = list(doc.sents)
print([sent.text for sent in sentences])
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered = [w.lower() for w in tokens if w.lower() not in stop_words]
print(filtered)
SpaCy version:
filtered = [token.text.lower() for token in doc if not token.is_stop]
print(filtered)
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(w) for w in filtered]
lemmas = [lemmatizer.lemmatize(w) for w in filtered]
print(stems)
print(lemmas)
Using SpaCy's lemmatizer:
lemmas = [token.lemma_ for token in doc if not token.is_stop]
print(lemmas)
nltk_pos = nltk.pos_tag(tokens)
print(nltk_pos)
With SpaCy:
pos_tags = [(token.text, token.pos_, token.tag_) for token in doc]
print(pos_tags)
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
tree = nltk.ne_chunk(nltk_pos)
print(tree)
With SpaCy's pipeline:
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
SpaCy supports accurate dependency parsing:
for token in doc:
    print(token.text, token.dep_, token.head.text)
NLTK does not ship a pretrained dependency parser, so there is no direct equivalent in classic NLTK. It does support rule-based chunking:
# Noun-phrase chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(nltk_pos)
print(result)
from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')
print(syn[0].definition())
lemmas = wn.lemmas('dog')
print([l.name() for l in lemmas])
syn1 = wn.synsets('car')[0]
syn2 = wn.synsets('automobile')[0]
print(syn1.path_similarity(syn2))
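To show what the path similarity score measures, here is a minimal pure-Python sketch on a made-up taxonomy. It assumes the usual definition, 1 / (shortest path length + 1), and the `HYPERNYMS` table and helper names are hypothetical; real WordNet is far larger and is queried through `wn.synsets` as above.

```python
from collections import deque

# Hypothetical mini-taxonomy (child -> parent), for illustration only.
HYPERNYMS = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "car": "vehicle",
    "vehicle": "artifact",
}

def shortest_path_length(a, b):
    """Breadth-first search over the taxonomy treated as an undirected graph."""
    neighbours = {}
    for child, parent in HYPERNYMS.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected

def path_similarity(a, b):
    """Return 1 / (distance + 1), or None when no path exists."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)

print(path_similarity("dog", "dog"))  # identical concepts, distance 0
print(path_similarity("dog", "cat"))  # four edges apart via "carnivore"
print(path_similarity("dog", "car"))  # disconnected in this toy graph
```

Identical concepts score 1.0 and the score shrinks as the connecting path grows, which is why two words that share a synset come out as maximally similar.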
from gensim import corpora, models
from nltk.tokenize import word_tokenize
docs = [
    "Cats are small animals.",
    "Dogs are loyal and friendly.",
    "Cats and dogs are popular pets."
]
texts = [[w.lower() for w in word_tokenize(doc) if w.isalpha()] for doc in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())
print(X.toarray())
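To make the bag-of-words matrix less of a black box, the counting step can be sketched in plain Python. This is not scikit-learn's implementation: the `bag_of_words` helper and the tiny `STOP_WORDS` set are hypothetical, and the tokenizer only approximates the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; scikit-learn's built-in
# English list is much larger.
STOP_WORDS = {"are", "and"}

def bag_of_words(documents):
    """Return (sorted vocabulary, count matrix) for a list of strings."""
    tokenized = [
        [t for t in re.findall(r"\b\w\w+\b", d.lower()) if t not in STOP_WORDS]
        for d in documents
    ]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary term, in vocabulary order.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

sample_docs = [
    "Cats are small animals.",
    "Dogs are loyal and friendly.",
    "Cats and dogs are popular pets.",
]
vocab, counts = bag_of_words(sample_docs)
print(vocab)
for row in counts:
    print(row)
```

Each row is one document and each column one vocabulary term, so word order is discarded and only frequencies remain.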
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())
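The weighting behind those TF-IDF numbers can be sketched by hand. The version below assumes scikit-learn's default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); the `tfidf_rows` helper and the pre-tokenized `toy_docs` are hypothetical and skip the real tokenizer and stop-word handling.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Hypothetical pre-tokenized input with stop words already removed.
toy_docs = [
    ["cats", "small", "animals"],
    ["dogs", "loyal", "friendly"],
    ["cats", "dogs", "popular", "pets"],
]
terms, rows = tfidf_rows(toy_docs)
for term, weight in zip(terms, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "cats" appears in two of the three documents and therefore receives a lower weight than "small" or "animals", which each appear in only one.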
# Requires the medium model: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
doc = nlp("SpaCy provides word embeddings.")
vec = doc[2].vector
print(len(vec))
doc1 = nlp("I like cats.")
doc2 = nlp("I like dogs.")
print(doc1.similarity(doc2))
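By default, SpaCy's document similarity is the cosine similarity between the averaged word vectors of the two documents. The sketch below reproduces that calculation in plain Python; the three-dimensional `toy_vectors` are invented for illustration and bear no relation to the vectors in `en_core_web_md`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Made-up 3-dimensional "word vectors", for illustration only.
toy_vectors = {
    "i": [0.1, 0.3, 0.0],
    "like": [0.2, 0.1, 0.4],
    "cats": [0.9, 0.1, 0.2],
    "dogs": [0.8, 0.2, 0.1],
}
doc_a = average_vector([toy_vectors[w] for w in ["i", "like", "cats"]])
doc_b = average_vector([toy_vectors[w] for w in ["i", "like", "dogs"]])
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Because two of the three words are shared, the averaged vectors point in almost the same direction and the score is close to 1. This also explains why short sentences with overlapping vocabulary tend to score high regardless of meaning.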
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
docs = [
    "I love programming.",
    "This movie is terrible.",
    "The food was amazing!",
    "I hate this weather."
]
labels = [1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=42)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB())
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
print(classification_report(y_test, preds))
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
from spacy.training import Example
from spacy.util import minibatch
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("ANIMAL")
TRAIN_DATA = [
    ("Cats are animals", {"entities": [(0, 4, "ANIMAL")]}),
    ("Dogs are animals too", {"entities": [(0, 4, "ANIMAL")]})
]
# Resume from the pretrained weights; begin_training() would reinitialize them
optimizer = nlp.resume_training()
for itn in range(20):
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer)
doc = nlp("I have a pet dog")
print([(ent.text, ent.label_) for ent in doc.ents])
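Entity annotations in this format are character offsets with an exclusive end, so `(0, 4)` covers the first four characters of the text. A small sanity check before training can catch off-by-one mistakes. The `check_entity_offsets` helper below is a hypothetical sketch in plain Python and is not part of SpaCy.

```python
def check_entity_offsets(train_data):
    """Return (text, span_text, label) triples, raising on suspicious offsets."""
    spans = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"offsets {start}-{end} fall outside {text!r}")
            span_text = text[start:end]
            # Leading or trailing whitespace usually signals an off-by-one error.
            if span_text != span_text.strip():
                raise ValueError(f"span {span_text!r} has stray whitespace")
            spans.append((text, span_text, label))
    return spans

sample = [
    ("Cats are animals", {"entities": [(0, 4, "ANIMAL")]}),
    ("Dogs are animals too", {"entities": [(0, 4, "ANIMAL")]}),
]
for text, span_text, label in check_entity_offsets(sample):
    print(f"{label}: {span_text!r} in {text!r}")
```

Printing the sliced text next to its label makes it easy to confirm that each annotation covers exactly the intended word.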
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
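The three scores come from simple counts of true positives, false positives, and false negatives. The following sketch works through the arithmetic for the same labels; `binary_prf` is a hypothetical helper for the binary case, not a scikit-learn function.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Same labels as above: 2 true positives, 0 false positives, 1 false negative.
precision, recall, f1 = binary_prf([1, 0, 1, 1], [1, 0, 0, 1])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Every predicted positive is correct, so precision is 1.0, but one of the three actual positives is missed, so recall is 2/3 and F1, their harmonic mean, is 0.8.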
Leverage NLTK's VADER or train your own classifier for sentiment tasks. SpaCy offers extensible pipelines too.
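For summarization, use extractive techniques (e.g., Gensim's summarize function, which was removed in Gensim 4.0 and is only available in earlier releases) or abstractive methods with deep learning.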
Use SpaCy for preprocessing and integrate with transformer models via Hugging Face.
Combine SpaCy for preprocessing and Gensim for LDA topic modeling.
This comprehensive guide introduces NLP with both NLTK and SpaCy. We covered tokenization, tagging, parsing, normalization, embeddings, classification, and more advanced topics. NLTK provides a flexible platform for experimentation and learning, while SpaCy focuses on speed and production readiness.
By combining both libraries with tools like scikit-learn and Gensim, you are well placed to build effective NLP pipelines for real-world text mining, classification, summarization, translation, and information extraction tasks.
Copyrights © 2024 letsupdateskills All rights reserved