Text Cleaning Process: the very first step in every NLP and AI/ML project.
Hey Devs,
Quick question.
Your app collects user text every day: reviews, search queries, form inputs, messages.
But does your app actually understand any of it?
Not the way you think. To a computer, this:
"I Absolutely LOVED using this App!! It's Running perfectly and the login issues are FIXED now :)"
...is just a messy string of characters. Mixed case. Punctuation everywhere. An emoticon. A contraction.
Before any AI model can work with it, you have to clean it first.
This process is called text preprocessing. It is the very first step in every NLP and AI/ML project. And I just learned it — so I'm sharing it here, as simply as possible.
The 7-Step Text Cleaning Pipeline
We'll take that one messy sentence and watch it transform, step by step.
Step 1 - Lowercasing
"LOVED" and "loved" are two completely different words to a computer. One line fixes it.
text = "I Absolutely LOVED using this App!!"
text = text.lower()
# output: "i absolutely loved using this app!!"
Simple. But skip it and the later steps break: the regex in Step 2 only keeps lowercase letters, and NLTK's stop word list is lowercase too.
Step 2 - Regex Cleaning
Punctuation, emojis, symbols: they add noise, not meaning. Remove them with a single pattern.
import re
text = re.sub(r"[^a-z\s']", '', text)
# keeps: letters, spaces, apostrophes
# removes: !! :) digits, symbols
# output: "i absolutely loved using this app it's running perfectly..."
The [^a-z\s'] pattern means: keep only lowercase letters, whitespace, and apostrophes. Remove everything else.
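One thing worth checking for yourself is the order of Steps 1 and 2. The pattern only whitelists lowercase letters, so running it before lowercasing deletes every capital. Here is a small sketch using only the standard library; the variable names are mine, not part of any library.

```python
import re

PATTERN = r"[^a-z\s']"  # same character class as in Step 2

raw = ("I Absolutely LOVED using this App!! It's Running perfectly "
       "and the login issues are FIXED now :)")

# Right order: lowercase first, then strip everything outside the whitelist
cleaned = re.sub(PATTERN, "", raw.lower())

# Wrong order: the regex runs first and deletes every capital letter
broken = re.sub(PATTERN, "", raw).lower()

print(cleaned)
print(broken)
```

In the second result "LOVED" and "FIXED" are gone completely, and "Absolutely" has lost its first letter.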
Step 3 - Tokenisation
Split the clean sentence into individual words — a list of tokens. Each word becomes a separate, processable unit.
import nltk
tokens = nltk.word_tokenize(text)
# ["i", "absolutely", "loved", "using", "this", "app",
#  "it", "'s", "running", "perfectly", "and", "the",
#  "login", "issues", "are", "fixed", "now"]
Notice: "it's" splits into "it" and "'s". NLTK splits contractions on purpose, so each part can be analysed separately.
This is the foundation. Every next step works on this list.
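If you want a feel for what a tokenizer does before installing anything, here is a rough standard-library approximation. It is not NLTK's algorithm. It is only a sketch, and the helper name `simple_tokenize` is made up for this example. It happens to split "it's" the same way, but it will differ from NLTK on other contractions such as "don't".

```python
import re

def simple_tokenize(text):
    """Split cleaned, lowercased text into word tokens.

    Matches either a run of letters, or an apostrophe followed by letters,
    so "it's" becomes "it" and "'s".
    """
    return re.findall(r"[a-z]+|'[a-z]+", text)

print(simple_tokenize("it's running perfectly"))
# ['it', "'s", 'running', 'perfectly']
```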
Step 4 - Stop Words Removal
Words like "i", "this", "and", "the" carry almost no meaning on their own. NLTK ships a built-in list of common English stop words (somewhere between 150 and 200 entries, depending on the version of the NLTK data). Filter them out.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# isalpha() also drops leftover fragments such as "'s"
filtered = [w for w in tokens if w.isalpha() and w not in stop_words]
# ["absolutely", "loved", "using", "app", "running",
#  "perfectly", "login", "issues", "fixed"]
# ("using" stays: it is not on NLTK's list)
Look at that list. You already understand the review without reading the original sentence. The meaning is still there. The noise is gone.
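One caveat I ran into: generic stop lists usually contain negations like "not" and "no" (NLTK's English list does), and dropping them can flip the meaning of a review. Below is a minimal sketch of keeping a few words on purpose. The mini stop list and the `remove_stop_words` helper are hypothetical, written only for illustration, so it runs without NLTK.

```python
# Hypothetical mini stop list for illustration; NLTK's real list is much longer
MINI_STOP_WORDS = {"i", "this", "it", "is", "the", "and", "not", "no"}

# Words we want to keep even though the stop list contains them
KEEP = {"not", "no"}

def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except the ones explicitly listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return [w for w in tokens if w not in effective]

tokens = ["the", "login", "is", "not", "fixed"]
plain = remove_stop_words(tokens, MINI_STOP_WORDS)
careful = remove_stop_words(tokens, MINI_STOP_WORDS, KEEP)

print(plain)    # ['login', 'fixed']
print(careful)  # ['login', 'not', 'fixed']
```

The first result reads like good news. The second one keeps the word that actually carries the complaint.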
Step 5 - Stemming
"running", "runs", "ran" — same idea, different words. A computer sees them as completely different. Stemming cuts each word to a rough root form.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered]
# ["absolut", "love", "use", "app", "run",
#  "perfectli", "login", "issu", "fix"]
Wait, "absolut"? "perfectli"? Those are not real words.
That is the limitation of stemming. It is fast, but crude. It just chops the ending off. Sometimes it goes too far. Which brings us to the smarter version.
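To make "it just chops the ending off" concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions. It is only a toy with a made-up rule list, meant to show why chopping alone produces odd roots.

```python
# Hypothetical rule list, checked in this order (not Porter's rules)
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["loved", "running", "perfectly", "issues", "fixed", "app"]
print([toy_stem(w) for w in words])
# ['lov', 'runn', 'perfect', 'issue', 'fix', 'app']
```

"lov" and "runn" are exactly the kind of damage a real stemmer needs extra rules to avoid, and even then the result is not always a dictionary word.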
Step 6 - Lemmatization
Same job as stemming — but properly. Instead of blindly chopping, it checks a dictionary and finds the actual root word (the lemma).
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
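# pos='v' treats each token as a verb; the default (noun) would leave "loved" and "running" unchanged
lemmatized = [lemmatizer.lemmatize(w, pos='v') for w in filtered]
# ["absolutely", "love", "use", "app", "run",
#  "perfectly", "login", "issue", "fix"]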
Real words. Proper roots. Slower than stemming — but cleaner output.
Rule of thumb: Use stemming when speed matters. Use lemmatization when accuracy matters. For most real projects, use lemmatization.
Step 7 - N-grams
Right now, we have been looking at one word at a time. But some meaning only exists in combinations.
"login" alone - fine. "issues" alone — could be anything. "login issues" together — that is a specific problem users are reporting.
N-grams capture these combinations. N = number of words. 2 words = bigram. 3 words = trigram.
from nltk.util import ngrams
bigrams = list(ngrams(lemmatized, 2))
# includes, among others:
# ("login", "issue")
# ("run", "perfectly")
# ("absolutely", "love")
Now your app can work with short phrases and a bit of local context, not just individual words.
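There is no magic inside an n-gram function: it is a sliding window over the token list. Here is a minimal pure-Python sketch of the same idea. The name `make_ngrams` is mine, and this is not NLTK's implementation.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["login", "issue", "fix"]
print(make_ngrams(tokens, 2))  # [('login', 'issue'), ('issue', 'fix')]
print(make_ngrams(tokens, 3))  # [('login', 'issue', 'fix')]
```

A list of k tokens gives k - n + 1 n-grams, and an empty list when n is larger than k.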
The Full Transformation
Start:
"I Absolutely LOVED using this App!! It's Running perfectly
and the login issues are FIXED now :)"
After all 7 steps:
["absolutely", "love", "use", "app", "run", "perfectly", "login", "issue", "fix"]
# Bigrams (a selection):
("login", "issue"), ("run", "perfectly"), ("absolutely", "love")
The computer can now actually work with this. Sentiment analysis. Search. Auto-tagging. Recommendations. All of it starts here, with clean data.
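If you want the whole thing as one reusable function, one possible shape is below. Treat it as a sketch under my own assumptions: the `preprocess` name, its parameters, and the mini stop list are all made up for illustration, and the defaults are simple stand-ins so it runs without NLTK.

```python
import re

def preprocess(text, tokenize=str.split, stop_words=frozenset(),
               normalize=lambda w: w):
    """Lowercase, regex-clean, tokenize, drop stop words, normalize each token."""
    text = text.lower()
    text = re.sub(r"[^a-z\s']", "", text)
    tokens = tokenize(text)
    # isalpha() drops fragments that still contain an apostrophe
    kept = [w for w in tokens if w.isalpha() and w not in stop_words]
    return [normalize(w) for w in kept]

# Hypothetical mini stop list, only so the example is self-contained
mini_stop_words = {"i", "this", "and", "the", "are", "now"}

result = preprocess("The login issues are FIXED now :)",
                    stop_words=mini_stop_words)
print(result)  # ['login', 'issues', 'fixed']
```

With NLTK installed, you could pass `nltk.word_tokenize` as `tokenize`, the NLTK stop word set as `stop_words`, and a stemmer's or lemmatizer's method as `normalize`, without touching the function itself.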
Setup - One Library Does Everything
pip install nltk
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('stopwords')
nltk.download('wordnet')
That is it. One library. Seven steps. Every NLP project starts here.
Why Should You Care as a Developer?
You do not need to become an ML engineer. But knowing this means you can now build:
Search that actually works — remove stop words before indexing
Auto-tagging on user reviews or blog posts
Smarter filters on feedback — group by topic automatically
Foundation for any AI feature you want to add to your app
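As a taste of the auto-tagging idea, here is a minimal sketch that counts which cleaned tokens show up in the most reviews. The review data and the `top_tags` helper are hypothetical. A real tagger would need more than raw counts, but this is where it starts.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one per review)
reviews = [
    ["login", "issue", "fix"],
    ["login", "slow", "crash"],
    ["love", "app", "login"],
]

def top_tags(token_lists, k=3):
    """Count in how many reviews each token appears; return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))  # set() so a word counts once per review
    return counts.most_common(k)

print(top_tags(reviews, k=1))  # [('login', 3)]
```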
JavaScript developers: The natural library covers these concepts, too. But Python's NLTK is the standard for learning. Start there, bring the concepts back.
Watch the Full Video
I made a complete video on this — one sentence transforming live through every single step, with code running in real time.
Drop a comment: which step was completely new for you?
Until next week,
Rajon Software Developer | Learning AI/ML in public
Developer Data: practical insights for developers, every week. developer-data.beehiiv.com