Text Cleaning Process: the very first step in every NLP and AI/ML project.

Hey Devs,

Quick question.

Your app collects user text every day: reviews, search queries, form inputs, messages.

But does your app actually understand any of it?

Not the way you think. To a computer, this:

"I Absolutely LOVED using this App!! It's Running perfectly and the login issues are FIXED now :)"

...is just a messy string of characters. Mixed case. Punctuation everywhere. An emoticon. Contractions.

Before any AI model can work with it, you have to clean it first.

This process is called text preprocessing. It is the very first step in every NLP and AI/ML project. And I just learned it — so I'm sharing it here, as simply as possible.

The 7-Step Text Cleaning Pipeline

We'll take that one messy sentence and watch it transform, step by step.

Step 1 - Lowercasing

"LOVED" and "loved" are two completely different words to a computer. One line fixes it.

text = "I Absolutely LOVED using this App!!"
text = text.lower()

# output: "i absolutely loved using this app!!"

Simple, but it matters: the pattern in the next step keeps only a-z, so any capital letter you forget to lowercase would simply be deleted.

Step 2 - Regex Cleaning

Punctuation, emojis, symbols: they add noise, not meaning. Remove them with a single pattern.

import re

text = re.sub(r"[^a-z\s']", '', text)

# keeps: letters, spaces, apostrophes
# removes: !! :) digits, symbols
# output: "i absolutely loved using this app it's running perfectly..."

The [^a-z\s'] pattern means: keep only lowercase letters, spaces, and apostrophes. Remove everything else.
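Here is a minimal sketch of steps 1 and 2 run on the full example sentence, using only the standard library. The whitespace-collapsing line is an addition that is not part of the pipeline above; it is there because deleting symbols can leave stray spaces behind.

```python
import re

raw = ("I Absolutely LOVED using this App!! It's Running perfectly "
       "and the login issues are FIXED now :)")

text = raw.lower()                        # step 1: lowercase
text = re.sub(r"[^a-z\s']", "", text)     # step 2: keep only a-z, whitespace, apostrophes
text = re.sub(r"\s+", " ", text).strip()  # extra (assumption): collapse whitespace, trim ends

print(text)
# i absolutely loved using this app it's running perfectly and the login issues are fixed now
```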

Step 3 - Tokenisation

Split the clean sentence into individual words — a list of tokens. Each word becomes a separate, processable unit.

import nltk

tokens = nltk.word_tokenize(text)

# ["i", "absolutely", "loved", "using", "this", "app",
#  "it", "'s", "running", "perfectly", "and", "the",
#  "login", "issues", "are", "fixed", "now"]

Notice: "it's" splits into "it" and "'s" - NLTK handles contractions intentionally. Each part can be analysed separately.

This is the foundation. Every next step works on this list.
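If you want to see the idea without installing anything, a regular expression can approximate what the tokenizer does on this already-cleaned sentence. Treat it as a rough stand-in, not NLTK's algorithm: the pattern is my own assumption, and it splits some contractions differently (for example, "don't" becomes "don" + "'t" here, where NLTK returns "do" + "n't").

```python
import re

text = ("i absolutely loved using this app it's running perfectly "
        "and the login issues are fixed now")

# A token is a run of letters, optionally led by an apostrophe,
# so "it's" comes out as "it" followed by "'s".
tokens = re.findall(r"'?[a-z]+", text)

print(tokens)
```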

Step 4 - Stop Words Removal

Words like "i", "this", "and", "the" carry almost zero meaning. NLTK has a built-in list of roughly 180 to 200 common English stop words, depending on the version. Filter them out.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w not in stop_words]

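# ["absolutely", "loved", "using", "app", "'s",
#  "running", "perfectly", "login", "issues", "fixed"]

Look at that list. You can already follow the review without reading the original sentence. Most of the noise is gone. Two leftovers remain because neither is on NLTK's list: "using" and the "'s" fragment. Add them to stop_words yourself if you want them out.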
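A generic stop list rarely fits a project exactly, so it is common to extend it. The sketch below uses a small hand-written set as a stand-in for stopwords.words('english') so it runs without NLTK. BASE_STOP_WORDS and EXTRA_STOP_WORDS are illustrative names and contents, not NLTK's actual list.

```python
tokens = ["i", "absolutely", "loved", "using", "this", "app", "it", "'s",
          "running", "perfectly", "and", "the", "login", "issues", "are",
          "fixed", "now"]

# Hypothetical stand-in for set(stopwords.words('english')).
BASE_STOP_WORDS = {"i", "this", "it", "and", "the", "are", "now"}
# Project-specific additions: a filler word and a tokenizer fragment.
EXTRA_STOP_WORDS = {"using", "'s"}

stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS
filtered = [w for w in tokens if w not in stop_words]

print(filtered)
# ['absolutely', 'loved', 'app', 'running', 'perfectly', 'login', 'issues', 'fixed']
```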

Step 5 - Stemming

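"running", "runs", "ran": same idea, different words. A computer sees them as completely different. Stemming cuts each word to a rough root form by stripping common endings, so "running" and "runs" both become "run". An irregular form like "ran" has no ending to strip and stays as it is.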

from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered]

# ["absolut", "love", "use", "app", "'s",
#  "run", "perfectli", "login", "issu", "fix"]

Wait, "absolut"? "perfectli"? Those are not real words.

That is the limitation of stemming. It is fast, but crude. It just chops the ending off. Sometimes it goes too far. Which brings us to the smarter version.
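To see what "chopping the ending off" means, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases; the suffix list and the three-letter minimum are assumptions chosen for illustration.

```python
# Hypothetical, tiny rule set; longer suffixes are checked first.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "runs", "ran", "loved", "perfectly"]:
    print(w, "->", crude_stem(w))
# running -> runn
# runs -> run
# ran -> ran
# loved -> lov
# perfectly -> perfect
```

Three forms of the same verb end up as three different strings. Real stemmers narrow that gap with more rules, and a dictionary-based approach closes it.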

Step 6 - Lemmatization

Same job as stemming — but properly. Instead of blindly chopping, it checks a dictionary and finds the actual root word (the lemma).

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]

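# ["absolutely", "loved", "using", "app", "'s",
#  "running", "perfectly", "login", "issue", "fixed"]

Real words, and "issues" is now "issue". But "loved", "running" and "fixed" came back unchanged: lemmatize() treats every word as a noun unless told otherwise. Pass the part of speech, for example lemmatizer.lemmatize("loved", pos="v"), and you get "love", "run" and "fix". Slower than stemming, but cleaner output.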

Rule of thumb: Use stemming when speed matters. Use lemmatization when accuracy matters. For most real projects, use lemmatization.
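The "checks a dictionary" part is the key difference from stemming, and a lookup table is enough to illustrate it. The table below is a hand-written, hypothetical stand-in for WordNet that covers only a few words from this example.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "loved": "love",
    "running": "run",
    "ran": "run",
    "issues": "issue",
    "fixed": "fix",
}

def lookup_lemma(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

words = ["absolutely", "loved", "running", "ran", "issues", "fixed"]
print([lookup_lemma(w) for w in words])
# ['absolutely', 'love', 'run', 'run', 'issue', 'fix']
```

Irregular forms such as "ran" resolve correctly because nothing is being chopped. The cost is that the dictionary has to exist and be searched.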

Step 7 - N-grams

Right now, we have been looking at one word at a time. But some meaning only exists in combinations.

"login" alone: fine. "issues" alone: could be anything. "login issues" together: that is a specific problem users are reporting.

N-grams capture these combinations. N = number of words. 2 words = bigram. 3 words = trigram.

from nltk.util import ngrams

bigrams = list(ngrams(lemmatized, 2))

# a few of the resulting pairs:
# ("absolutely", "loved")
# ("running", "perfectly")
# ("login", "issue")

Now your app can work with phrases and a little context, not just individual words.
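Under the hood, an n-gram is just a sliding window over the token list. A minimal standard-library sketch (make_ngrams is a hypothetical helper name, not part of NLTK):

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["login", "issues", "are", "fixed"]

print(make_ngrams(tokens, 2))
# [('login', 'issues'), ('issues', 'are'), ('are', 'fixed')]
print(make_ngrams(tokens, 3))
# [('login', 'issues', 'are'), ('issues', 'are', 'fixed')]
```

A list of k tokens yields k - n + 1 n-grams, so four tokens give three bigrams and two trigrams.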

The Full Transformation

Start:

"I Absolutely LOVED using this App!! It's Running perfectly 
and the login issues are FIXED now :)"

After all 7 steps:

["absolutely", "loved", "using", "app", "'s", "running", "perfectly", "login", "issue", "fixed"]

# Bigrams (a few of them):
("absolutely", "loved"), ("running", "perfectly"), ("login", "issue")

The computer can now actually work with this. Sentiment analysis. Search. Auto-tagging. Recommendations. All of it starts here, with clean data.

Setup - One Library Does Everything

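# In your terminal:
#   pip install nltk

import nltk

nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # newer NLTK releases look for this name instead
nltk.download('stopwords')
nltk.download('wordnet')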

That is it. One library. Seven steps. Every NLP project starts here.

Why Should You Care as a Developer?

You do not need to become an ML engineer. But knowing this means you can now build:

  • Search that actually works — remove stop words before indexing

  • Auto-tagging on user reviews or blog posts

  • Smarter filters on feedback — group by topic automatically

  • Foundation for any AI feature you want to add to your app
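As a taste of the auto-tagging and grouping ideas, counting bigrams across many texts surfaces the phrases users keep repeating. The reviews below are made-up sample data, already cleaned and tokenised, and top_phrases is a hypothetical helper.

```python
from collections import Counter

# Made-up, pre-cleaned token lists standing in for real reviews.
reviews = [
    ["login", "issue", "fixed"],
    ["still", "login", "issue"],
    ["love", "dark", "mode"],
    ["login", "issue", "again"],
]

def top_phrases(token_lists, k=1):
    """Count adjacent word pairs over all texts and return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(k)

print(top_phrases(reviews))
# [(('login', 'issue'), 3)]
```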

JavaScript developers: The natural library covers these concepts, too. But Python's NLTK is the standard for learning. Start there, bring the concepts back.

Watch the Full Video

I made a complete video on this — one sentence transforming live through every single step, with code running in real time.

Drop a comment: which step was completely new for you?

Until next week,

Rajon
Software Developer | Learning AI/ML in public

Developer Data: practical insights for developers, every week. developer-data.beehiiv.com