
Tokenization Error in Sentiment Analysis Code - How Should Contractions Be Handled?

Unsolved | Independent Developers | python | 2 Posts | 2 Posters | 703 Views
    Sachin Bhatt wrote on 25 Jul 2023, 12:33 (#1)

    Hello,

    I'm attempting to design a sentiment analysis model for movie reviews similar to this one; however, I'm having trouble with tokenization. Here's an example of my code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    
    # Load the movie reviews dataset
    data = pd.read_csv('movie_reviews.csv')
    
    # Preprocess the data
    # ... (code for data preprocessing)
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)
    
    # Vectorize the text data using TfidfVectorizer
    vectorizer = TfidfVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)
    
    # Train the Logistic Regression model
    model = LogisticRegression()
    model.fit(X_train_vectorized, y_train)
    
    # Evaluate the model
    accuracy = model.score(X_test_vectorized, y_test)
    print(f"Accuracy: {accuracy}")
    
    

    The issue I'm having is that my model's accuracy is significantly lower than expected, hovering around 55%. After investigating the tokenization process, I discovered that contractions such as "don't," "can't," and "won't" are not handled appropriately. For example, "don't like" is tokenized as "don't" and "like" individually, which hurts the model's overall performance.

    Could you advise me on how to resolve this tokenization issue and ensure that contractions are handled appropriately during text preprocessing, so that I can improve the accuracy of my sentiment analysis model?
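For reference, one common approach is to expand contractions into their full forms before vectorization. Below is a minimal sketch; the `CONTRACTIONS` map and the `expand_contractions` helper are illustrative additions, not part of the original code, and the map would need to be extended for real data.

```python
import re

# Hypothetical contraction map -- illustrative, not exhaustive; extend as needed.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "didn't": "did not",
    "couldn't": "could not",
}

# Case-insensitive pattern matching any of the known contractions.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions with their expanded (lowercased) forms."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I don't like it and won't recommend it."))
# I do not like it and will not recommend it.
```

Applying this to `data['review']` before the train/test split, and pairing it with `TfidfVectorizer(ngram_range=(1, 2))` so that negation bigrams such as "not like" become features, is a reasonable first thing to try for sentiment tasks.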

    Thank you for your assistance!


      SGaist (Lifetime Qt Champion) wrote on 25 Jul 2023, 19:08 (#2)

      Hi,

      This is a Qt forum, so it's not really suited to deep-learning questions; a dedicated machine-learning forum would be a better fit. That said, you should check your preprocessing steps to ensure your data is properly prepared.
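For what it's worth, a quick way to check what the preprocessing actually produces is to inspect the fitted vectorizer's vocabulary. A minimal sketch (the two sample reviews are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy reviews to see how the default tokenizer treats contractions.
docs = ["I don't like this movie", "I really like this movie"]

vec = TfidfVectorizer()  # default token_pattern: r"(?u)\b\w\w+\b"
vec.fit(docs)

print(sorted(vec.vocabulary_))
# ['don', 'like', 'movie', 'really', 'this']
# The default token_pattern splits on the apostrophe and drops the
# single-character "t", so "don't" survives only as "don".
```

If contractions should be kept intact, passing a custom `token_pattern` that permits apostrophes (or a custom `tokenizer` callable) to `TfidfVectorizer` is another option.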

      Interested in AI ? www.idiap.ch
      Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct
