
Tokenization Error in Sentiment Analysis Code - How Should Contractions Be Handled?

Unsolved | Independent Developers | python | 2 Posts | 2 Posters | 703 Views
    Sachin Bhatt wrote on 25 Jul 2023, 12:33 (#1)

    Hello,

    I'm attempting to design a sentiment analysis model for movie reviews similar to this one; however, I'm having trouble with tokenization. Here's an example of my code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    
    # Load the movie reviews dataset
    data = pd.read_csv('movie_reviews.csv')
    
    # Preprocess the data
    # ... (code for data preprocessing)
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)
    
    # Vectorize the text data using TfidfVectorizer
    vectorizer = TfidfVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)
    
    # Train the Logistic Regression model
    model = LogisticRegression()
    model.fit(X_train_vectorized, y_train)
    
    # Evaluate the model
    accuracy = model.score(X_test_vectorized, y_test)
    print(f"Accuracy: {accuracy}")
    
    

    The issue I'm having is that my model's accuracy is significantly lower than expected, hovering around 55%. After investigating the tokenization process, I discovered that contractions such as "don't," "can't," and "won't" are not handled appropriately. For example, "don't like" is tokenized as "don't" and "like" individually, which hurts the model's overall performance.

    Could you advise me on how to resolve this tokenization issue and ensure that contractions are handled appropriately during text preprocessing, so that I can improve the accuracy of my sentiment analysis model?
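For reference, one common approach is to expand contractions into their full forms before vectorization. Below is a minimal sketch; the `CONTRACTIONS` map and the `expand_contractions` helper are illustrative additions, not part of the original code, and the map would need to be extended for real data.

```python
import re

# Hypothetical contraction map -- illustrative, not exhaustive; extend as needed.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "didn't": "did not",
    "couldn't": "could not",
}

# Case-insensitive pattern matching any of the known contractions.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions with their expanded (lowercased) forms."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I don't like it and won't recommend it."))
# I do not like it and will not recommend it.
```

Applying this to `data['review']` before the train/test split, and pairing it with `TfidfVectorizer(ngram_range=(1, 2))` so that negation bigrams such as "not like" become features, is a reasonable first thing to try for sentiment tasks.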

    Thank you for your assistance!


      SGaist (Lifetime Qt Champion) wrote on 25 Jul 2023, 19:08 (#2)

      Hi,

      This is a Qt forum, so it's not really suited to deep-learning questions; a dedicated machine-learning forum would be a better fit. That said, you should check your preprocessing steps to ensure your data is properly prepared.
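For what it's worth, a quick way to check what the preprocessing actually produces is to inspect the fitted vectorizer's vocabulary. A minimal sketch (the two sample reviews are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy reviews to see how the default tokenizer treats contractions.
docs = ["I don't like this movie", "I really like this movie"]

vec = TfidfVectorizer()  # default token_pattern: r"(?u)\b\w\w+\b"
vec.fit(docs)

print(sorted(vec.vocabulary_))
# ['don', 'like', 'movie', 'really', 'this']
# The default token_pattern splits on the apostrophe and drops the
# single-character "t", so "don't" survives only as "don".
```

If contractions should be kept intact, passing a custom `token_pattern` that permits apostrophes (or a custom `tokenizer` callable) to `TfidfVectorizer` is another option.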

      Interested in AI ? www.idiap.ch
      Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct
