Abstract:
This study presents a multimodal emotion-aware product recommendation system that integrates
real-time Facial Expression Recognition (FER) and transformer-based sentiment analysis
to enhance personalization in digital environments. Traditional recommender systems rely
mainly on historical interactions or textual reviews and often overlook users’ current emotional
states, leading to inappropriate recommendations. To address this limitation, the proposed system
fuses emotional cues from facial expressions and user-generated text. FER is performed
across seven emotion categories—happy, sad, angry, fear, disgust, surprise, and neutral—using
a fine-tuned EfficientNetB0 model trained on JAFFE, CK+, FER subsets, and self-collected
webcam images, achieving an overall accuracy of 88% in real-time conditions. Sentiment analysis
uses a fine-tuned DistilBERT model that classifies text into positive, neutral, and negative
categories with accuracy exceeding 90%. A rule-based multimodal fusion strategy combines
outputs from both modalities, resolving conflicting emotional cues and improving emotional inference
reliability by approximately 10–12% compared to unimodal approaches. The inferred
emotional state is mapped to a structured recommendation database, generating personalized
product suggestions. The system is implemented using a Streamlit-based interface. Experimental
results indicate that the multimodal approach produces more emotionally appropriate
recommendations than single-modality systems.
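
The rule-based fusion step described above can be illustrated with a minimal Python sketch. It is not the system's actual rule set. The valence mapping, the 0.5 confidence threshold, the rule order, and the function name `fuse` are all assumptions made for illustration. The sketch takes one FER label and one sentiment label, each with a confidence score, and returns a fused emotional state together with the name of the rule that fired.

```python
# Illustrative rule-based fusion of a facial-emotion label and a text-sentiment
# label. Every rule, threshold, and name here is an assumption for this sketch.

SENTIMENT_LABELS = {"positive", "neutral", "negative"}

# Hypothetical valence assigned to each of the seven FER categories.
FACE_VALENCE = {
    "happy": "positive",
    "surprise": "positive",
    "neutral": "neutral",
    "sad": "negative",
    "angry": "negative",
    "fear": "negative",
    "disgust": "negative",
}


def fuse(face_label, face_conf, text_label, text_conf, min_conf=0.5):
    """Return (fused_state, rule_name) for one face/text prediction pair."""
    if face_label not in FACE_VALENCE:
        raise ValueError(f"unknown facial emotion: {face_label!r}")
    if text_label not in SENTIMENT_LABELS:
        raise ValueError(f"unknown sentiment: {text_label!r}")

    face_ok = face_conf >= min_conf
    text_ok = text_conf >= min_conf

    # Rule 1: if exactly one modality is confident, trust that modality.
    if face_ok and not text_ok:
        return face_label, "face_only"
    if text_ok and not face_ok:
        return text_label, "text_only"

    # From here on, both modalities are confident or both are not.
    face_valence = FACE_VALENCE[face_label]

    # Rule 2: valences agree, so keep the more specific facial label.
    if face_valence == text_label:
        return face_label, "agreement"

    # Rule 3: a neutral reading defers to the non-neutral modality.
    if face_valence == "neutral":
        return text_label, "text_overrides_neutral_face"
    if text_label == "neutral":
        return face_label, "face_overrides_neutral_text"

    # Rule 4: opposite valences, so the more confident modality wins
    # (ties go to text in this sketch).
    if face_conf > text_conf:
        return face_label, "conflict_face_wins"
    return text_label, "conflict_text_wins"


if __name__ == "__main__":
    print(fuse("happy", 0.9, "positive", 0.8))
    print(fuse("happy", 0.6, "negative", 0.95))
```

In this sketch the fused state is either a facial emotion or a sentiment label, so a downstream lookup into the recommendation database would need entries for both kinds of key. Returning the rule name alongside the state makes it easy to audit how often conflicts occur.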