Social Media Insights About COVID-19 in Portugal: A Text Mining Approach


  • Carolina Ferraz Marreiros, Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, 1649-026 Lisboa, Portugal
  • João Bone, Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, 1649-026 Lisboa, Portugal, Select Data, Anaheim, CA, USA
  • Joao C. Ferreira, Inov Inesc Inovação—Instituto de Novas Tecnologias, 1000-029 Lisbon, Portugal, Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, 1649-026 Lisboa, Portugal
  • Ricardo Ribeiro, INESC-ID: INESC-ID Lisboa, Portugal, Iscte – Instituto Universitário de Lisboa, Portugal



Social media, COVID-19, natural language processing, sentiment analysis, topic modeling, public opinion


The rapid spread of COVID-19 around the world had a significant impact on daily life. As in other countries, measures such as curfews and the use of masks were taken in Portugal to combat the exponential increase in cases. Thus, in parallel with its direct consequences for health and the healthcare sector, the pandemic also caused changes in human behavior from a sociological viewpoint.

The objective of this dissertation is to gain insight into the reality of COVID-19. For this purpose, real-time data was extracted from three sources: two social media platforms, Twitter and Reddit, and Público, a Portuguese online newspaper. The adopted approach, based on topic modelling and sentiment analysis, was validated in the Portuguese context using data covering a period of one year, but it can equally be applied to similar situations and other countries to support decision-making.

After extraction, the data was prepared for the application of natural language processing (NLP) tools specific to the Portuguese language, which can be challenging because of the language's lexical richness. With the gathered information, a dashboard was built to gain insights into the COVID-19 pandemic in Portugal. It was concluded that the topics discussed on social media reflect the events related to the pandemic. In a final stage, these dashboards were evaluated by public health experts, who highlighted the potential of the results obtained. The data and dashboards will be made available to the scientific community upon request.
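As a rough illustration of the kind of text preparation mentioned above, the following minimal Python sketch cleans and tokenizes a short Portuguese post before any topic modelling or sentiment analysis. It is not the authors' pipeline: the function name `preprocess`, the regular expressions, and the tiny stop-word list are all assumptions made for this example, and a real setup would rely on a complete Portuguese stop-word list.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would use a complete Portuguese list.
STOPWORDS_PT = {
    "a", "o", "as", "os", "de", "do", "da", "em", "e",
    "que", "para", "com", "um", "uma", "no", "na",
}

URL_RE = re.compile(r"https?://\S+")   # web links
MENTION_RE = re.compile(r"@\w+")       # @user mentions
TOKEN_RE = re.compile(r"[^\W\d_]+")    # runs of letters, accented ones included


def preprocess(text):
    """Lowercase, drop URLs and @mentions, tokenize, and remove stop words."""
    # NFC keeps accented letters such as "á" as single code points.
    text = unicodedata.normalize("NFC", text).lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOPWORDS_PT]


if __name__ == "__main__":
    print(preprocess("O uso de máscaras em Lisboa! https://t.co/abc @dgsaude"))
```

Keeping accented characters intact, rather than stripping them, is one way to preserve distinctions between Portuguese words; whether that is the right choice depends on the lexicons and models used downstream.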



Author Biographies

Carolina Ferraz Marreiros, Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, 1649-026 Lisboa, Portugal

Carolina Ferraz Marreiros received her master’s degree from the Integrated Decision Support Systems Department, University Institute of Lisbon (ISCTE), Lisbon, Portugal. She is currently working in the area of data analytics and artificial intelligence. Her areas of interest include artificial intelligence, Neuro-Linguistic Programming and business intelligence.

João Bone, Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, 1649-026 Lisboa, Portugal, Select Data, Anaheim, CA, USA

João Boné is an NLP developer and researcher working for Select Data, a prominent company in the American healthcare industry. He received his master’s degree in Integrated Business Intelligence Systems from ISCTE-Instituto Universitário de Lisboa, and his interests include problem-driven solutions related to data analysis and machine learning applications, mostly connected to NLP. He is often drawn to industry and society-related challenges.

Joao C. Ferreira, Inov Inesc Inovação—Instituto de Novas Tecnologias, 1000-029 Lisbon, Portugal, Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, 1649-026 Lisboa, Portugal

João C. Ferreira (PhD) is an Assistant Professor with habilitation at ISCTE-IUL. His research interests include data science, text mining, IoT, AI, and AI applications in health, energy, and transportation, including electric vehicles and Intelligent Transportation Systems (ITS). He has authored more than 250 papers in computer science, carried out more than 30 projects (6 as PI), and completed more than 180 scientific paper reviews and more than 25 scientific project evaluations. He was IEEE CIS Chair from 2016 to 2018 and is currently vice chair of IEEE Blockchain PT, the CIS PT chapter, and Brussels AI and robotics. He was the main organizer of international conferences such as OAIR 2013 and INTSYS from 2018 to 2022, and he has been an IEEE senior member since 2015. He is a guest editor and topic editor for MDPI in the topics of Energies, Electronics and Sensors. He was President of the IEEE CIS in PT (2017–2018) and is the author of a patent on edge computing in a monitoring system for fishing vessels.

Ricardo Ribeiro, INESC-ID: INESC-ID Lisboa, Portugal, Iscte – Instituto Universitário de Lisboa, Portugal

Ricardo Ribeiro (PhD) is an Associate Professor at Iscte – Instituto Universitário de Lisboa, where he is the coordinator of the Artificial Intelligence scientific area, and an integrated researcher at INESC-ID Lisboa, working on Human Language Technologies. His current research interests focus on high-level information extraction from unrestricted text, speech, or music, and on improving machine-learning techniques using domain-related information. He has participated in several European and nationally funded projects, was the Human Language Technologies INESC-ID team coordinator in the European-funded RAGE project (2015–2019), and was the principal investigator of a project on information extraction from text funded by the Ministry of National Defence. He has participated in several scientific events, either as organiser or as a member of the program committee (IJCAI, ICASSP, LREC, Interspeech), and was the editor of a book on the computational processing of Portuguese.

