Machine Learning-Based Approach for Fake News Detection

H. L. Gururaj1,*, H. Lakshmi2, B. C. Soundarya3,*, Francesco Flammini4 and V. Janhavi5

1Department of Information Technology, Manipal Institute of Technology Bengaluru, Manipal Academy Of Higher Education, Manipal, India
2Department of Information Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, India
3Department of Artificial Intelligence and Machine Learning, Alva’s Institute of Engineering and Technology, Mangalore, India
4IDSIA USI-SUPSI, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, CH
5Department of Computer Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, India
E-mail: gururaj.hl@manipal.edu, soundarya@aiet.org.in
*Corresponding Author

Received 01 January 2022; Accepted 23 September 2022; Publication 02 December 2022

Abstract

In the modern era, the internet is found everywhere, and the rapid adoption of social media has led to a spread of information never before seen in human history. On social media platforms, consumers create and share ever more information, much of it misleading and with no relevance to reality. Automatically classifying a text article as misinformation is a challenging task. This work addresses how automated classification of text articles can be done. We use a machine learning approach for the classification of news articles. Our study explores different textual properties that can be used to distinguish fake content from real content. Using those properties, we train models with different machine learning algorithms and evaluate their performance. The classifier with the best performance is then used to build the classification model, which predicts the reliability of the news articles present in the dataset.

Keywords: Fake news, machine learning, classification.

1 Introduction

Social media and the internet have made access to information far simpler and more convenient. As much of our lives is spent interacting online via social media platforms, we tend to seek out and consume news from web-based platforms rather than traditional news organizations, because it is easy to share and discuss the information with friends or other readers. Despite the advantages offered by social media, the standard and quality of its stories are lower than those of traditional news organizations. News outlets have benefited from the widespread use of social media platforms by giving near-real-time updates to their subscribers. However, news articles with deliberately false facts are produced online for the purpose of financial or political gain. The spread of fake news can negatively impact individuals and society, and it can break the authenticity equilibrium of the news ecosystem. Fake news is typically manipulative, and it changes the way people interpret and respond to real news. Spammers mainly use fake news to generate advertising revenue through click-baits.

Fake news is one of the greatest threats to commerce, journalism, and democracy all over the world, with huge collateral damage. A US $130 billion loss in the stock market was the direct result of a fake news report that US President Barack Obama had been injured in an explosion [1]. Other fake news campaigns that demonstrate the enormous impact fake news can have include the sudden shortage of salt in Chinese supermarkets after a false report that iodized salt would help counteract the effects of radiation after the Fukushima nuclear leak in Japan [2], and an escalation of tensions between India and Pakistan that began with false reporting around the Balakot strike and resulted in the deaths of military personnel and the loss of expensive military equipment.

Social media has a big impact on society, and some people take advantage of this fact. They generate news articles that are not real, or possibly fake. Some websites deliberately produce fake news, posting half-truths, hoaxes, and disinformation that claim to be real information, and they use social networks to drive traffic to their sites. The essential aim of fake news websites is to influence public opinion on certain topics.

2 Literature Survey

Mykhailo Granik et al. in their paper proposed a very simple technique for fake news detection using a naive Bayes classifier. It was implemented as a software system and tested against a dataset of Facebook news posts. The posts were collected from three large Facebook pages each from the right and the left, as well as three large mainstream political news pages. They achieved a classification accuracy of approximately 80%; classification accuracy for fake news is somewhat worse, which may be caused by the skewness of the dataset, in which only 4.7% of the items are fake news.

Himank Gupta presented a framework based on several machine learning approaches that addresses various problems, including a lack of accuracy and the time needed to handle thousands of tweets within a few seconds. First, they collected 400,000 tweets from the HSpam14 dataset, of which they characterized 150,000 tweets as spam and 250,000 as non-spam. They also derived a collection of lightweight features, together with the Top-30 words providing the highest information gain from the Bag-of-Words model. They were able to reach an accuracy of around 92%, surpassing the existing solution by approximately 18%.

Marco L. Della Vedova et al. first proposed a machine learning fake news detection method that combines news content and social context features, outperforming existing methods in the literature and increasing accuracy up to 79.8%. Second, they implemented their method within a Facebook Messenger chatbot and validated it in a real-world application, obtaining a fake news detection accuracy of around 82%. Their goal was to classify a post as reliable or fake: they first described the dataset used for their test, then presented the content-based approach they implemented and the method they proposed for combining it with a social-based approach available in the literature. The resulting dataset consists of 15,600 posts, coming from 33 pages with more than 220,000 likes by over 800,000 users, comprising 8,922 hoax posts and 6,578 non-hoax posts.

Cody Buntain et al. developed a method for automating fake news detection on platforms like Twitter by learning to predict accuracy assessments from two credibility-focused, journalism-based ratings of accuracy. The method is applied to Twitter content sourced from BuzzFeed's fake news dataset. A feature analysis identifies the features most predictive of crowdsourced and journalistic accuracy assessments, and the results are consistent with prior work. Their method relies on identifying highly retweeted threads of conversation and uses the features of these threads to classify stories, which limits the work's applicability to the popular subset of tweets. Since the majority of tweets are rarely retweeted, this technique can therefore be used only on a minor share of Twitter conversation threads.

Shivam B. Parikh et al. present an overview of how news stories are characterized in the modern diaspora, along with the distinguishable content types of articles and their impact on readers. They then survey existing fake news detection approaches, which are heavily based on text analysis, and describe popular fake news datasets. They conclude the paper by identifying five key open research challenges that can guide future research. It is a theoretical approach that provides an outline of fake news detection while analyzing factors such as psychological ones.

One of the earliest works on the automated detection of fake news was by Vlachos and Riedel. The authors defined the task of fact-checking, gathered a dataset from two popular fact-checking websites, and considered k-Nearest Neighbors classifiers for handling fact-checking as a classification task. Wang (2017) released the LIAR dataset, which contains 12.8K manually labeled brief statements from PolitiFact.

Table 1 Techniques and datasets that have been used for fake news detection

Work                  Year   Detection                               Model
Wang                  2016   Fake news, 6 levels, majority           LR, SVM, Bi-LSTM, CNN
Ma et al.             2016   Rumors, 2 levels                        RF, DT, SVM, RNN
Ruchansky             2017   –                                       LSTM
Popat                 2018   Credibility, 2 or 5 levels              Bi-LSTM, LSTM, CNN
Buntain and Golbeck   2017   Credibility                             RF
Yang et al.           2018   Fake news, 2 classes                    TI-CNN, LSTM, RNN
Karimi et al.         2019   Fake news, 2 classes                    N-grams, LIWC, RST, BiGRNN-CNN, LSTM, HDSF
Ahmed et al.          2018   Fake news & reviews                     SVM, SGD, LR, KNN, DT
Zhou et al.           2020   Fake news, clickbait, disinformation    SVM, RF, XGB, LR, NB
Pamungkas et al.      2016   Stance                                  LR

Table 1 summarizes the techniques and datasets that have been used by various researchers for fake news detection.

3 Methodology

This section presents the fundamental theoretical information about each component and aspect of the project. It gives a picture of the simulation implementation and modeling, statistical analysis, software implementation, and calculations performed.

3.1 System Design

The system is developed in two parts, Static Implementation and Dynamic Implementation, as shown in Figure 1. The first component, the static implementation, is based on machine learning algorithms. Here, we extract features from the already pre-processed dataset. The extracted features are fed into four different classifiers: Logistic Regression, Random Forest, Support Vector Machine, and the Passive-Aggressive Classifier. After fitting the models, we compare their accuracies; model performance is determined with the help of a confusion matrix. The second component, the dynamic implementation, takes a keyword or text from the news articles in the dataset as user input.
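A minimal sketch of the static pipeline in Python follows, assuming a pandas DataFrame with "text" and "label" columns; the column names, split ratio, and classifier settings are illustrative assumptions, not the paper's exact configuration.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def compare_classifiers(df):
    # Hold out part of the labeled articles for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42)

    # Extract TF-IDF features from the pre-processed text.
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Fit the four classifiers and report test accuracy for each.
    classifiers = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100),
        "SVM": LinearSVC(),
        "Passive-Aggressive": PassiveAggressiveClassifier(max_iter=50),
    }
    for name, clf in classifiers.items():
        clf.fit(X_train_vec, y_train)
        print(name, accuracy_score(y_test, clf.predict(X_test_vec)))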


Figure 1 System design.

3.2 Algorithms Used for Classification

3.2.1 Random forest

Random Forest is the name for an ensemble of decision trees. In a random forest we have a collection of decision trees: to classify a new object based on its attributes, every tree gives a classification, and we say the tree "votes" for that class. The random forest picks the classification having the most votes. Random forest is thus a classification algorithm consisting of many decision trees. It uses feature randomness and bagging when building each individual tree, creating an uncorrelated forest of trees whose committee prediction is more accurate than that of any single tree. Each individual tree in the forest spits out a class prediction, and the class with the most votes becomes our model's prediction. The reason this works so well is that a large number of relatively uncorrelated models operating as a committee can outperform any of the individual constituent models. To ensure that the behavior of each individual tree is not too strongly correlated with the behavior of the other trees in the model, random forest uses the two following methods (a code sketch follows them):

Bagging

Decision trees are very sensitive to the data they are trained on: small modifications to the training set can result in significantly different tree structures. Random forest exploits this by allowing each individual tree to randomly sample from the dataset with replacement, which leads to different trees. This process is called bagging (bootstrap aggregating).

Feature randomness

In an ordinary decision tree, when it is time to split a node, we consider every possible feature and select the one that produces the greatest separation between the observations in the left node and those in the right node. In contrast, each tree in a random forest can pick only from a random subset of the features.
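The following is a minimal illustration of how these two mechanisms map onto scikit-learn's RandomForestClassifier; the parameter values are illustrative assumptions, not the paper's settings.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of individual trees that vote
    bootstrap=True,       # bagging: each tree trains on a random sample drawn with replacement
    max_features="sqrt",  # feature randomness: each split considers only a random subset of features
    random_state=42,
)
# forest.fit(X_train_vec, y_train) trains the ensemble; forest.predict(...)
# then returns, for each article, the class receiving the most tree votes.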

3.3 Logistic Regression

Logistic regression is a classification algorithm, not a regression algorithm. It is an effective method for binary classification problems. In simple words, it predicts the probability of an event occurring by fitting the data to a function called the logit function, which is why it is also called logit regression. It predicts the output and its probability, which lies in the range of 0 to 1. It is named for the function used at its core, the logistic function, also called the sigmoid function. This is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, without ever reaching those limits exactly:

sigmoid(x) = 1 / (1 + e^(-x))

where e is the base of the natural logarithm.

As the aim of our model is to simply classify the news article as true/false, logistic regression is a good choice.

1. We first create an object of Logistic Regression.

2. The logistic regression classifier is trained by passing the news article training set into the fit function. After it is trained, predictions are made on the test set using the predict function.

3. Accuracy is calculated to understand the performance of the classifier.
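These three steps might look as follows in scikit-learn; the variable names X_train_vec, X_test_vec, y_train, and y_test are assumed to come from the vectorization and train/test split sketched earlier.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(max_iter=1000)       # 1. create the Logistic Regression object
lr.fit(X_train_vec, y_train)                 # 2. train on the news article training set
predictions = lr.predict(X_test_vec)         #    ... then predict on the test set
print(accuracy_score(y_test, predictions))   # 3. measure the classifier's accuracy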

3.4 Passive-Aggressive Classifier

The Passive-Aggressive Classifier is an online learning algorithm, which makes it well suited for classifying huge data streams. It is very simple to implement and much faster than batch methods: it learns from each example and then discards it. An algorithm of this kind remains passive for correctly predicted outcomes and turns aggressive for incorrect outcomes, updating the model and making the required adjustments. Unlike other algorithms, it does not converge. These algorithms are called passive-aggressive for the following reasons:

PASSIVE: If it is the correct prediction, then it will keep the model the same and will not make any changes.

AGGRESSIVE: If it is an incorrect prediction, then it will make changes to the model.

INPUT: aggressiveness parameter C > 0
INITIALIZE: w_1 = (0, ..., 0)
For t = 1, 2, ...

• Receive instance: x_t ∈ R^n

• Predict: ŷ_t = sign(⟨w_t, x_t⟩)

• Receive correct label: y_t ∈ {-1, +1}

• Suffer loss: l_t = max{0, 1 - y_t⟨w_t, x_t⟩}

• Update: w_{t+1} = w_t + τ_t y_t x_t, with step size τ_t = min{C, l_t / ||x_t||^2} (the PA-I variant)
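As a concrete illustration, below is a from-scratch sketch of a single PA-I update in Python/NumPy, following the standard Crammer et al. formulation rather than any code from this paper.

import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I step for an instance x with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss l_t
    if loss == 0.0:
        return w                             # passive: prediction correct, keep the model
    tau = min(C, loss / np.dot(x, x))        # step size, capped by aggressiveness C
    return w + tau * y * x                   # aggressive: shift the weights to fix the error

w = np.zeros(3)
w = pa_update(w, np.array([1.0, 0.5, -0.2]), +1)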

3.5 Support Vector Machine

This is a supervised machine learning model that uses classification algorithms for two-group classification problems. After being given sets of labeled training data for each class, an SVM model is able to categorize new text. It uses the concept of a hyperplane to separate the two classes: the aim of SVM is to divide the dataset into classes by finding the maximum marginal hyperplane. SVM uses a technique called the kernel trick to transform the data, and then, based on these transformations, it finds an optimal boundary between the possible outputs.

Given a set of n features, the SVM algorithm uses n-dimensional space to plot the data item with the coordinates representing the value of each feature. The hyperplane obtained to separate the two classes is used for classifying the data.

SVM Pseudocode

F[0 ... N-1]: a feature set of N features, sorted by information gain in decreasing order
accuracy(i): accuracy of the prediction model based on SVM trained with the feature set F[0 ... i]
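A minimal SVM training sketch with scikit-learn is given below; the linear kernel and the variable names carried over from the earlier sketches are assumptions, since the paper does not state its exact settings.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel="linear")    # find the maximum-margin hyperplane between the two classes
svm.fit(X_train_vec, y_train)
print(accuracy_score(y_test, svm.predict(X_test_vec)))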

The flowchart given in Figure 2 starts with the collection of the dataset. The dataset is pre-processed and then subjected to feature selection. Four different machine learning algorithms are used to train the model, and a confusion matrix is used to calculate performance and accuracy. The results from all classifiers are compared, and the classifier that gives the best accuracy on the given dataset is used to build the classification model that predicts the reliability of the news.

4 Implementation

The implementation part follows the framework and system design described above, specifying the system down to the finest level of detail, including the code level. This section covers the realization of the ideas developed earlier.

4.1 Data Collection

Online news is collected from different types of sources, such as press agencies, search engines, and social media websites. The dataset used in our project is a simple and realistic dataset that contains 6,335 news articles, each classified as Fake or Real, stored in a CSV file.

The attributes of the dataset are:

Id: A unique identifier for the article

Title: Article Headline

Text: Article textual content

Class Label: Fake and Real
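As an illustration, the dataset might be loaded as follows; the file name news.csv and the column name label are assumptions, since the paper only states that the articles are stored in a CSV file.

import pandas as pd

df = pd.read_csv("news.csv")           # columns: id, title, text, label
print(df.shape)                        # should report 6335 articles per the paper
print(df["label"].value_counts())      # distribution of Fake vs. Real labels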


Figure 3 Pie chart derived from the dataset.

4.2 Data Pre-processing

Data taken from social media is mostly unstructured, and most of it is informal communication with shortcuts, slang, and bad grammar. To increase performance and reliability, we have to pre-process the data before using it in a predictive model.

4.3 Data Cleaning

The data may be in either a structured or an unstructured format. A structured format has well-defined patterns, while unstructured data has no proper structure and is harder to process than structured or semi-structured data. The text data has to be cleaned to highlight the attributes we want our machine learning system to work on.

It comprises a few steps (a code sketch of them follows the list):

(1) Punctuation Removal: Punctuation gives grammatical structure to a sentence, which enhances our understanding. However, the vectorizer only counts words and cannot use this context, so punctuation adds no value; hence we remove all special characters. Example: "How are you doing?" becomes "How are you doing".

(2) Tokenization: This divides the text into units, such as splitting sentences into words; in this way the unstructured text is given a structure.

Example: "work at the place" is split into "work", "at", "the", "place".

(3) Removing Stop Words: Stop words are common words that appear in almost any text. Because they convey little information, we remove them. Example: "copper or aluminum is okay for me" -> copper, aluminum, okay.

(4) Stemming: This process reduces a word to its stem form, so that related word forms are treated alike. Suffixes like "er", "ible", "ness", etc. are removed with a rule-based approach.
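The four steps might be implemented as in the following sketch; the use of regex and NLTK is an assumption, as the paper does not name the cleaning tools it used.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# requires: nltk.download("punkt"); nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean(text):
    text = re.sub(r"[^\w\s]", "", text)                  # (1) remove punctuation
    tokens = word_tokenize(text.lower())                 # (2) tokenize into words
    tokens = [t for t in tokens if t not in stop_words]  # (3) drop stop words
    return [stemmer.stem(t) for t in tokens]             # (4) stem each word

print(clean("Copper or aluminum is okay for me"))        # roughly: ['copper', 'aluminum', 'okay']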

After data cleaning is done, we perform exploratory data analysis to improve the statistical understanding of the given dataset.

4.4 Feature Generation

Many features, such as word count, the frequency of distinctive words, and the frequency of long words, can be generated from text data. Feature generation creates representations of words that capture their meanings, relationships, and the contexts in which they are used, so that the computer can understand the given text and classify it. To make the text understandable to machine learning algorithms, it is vectorized: the text is encoded as numerical integer vectors.

• Count Vectorizer records which words are present in the text data. The result is 1 if a word is present in the sentence and 0 otherwise. A bag of words is created for each text document.

• TF-IDF calculates the relative frequency (number of occurrences) of a word in a document, compared with its frequency across all documents. TF stands for Term Frequency and computes how often a term appears within a document; IDF stands for Inverse Document Frequency.

TF-IDF is applied to the article body to store the relative count of each word in the document matrix:

TF(t, d) = (number of times t occurs in document d) / (total word count of document d)
IDF(t) = log(total number of documents / number of documents with term t in it)
TF-IDF(t, d) = TF(t, d) × IDF(t)

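A small sketch contrasting the two vectorizers with scikit-learn; the toy documents are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fake news spreads fast", "real news spreads slowly"]

bow = CountVectorizer()                    # bag of words: raw term counts per document
print(bow.fit_transform(docs).toarray())

tfidf = TfidfVectorizer()                  # TF-IDF: counts re-weighted by document rarity
print(tfidf.fit_transform(docs).toarray())
print(tfidf.get_feature_names_out())       # vocabulary learned from the documents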

4.5 Training the Model

Training of the model is done with four different machine learning classifiers, namely logistic regression, random forest, support vector machine, and the passive-aggressive classifier, after the features are extracted from the pre-processed dataset. The passive-aggressive classifier performs best, so it is selected as the final model. It is then stored on disk, from where it is used for fake news classification: it takes an article from the dataset as user input and predicts the reliability of the news.
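Below is a sketch of this persistence step and the dynamic prediction it enables; joblib, the file names, and the variables best_clf and vectorizer (carried over from the training sketch) are assumptions, not the paper's exact code.

import joblib

# After comparing accuracies, persist the winning classifier and its vectorizer.
joblib.dump(best_clf, "model.joblib")
joblib.dump(vectorizer, "vectorizer.joblib")

def predict_reliability(article_text):
    clf = joblib.load("model.joblib")
    vec = joblib.load("vectorizer.joblib")
    # Vectorize the user-supplied article and return the predicted label.
    return clf.predict(vec.transform([article_text]))[0]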

4.6 Model Evaluation

After fitting the models, we evaluate the performance of each one with the help of a confusion matrix. After comparing the accuracies of the four classifiers, the best-performing classifier is taken as the classification model for news detection. Most approaches treat this as a classification problem that estimates whether the information is real or fake:

True Positive (TP): fake news that is correctly classified as fake news.

True Negative (TN): true news that is correctly classified as true news.

False Negative (FN): fake news that is incorrectly classified as true news.

False Positive (FP): true news that is incorrectly classified as fake news.

Table 2 Confusion matrix

                    Class 1 (Predicted)   Class 2 (Predicted)
Class 1 (Actual)    TP                    FN
Class 2 (Actual)    FP                    TN



Figure 5 Confusion matrix obtained from random forest algorithm.


Figure 6 Confusion matrix obtained from logistic regression.


Figure 7 Confusion matrix obtained from passive-aggressive classification.


Figure 8 Graph obtained from all the algorithms based on accuracy versus models.

Table 3 Accuracy of classifiers

Model Accuracy
Random Forest Classifier 0.90
Logistic Regression 0.92
Passive Aggressive Classifier 0.94
Support Vector Machine 0.93


Figure 9 Predicting real news.


Figure 10 Predicting fake news as fake.

Confusion matrix: a table that summarizes the performance of a classifier on a set of test data for which the true values are known; it is used to visualize algorithm performance. The numbers of correct and incorrect predictions are tallied and broken down by class. A confusion matrix thus summarizes the prediction results of a classification model and shows the ways in which the model gets confused when it makes predictions. It gives insight not only into the errors made by the classifier, but more importantly into the types of errors being made.

Formulas for Precision, Recall, F1 score, accuracy:

1. Precision = TP/(TP + FP)

2. Recall = TP /(TP + FN)

3. F1 Score = 2 *((precision*recall) / (precision + recall))

4. Accuracy = (TP + TN) / (TP + TN + FP + FN)

These are the metrics used in machine learning which enable us to evaluate the performance of a classifier model.
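These metrics can be computed directly with scikit-learn, as in the following sketch with toy labels (treating "FAKE" as the positive class); the labels are illustrative, not the paper's data.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = ["FAKE", "REAL", "FAKE", "FAKE", "REAL"]   # actual labels
y_pred = ["FAKE", "REAL", "REAL", "FAKE", "REAL"]   # classifier output

print(confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"]))
print(precision_score(y_true, y_pred, pos_label="FAKE"))   # TP / (TP + FP) = 1.0
print(recall_score(y_true, y_pred, pos_label="FAKE"))      # TP / (TP + FN) ≈ 0.67
print(f1_score(y_true, y_pred, pos_label="FAKE"))          # harmonic mean = 0.8
print(accuracy_score(y_true, y_pred))                      # (TP + TN) / total = 0.8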

5 Result and Discussion

This section demonstrates the working of the system. It includes a comprehensible summary of the results of all critical tests that were carried out.

Figure 5 shows the confusion matrix of predicted labels versus true labels obtained from the random forest algorithm. The true label takes two values, True and Fake.

Figure 6 shows the confusion matrix of predicted labels versus true labels obtained from the logistic regression algorithm.

5.1 Snapshots of the System Working

Figure 9 shows that when a text is entered, the system classifies it as true or false; in this case, the text entered by the user is classified as Real/True. Likewise, Figure 10 shows a text that is classified as Fake.

6 Conclusion

Fake news detection is a research area with a lot of scope and large available datasets. Our model is run against an existing dataset. From Table 3, we conclude that the passive-aggressive algorithm shows the maximum accuracy, up to 94%. Using this classifier, we therefore built our classification model for fake news detection. The user can enter text or a keyword on the web page and check the reliability of the news.

In future work, we look forward to building a dataset of our own that will be up to date and contain all the relevant news. All the latest data and live news will be updated in the database, and the subsequent step is to retrain the model and analyze how the accuracy changes with the new data in order to improve it further.

References

[1] Iftikhar Ahmad et al. “Fake news detection using machine learning ensemble methods”. In: Complexity 2020 (2020).

[2] Monther Aldwairi and Ali Alwahedi. “Detecting fake news in social media networks”. In: Procedia Computer Science 141 (2018), pp. 215–222.

[3] Cody Buntain and Jennifer Golbeck. “Automatically identifying fake news in popular twitter threads”. In: 2017 IEEE International Conference on Smart Cloud (SmartCloud). IEEE. 2017, pp. 208–215.

[4] Nadia K Conroy, Victoria L Rubin, and Yimin Chen. “Automatic deception detection: Methods for finding fake news”. In: Proceedings of the association for information science and technology 52.1 (2015), pp. 1–4.

[5] Marco L Della Vedova et al. “Automatic online fake news detection combining content and social signals”. In: 2018 22nd Conference of Open Innovations Association (FRUCT). IEEE. 2018, pp. 272–279.

[6] Mykhailo Granik and Volodymyr Mesyura. “Fake news detection using naive Bayes classifier”. In: 2017 IEEE first Ukraine conference on electrical and computer engineering (UKRCON). IEEE. 2017, pp. 900–903.

[7] A Santhosh Kumar et al. “Fake News Detection on Social Media Using Machine Learning”. In: Journal of Physics: Conference Series. Vol. 1916. 1. IOP Publishing. 2021, p. 012235.

[8] Benjamin Markines, Ciro Cattuto, and Filippo Menczer. “Social spam detection”. In: Proceedings of the 5th international workshop on adversarial information retrieval on the web. 2009, pp. 41–48.

[9] Cade Metz. “The bittersweet sweepstakes to build an AI that destroys fake news”. In: Wired.com (2016).

[10] Rada Mihalcea and Carlo Strapparava. “The lie detector: Explorations in the automatic recognition of deceptive language”. In: Proceedings of the ACL-IJCNLP 2009 conference short papers. 2009, pp. 309–312.

[11] Shivam B Parikh and Pradeep K Atrey. “Media-rich fake news detection: A survey”. In: 2018 IEEE conference on multimedia information processing and retrieval (MIPR). IEEE. 2018, pp. 436–441.

[12] Kai Shu et al. “Fake news detection on social media: A data mining perspective”. In: ACM SIGKDD explorations newsletter 19.1 (2017), pp. 22–36.

[13] Kelly Stahl. “Fake news detection in social media”. In: California State University Stanislaus 6 (2018), pp. 4–15.

[14] William Yang Wang. ““Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection”. In: arXiv preprint arXiv:1705.00648 (2017).

[15] Jiawei Zhang, Bowen Dong, and S Yu Philip. “Fake detector: Effective fake news detection with deep diffusive neural network”. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE. 2020, pp. 1826–1829.

Biographies


H. L. Gururaj is currently working as Associate Professor, Department of Information Technology, Manipal Institute of Technology, Manipal Academy of Higher Education (MAHE), Bangalore Campus, India. He received a Ph.D. degree in Computer Science and Engineering from Visvesvaraya Technological University, Belagavi, India, in 2019. He is a professional member of ACM and has been an ACM Distinguished Speaker since 2018. He is the founder of the Wireless Internetworking Group (WiNG), a Senior Member of IEEE, and a lifetime member of ISTE and CSI.


H. Lakshmi is working as an assistant professor in the Department of Information Science and Engineering, Vidyavardhaka College of Engineering, Mysuru. She has published papers in various national and international conferences and journals.


B. C. Soundarya is working as an assistant professor in the Department of Artificial Intelligence and Machine Learning at Alva’s Institute of Engineering and Technology, Mangalore. She is a member of IEEE and a member of ACM-W. She has published research papers in various international journals and conferences.


Francesco Flammini has been, since January 2020, a Full Professor of Computer Science with a focus on Cyber-Physical Systems at Mälardalen University (MDH).

He has been a Senior Lecturer and an Associate Professor (“Docent”) in Computer Science at the Department of Computer Science and Media Technology of Linnaeus University, where he led the Cyber-Physical Systems (CPS) research and education area within the Smarter Systems knowledge environment.

He received his master’s (2003) and doctoral (2006) degrees in Computer Engineering with honors from the University of Naples Federico II, Italy.


V. Janhavi is working as an associate professor in the department of computer science and engineering, Vidyavardhaka College of Engineering, Mysuru. She has published papers in various national and international conferences and journals.
