Machine Learning Modeling: A New Way to Do Quantitative Research in the Social Sciences in the Era of AI
Advances in big data and machine learning algorithms have driven a new breakthrough in AI technologies and opened a new opportunity for quantitative research in the social sciences. Traditional quantitative models rely heavily on theoretical hypotheses and statistical inference but often fail to address the problem of overfitting, which makes research results less generalizable and has led societal predictions in the social sciences to be dismissed when they could have been meaningful. Machine learning models that use cross-validation and regularization can effectively mitigate overfitting, lending support to the societal predictions built on them. This paper first discusses the sources and internal mechanisms of overfitting, then introduces machine learning modeling through its high-level ideas, goals, and concrete methods. Finally, we discuss the shortcomings and limiting factors of machine learning models. We believe that using machine learning in social science research is an opportunity rather than a threat; researchers should adopt an objective attitude and learn to combine traditional and new methods according to the needs of their research.
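To make the abstract's central claim concrete, the following is a minimal sketch (not from the paper itself) of how regularization combined with cross-validation curbs overfitting. It fits ridge regression, a standard regularized linear model, on a deliberately overfitting-prone setting with few observations and many predictors, and selects the penalty strength by k-fold cross-validation; all names and the simulated data are illustrative assumptions, using only NumPy.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X'X + lam*I)^(-1) X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Mean squared prediction error of ridge(lam), averaged over k folds."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # hold one fold out, fit on the rest
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(errs))

# Simulated data: 60 observations, 30 predictors, only 3 truly relevant --
# a setting where an unregularized model would fit noise.
rng = np.random.default_rng(0)
n, d = 60, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=n)

# Choose the regularization strength by out-of-sample error, not in-sample fit.
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best = min(scores, key=scores.get)
print(f"selected lambda = {best}")
```

The key design point mirrors the abstract's argument: the penalty is chosen by held-out prediction error rather than by how well the model fits the data it was trained on, which is precisely the discipline traditional in-sample fitting lacks.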