Machine Learning Modeling: A New Way to do Quantitative Research in Social Sciences in the Era of AI

  • Jiaxing Zhang The Institute of Social Development Studies, Wuhan University, China; Shenzhen Qianhai Siwei Innovation Technology Ltd. Co., Shenzhen, China
  • Shuaishuai Feng School of Sociology, Wuhan University, Wuhan, Hubei, China
Keywords: Era of Artificial Intelligence, Machine Learning Modeling, Overfitting, Prediction Studies


Advances in big data and machine learning algorithms have brought AI technologies to a new breakthrough and created a new opportunity for quantitative research in the social sciences. Traditional quantitative models rely heavily on theoretical hypotheses and statistics but largely overlook the problem of overfitting. As a result, their findings generalize poorly, and social prediction, which should be a meaningful goal of the social sciences, has been neglected. Machine learning models that use cross-validation and regularization can effectively address overfitting and thereby support social predictions based on these models. This paper first discusses the sources and internal mechanisms of overfitting, and then introduces machine learning modeling through its high-level ideas, goals, and concrete methods. Finally, we discuss the shortcomings and limiting factors of machine learning models. We believe that machine learning is an opportunity for social science research, not a threat. Researchers should take an objective attitude and combine traditional methods with new methods according to their research needs.
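To make the role of cross-validation and regularization concrete, the following is a minimal sketch, not the authors' own procedure. It assumes a single predictor with no intercept, simulated data, and a ridge (L2) penalty chosen by k-fold cross-validation; the function names `ridge_slope`, `k_fold_indices`, `cv_error`, and `select_lambda` are hypothetical and introduced only for illustration.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for one predictor without an intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size.

    Assumes the rows are already in arbitrary order; shuffle them first
    if the data are sorted.
    """
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_error(xs, ys, lam, k=5):
    """Mean squared prediction error over the held-out folds."""
    n = len(xs)
    total = 0.0
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        slope = ridge_slope(train_x, train_y, lam)
        total += sum((ys[i] - slope * xs[i]) ** 2 for i in fold)
    return total / n


def select_lambda(xs, ys, grid, k=5):
    """Pick the penalty in the grid with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(xs, ys, lam, k))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    xs = [rng.gauss(0, 1) for _ in range(40)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]  # simulated, not real data
    grid = [0.0, 0.1, 1.0, 10.0, 100.0]
    print("selected penalty:", select_lambda(xs, ys, grid))
```

The point of the sketch is that the penalty is judged by error on data the model did not see during fitting, rather than by in-sample fit, which is the property the abstract relies on when it links these techniques to generalizable prediction.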



Author Biographies

Jiaxing Zhang, The Institute of Social Development Studies, Wuhan University, China; Shenzhen Qianhai Siwei Innovation Technology Ltd. Co., Shenzhen, China

Jiaxing Zhang is a research member of the Institute of Social Development Studies, Wuhan University, China. She majored in Big Data Mining and Analysis. She is also the chairman of Shenzhen Qianhai Siwei Innovation Technology Ltd. Co., Shenzhen, China, where she focuses on data mining, big data analysis, blockchain, and computational social science research.

She attended Wuhan University, where she received her B.Sc. in Software Engineering in 2009. Jiaxing Zhang then went on to pursue an M.Sc. in Software Engineering at Wuhan University, China, in 2011. After that, she received an M.Sc. in Digital Media from Wuhan University, China, in 2013.

Jiaxing Zhang has held senior solution and software engineering positions in Shenzhen since 2014. She has received awards from other research institutes in her research areas. Her Ph.D. work centers on blockchain technology and social governance.

Shuaishuai Feng, School of Sociology, Wuhan University, Wuhan, Hubei, China

Shuaishuai Feng is a PhD candidate in sociology at Wuhan University. He received his bachelor’s degree and master’s degree in sociology from Northwest A&F University and Wuhan University respectively. His current focus is on computational social science research.



Advanced Practice in Web Engineering