• WEN MA School of Computer Engineering and Science, Shanghai University, Shanghai, China
• XIANGFENG LUO School of Computer Engineering and Science, Shanghai University, Shanghai, China
• JUNYU XUAN Faculty of Engineering and Information Technology, University of Technology, Sydney (UTS), Australia
• RUIRONG XUE School of Computer Engineering and Science, Shanghai University, Shanghai, China
• YIKE GUO Department of Computing, Imperial College London, London, UK


Patent topic discovery, Latent Dirichlet Allocation, Backbone Association Link Network, Domain knowledge


Patent topic discovery is critical for innovation-oriented enterprises that want to hedge patent application risks and raise the success rate of their applications. Researchers in both academia and industry commonly regard topic models as an efficient tool for this task. However, many well-known topic models, e.g., Latent Dirichlet Allocation (LDA), are designed for documents represented as word vectors and exhibit low accuracy and poor interpretability on the patent topic discovery task. There are two reasons: 1) the semantics of documents within a specific domain remain under-explored, and 2) domain background knowledge is not successfully used to guide the topic discovery process. To improve accuracy and interpretability, we propose a new patent representation and organization that adds inter-word relationships mined from the title, abstract, and claims of patents. This representation endows each patent with more semantics than a word vector. In addition, we build a Backbone Association Link Network (Backbone ALN) that incorporates domain background semantics to further enrich the semantics of patents. Using these semantic-rich patent representations, we propose a Semantic LDA model to discover semantic topics from patents within a specific domain. The model describes each topic with association relations between words rather than with a vector of isolated words. Finally, the accuracy and interpretability of the proposed model are verified on real-world patent datasets from the United States Patent and Trademark Office. The experimental results show that the Semantic LDA model performs better than conventional models such as LDA. Furthermore, the proposed model can be readily generalized to other related text mining corpora.
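To make the idea of a representation richer than a word vector concrete, the following is a minimal sketch in Python. It is not the authors' method: the abstract does not state how inter-word relationships are mined, so sentence-level co-occurrence is used here purely as an assumption. The names `tokenize`, `association_pairs`, `semantic_representation`, `STOPWORDS`, and `min_count`, as well as the sample patent text, are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "with", "of", "and"}
FIELDS = ("title", "abstract", "claim")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def association_pairs(patent):
    """Count unordered word pairs that co-occur in the same sentence of any field."""
    pairs = Counter()
    for field in FIELDS:
        for sentence in re.split(r"[.;]", patent.get(field, "")):
            words = sorted(set(tokenize(sentence)))
            pairs.update(combinations(words, 2))
    return pairs


def semantic_representation(patent, min_count=2):
    """Return (word counts, association links kept when seen at least min_count times)."""
    words = Counter()
    for field in FIELDS:
        words.update(tokenize(patent.get(field, "")))
    links = {p: c for p, c in association_pairs(patent).items() if c >= min_count}
    return words, links


# Made-up example patent, for illustration only.
patent = {
    "title": "Lithium battery electrode",
    "abstract": "A lithium battery with a coated electrode. "
                "The electrode improves battery life.",
    "claim": "A battery comprising a coated electrode.",
}
words, links = semantic_representation(patent)
print(links)
```

The word counts alone correspond to the usual bag-of-words input of LDA; the `links` dictionary adds weighted word-to-word associations on top of it, which is the kind of extra structure a semantic representation would carry. A domain-level network such as the Backbone ALN would presumably be built from many patents rather than one, which this sketch does not attempt.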



