Extracting Algorithmic Complexity in Scientific Literature for Advance Searching
DOI:
https://doi.org/10.33919/JCAL.23.1.2Keywords:
algorithm search, algorithm complexity, information retrieval, non-textual document elements (NTDE)Abstract
Non-textual document elements such as charts, diagrams, algorithms and tables play an important role to present key information in scientific documents. Recent advances in information retrieval systems tap this information to answer more complex user queries by mining text pertaining to non-textual document elements from full text. Algorithms are critically important in computer science. Researchers are working on existing algorithms to improve them for critical application. Moreover, new algorithms for unsolved and newly faced problems are under development. These enhanced and new algorithms are mostly published in scholarly documents. The complexity of these algorithms is also discussed in the same document by the authors. Complexity of an algorithm is also an important factor for information retrieval (IR) systems. In this paper, we mine the relevant complexities of algorithms from full text document by comparing the metadata of the algorithm, such as caption and function name, with the context of the paragraph in which complexity related discussion is made by the authors. Using the dataset of 256 documents downloaded from CiteSeerX repository, we manually annotate 417 links between algorithms and their complexities. Further, we apply our novel rule-based approach that identifies the desired links with 81% precision, 75% recall, 78% F1-score and 65% accuracy. Overall, our method of identifying the links has potential to improve information retrieval systems that tap the advancements of full text and more specifically non-textual document elements.
References
Al-Zaidy, R. A. and Giles, C. L. (2017). A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents. In: Proceedings of the AAAI Conference on Artificial Intelligence, 31(2), 4644–4649. Avajlable at: https://doi.org/10.1609/aaai.v31i2.19088.
Bajracharya, S., Ossher, J. and Lopes, C. (2009). Sourcerer: An Internet-scale Software Repository. In: Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation. IEEE Computer Society, 1–4.
Baker, J. B., Sexton, A. P., Sorge, V. and Suzuki, M. (2011). Comparing Approaches to Mathematical Document Analysis from PDF. In: International Conference on Document Analysis and Recognition. IEEE, 463–467.
Bhatia, S. and Mitra, P. (2012). Summarizing Figures, Tables, and Algorithms in Scientific Publications to Augment Search Results. ACM Transactions on Information Systems (TOIS), 30(1), 3.
Bhatia, S., Mitra, P. and Giles, C. L. (2010). Finding Algorithms in Scientific Articles. In: Proceedings of the 19th International Conference on World Wide Web. NY: Association for Computing Machinery (ACM), 1061–1062.
Bhatia, S., Tuarob, S., Mitra, P. and Giles, C. L. (2011). An Algorithm Search Engine for Software Developers. In: Proceedings of the 3rd International Workshop on Search-Driven Development: Users, Infrastructure, Tools, and Evaluation. NY: ACM, 13–16.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
Chen, H.-H., Gou, L., Zhang, X. and Giles, C. L. (2011). Collabseer: A Search Engine for Collaboration Discovery. In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries. NY: ACM, 231–240.
Chen, P., Xie, H., Maslov, S. and Redner, S. (2007). Finding Scientific Gems with Google’s PageRank Algorithm. Journal of Informetrics, 1(1), 8–15.
Coüasnon, B. and Lemaitre, A. (2014). Recognition of Tables and Forms. In: Handbook of Document Image Processing and Recognition. London: Springer, 647–677.
Cormen, T. H. (2013). Algorithms Unlocked. The MIT Press.
Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C. (2009). Introduction to Algorithms. 3rd ed. The MIT Press.
Elminaam, D. S., Abdual-Kader, H. M. and Hadhoud, M. M. (2010). Evaluating the Performance of Symmetric Encryption Algorithms. IJ Network Security, 10(3), 216–222.
Hearst, M. A., Divoli, A., Guturu, H., Ksikes, A., Nakov, P., Wooldridge, M. A. and Ye, J. (2007). BioText Search Engine: Beyond Abstract Search. Bioinformatics, 23(16), 2196–2197.
Hirschberg, D. S. (1975). A Linear Space Algorithm for Computing Maximal Common Subsequences. Communications of the ACM, 18(6), 341–343.
Jung, E., Elmallah, E. S. and Gouda, M. G. (2007). Optimal Dispersal of Certificate Chains. IEEE Transactions on Parallel and Distributed Systems, 18(4), 474–484.
Keogh, E., Chu, S., Hart, D. and Pazzani, M. (2001). An Online Algorithm for Segmenting Time Series. In: Proceedings 2001 IEEE International Conference on Data Mining. IEEE, 289–296.
Khabsa, M. and Giles, C. L. (2014). The Number of Scholarly Documents on the Public Web. PloS one, 9(5), e93949. Available at: https://doi.org/10.1371/journal.pone.0093949.
Khabsa, M., Treeratpituk, P. and Giles, C. L. (2012). Ackseer: A Repository and Search Engine for Automatically Extracted Acknowledgments from Digital Libraries. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries. NY: ACM, 185–194. .
Khan, S., Liu, X., Shakil, K. A. and Alam, M. (2017). A Survey on Scholarly Data: From Big Data Perspective. Information Processing & Management, 53(4), 923–944.
Kim, H.-S., Lee, J.-H. and Jeong, Y.-S. (2003). Method for Finding Shortest Path to Destination in Traffic Network Using Dijkstra Algorithm or Floyd-warshall Algorithm. Google Patents.
Kleinberg, J. and Tardos, É. (2009). Algorithm Design. Boston: Pearson/Addison-Wesley.
Kumar, V., Marathe, M. V., Parthasarathy, S. and Srinivasan, A. (2004). End-to-end Packet-scheduling in Wireless Ad-hoc Networks. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 1021–1030.
Lai, S., Xu, L., Liu, K. and Zhao, J. (2015). Recurrent Convolutional Neural Networks for Text Classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), 2267–2273. Avajlable at: https://doi.org/10.1609/aaai.v29i1.9513.
Li, H. (2013). Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM. arXiv.org e-Print arXiv:1303.3997. Available at: https://doi.org/10.48550/arXiv.1303.3997.
Liu, Y., Bai, K., Mitra, P. and Giles, C. L. (2007). Tableseer: Automatic Table Metadata Extraction and Searching in Digital Libraries. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. NY: ACM, 91–100.
Marron, M., Stefanovic, D., Hermenegildo, M. and Kapur, D. (2007). Heap Analysis in the Presence of Collection Libraries. In: Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering. NY: ACM, 31–36.
McMillan, C., Grechanik, M., Poshyvanyk, D., Fu, C. and Xie, Q. (2012). Exemplar: A Source Code Search Engine for Finding Highly Relevant Applications. IEEE Transactions on Software Engineering, 38(5), 1069–1087.
Milidiú, R. L., Laber, E. S. and Pessoa, A. A. (1999). A Work-efficient Parallel Algorithm for Constructing Huffman Codes. In: Data Compression Conference, 1999. Proceedings. DCC’99. IEEE, 277–286.
Nadeem, A. and Javed, M. Y. (2005). A Performance Comparison of Data Encryption Algorithms. In: 2005 First International Conference on Information and Communication Technologies (ICICT). IEEE, 84–89.
Nargesian, F., Zhu, E., Pu, K. Q. and Miller, R. J. (2018). Table Union Search on Open Data. In: Proceedings of the VLDB Endowment, 11(7), 813–825.
Ratliff, N. D. and Bagnell, J. A. (2007). Kernel Conjugate Gradient for Fast Kernel Machines. In: Proceedings of 20th International Joint Conference on Artificial Intelligence (IJCAI '07), 1017–1022.
Safder, I. and Hassan, S.-U. (2018). DS4A: Deep Search System for Algorithms from Full-text Scholarly Big Data. In: International Conference on Data Mining Workshop (ICDMW), 1308–1315.
Safder, I., Hassan, S.-U. and Aljohani, N. R. (2018). AI Cognition in Searching for Relevant Knowledge from Scholarly Big Data, Using a Multi-layer Perceptron and Recurrent Convolutional Neural Network Model. In: Companion of the The Web Conference 2018 on The Web Conference 2018. International World Wide Web Conferences Steering Committee, 251–258.
Safder, I., Sarfraz, J., Hassan, S.-U., Ali, M. and Tuarob, S. (2017). Detecting Target Text Related to Algorithmic Efficiency in Scholarly Big Data using Recurrent Convolutional Neural Network Model. In: Choemprayong, S., Crestani, F., Cunningham, S., (eds.). Digital Libraries: Data, Information, and Knowledge for Digital Lives. ICADL 2017. Cham: Springer, 30–40.
Siegel, N., Horvitz, Z., Levin, R., Divvala, S. and Farhadi, A. (2016). FigureSeer: Parsing Result-figures in Research Papers. In: Leibe, B., Matas, J., Sebe, N., Welling, M., (eds.). Computer Vision – ECCV 2016, 9911, 664–680. Available at: https://link.springer.com/chapter/10.1007/978-3-319-46478-7_41.
Siegel, N., Lourie, N., Power, R. and Ammar, W. (2018). Extracting Scientific Figures with Distantly Supervised Neural Networks. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. NY: ACM, 223–232.
Stewart, J. G. (2009). Genre Oriented Summarization. [PhD diss.]. Carnegie Mellon University, Language Technologies Institute, School of Computer Science.
Ochieng, P. J., Djatna, T. and Kusuma, W. A. (2015). Tandem Repeats Analysis in DNA Sequences Based on Improved Burrows-Wheeler Transform. In: 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE, 117–122.
Tuarob, S., Bhatia, S., Mitra, P. and Giles, C. L. (2013). Automatic Detection of Pseudocodes in Scholarly Documents Using Machine Learning. In: 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 738–742.
Tuarob, S., Bhatia, S., Mitra, P. and Giles, C. L. (2016). AlgorithmSeer: A System for Extracting and Searching for Algorithms in Scholarly Big Data. IEEE Transactions on Big Data, 2(1), 3–17.
Tuarob, S., Mitra, P. and Giles, C. L. (2015). A Hybrid Approach to Discover Semantic Hierarchical Sections in Scholarly Documents. In: 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1081–1085.
Tyagi, N. and Sharma, S. (2012). Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page. International Journal of Soft Computing and Engineering (IJSCE), 2(3), 2231–2307. Available at: https://www.ijsce.org/wp-content/uploads/papers/v2i3/C0796062312.pdf.
Wang, J. (2009). Mean-variance Analysis: A New Document Ranking Theory in Information Retrieval. European Conference on Information Retrieval. Springer, 4–16.
Wise, M. J. (1995). Neweyes: A System for Comparing Biological Sequences Using the Running Karp-Rabin Greedy String-Tiling Algorithm. In: Proceedings. International Conference on Intelligent Systems for Molecular Biology, 3, 393–401.
Wu, J., Williams, K. M., Chen, H.-H., Khabsa, M., Caragea, C., Tuarob, S., Ororbia, А., D Jordan, D. and Giles, C. L. (2015). Citeseerx: AI in a Digital Library Search Engine. AI Magazine, 36(3), 35–48.
Yang, Y., Yu, P. and Gan, Y. (2011). Experimental Study on the Five Sort Algorithms. In: 2011 Second International Conference on Mechanic Automation and Control Engineering (MACE). IEEE, 1314–1317.
Zanibbi, R. and Blostein, D. (2012). Recognition and Retrieval of Mathematical Expressions. International Journal on Document Analysis and Recognition (IJDAR), 15(4), 331–357.