DATA PREPARATION FOR FORECASTING ECONOMIC INDICATORS WITH MACHINE LEARNING
07.03.2023 22:13
[1. Information systems and technologies]
Author: Bohdan Zakharchuk, postgraduate student, National Technical University of Ukraine "Ihor Sikorskyi Kyiv Polytechnic Institute"
In the last few years there has been growing interest in a cutting-edge technology stack: data science. Some companies do not even understand what Machine Learning, Big Data or Artificial Intelligence is, yet they want to build products using these technologies. Machine Learning, a subset of artificial intelligence, has been gaining popularity in commercial applications as well as in government institutions. The growth of digitalization in all spheres of business and everyday life increases the quantity and quality of new data that can be processed and used to build complex models and algorithms for predictive analytics. Data science is a constantly evolving scientific discipline that aims at understanding data (both structured and unstructured) and searching for the insights it carries. It takes advantage of big data and a wide array of studies, methods, technologies, and tools, including machine learning, AI, deep learning, and data mining. The field relies heavily on data analysis, statistics, mathematics, and programming, as well as data visualization and interpretation. Together, these help data scientists make informed decisions based on data and determine how to gain value and relevant business insights from it.
In my investigation, data mining covers at least the first steps, where I need to find raw data and then proceed step by step through the data mining flow. Alongside data mining, I need to understand which type of data I will be able to collect: structured or unstructured. Structured data is information that is highly organized, factual, and to the point. It usually comes in the form of letters and numbers that fit neatly into the rows and columns of tables. Structured data commonly exists in tables such as Excel files and Google Docs spreadsheets. Unstructured data has no predefined structure and comes in a wide diversity of forms. Examples of unstructured data range from imagery and text files such as PDF documents to video and audio files. Structured data is often described as quantitative data, meaning that its objective and predefined nature allows us to easily count, measure, and express it in numbers. Unstructured data, by contrast, is called qualitative data, in the sense that it has a subjective and interpretive nature. Such data can be categorized according to its characteristics and traits. Thus, the terms quantitative and qualitative describe the nature of the data. [1]
Fig. 1 Data Science discipline outlook
Regarding data sources, I have found up to 10 publicly available governmental and commercial resources that provide structured historical data on economic indicators of Ukraine. All of these resources use well-known file formats: xls, xlsx, csv and, rarely, zip archives with csv files inside. Such data can be easily downloaded and loaded into first-level storage called a data lake. [2]
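The download step described above can be sketched as follows. This is only a minimal illustration, assuming the file has already been fetched as bytes and that the data lake is a local directory. The function names store_raw_file and read_csv_rows and the directory layout are hypothetical, not part of any of the mentioned resources.

```python
import csv
import io
import zipfile
from pathlib import Path


def store_raw_file(payload: bytes, filename: str, lake_dir: Path) -> list[Path]:
    """Save a downloaded file into the raw zone of the data lake.

    Zip archives are unpacked and only their CSV members are stored;
    any other file (csv, xls, xlsx) is written as-is.
    """
    lake_dir.mkdir(parents=True, exist_ok=True)
    stored = []
    if filename.lower().endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            for member in archive.namelist():
                if member.lower().endswith(".csv"):
                    # Drop any folder prefix used inside the archive.
                    target = lake_dir / Path(member).name
                    target.write_bytes(archive.read(member))
                    stored.append(target)
    else:
        target = lake_dir / filename
        target.write_bytes(payload)
        stored.append(target)
    return stored


def read_csv_rows(path: Path) -> list[dict]:
    """Read a stored CSV file into a list of dictionaries, one per row."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Storing the files unchanged keeps the raw zone independent of later preparation steps. The UTF-8 encoding is an assumption and may need adjusting for a particular source.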
The next question is what we can actually do with this raw data and how to prepare it for the following steps. First, however, we need to answer what we are preparing the data for, so let us define the task we are trying to solve and how machine learning can help with it. I would say that ML fully fits the goal of my investigation, which is to build a model for predicting economic indicators for the Ukrainian market. The core artifact of any machine learning execution is a mathematical model, which describes how an algorithm processes new data after being trained on a subset of historical data. The goal of training is to develop a model capable of producing a target value (attribute), some unknown value of each data object. In business terms, machine learning addresses a broad spectrum of tasks, but at a higher level the tasks that algorithms solve fall into five major groups: classification, cluster analysis, regression, ranking, and generation. Forecasting economic indicators corresponds to a task from the regression group. Regression algorithms define numeric target values instead of classes. By estimating numeric variables, these algorithms are powerful at predicting product demand, sales figures, marketing returns, etc. For example: How many items of this product will we be able to sell next month? What is the airfare for this destination going to be?
Now that we understand the task to be solved, the next step is to choose the most suitable training approach for the machine learning model, and supervised learning seems the best match here. Supervised learning algorithms operate on historical data that already has target values. Mapping these target values in training datasets is called labeling. In other words, humans tell the algorithm what values to look for and which decisions are right or wrong.
By looking at a label as an example of a successful prediction, the algorithm learns to find these target values in future data. Today, supervised machine learning is actively used for both classification and regression problems, since target values are generally already available in training datasets. This makes supervised learning the most popular approach employed in business. For example, if you choose binary classification to predict the likelihood of lead conversion, you know which leads converted and which did not. You can label the target values (converted/not converted, or 0/1) and then train a model. Supervised learning algorithms are also used for recognizing objects in pictures, determining the mood of social media posts, and predicting numeric values such as temperature, prices, etc. [3]
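To make the idea of supervised regression on historical data concrete, here is a minimal sketch in plain Python. It assumes the simplest possible setup, which is not necessarily the model used in this research: the feature is the previous value of an indicator, the label is the current value, and the model is a straight line fitted by ordinary least squares. All function names are illustrative.

```python
def make_lagged_pairs(series: list[float]) -> tuple[list[float], list[float]]:
    """Turn a time series into (feature, label) pairs: previous value -> current value."""
    return series[:-1], series[1:]


def fit_simple_regression(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (feature, label) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("feature has zero variance; a slope cannot be fitted")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope: float, intercept: float, x: float) -> float:
    """Apply the fitted line to a new feature value."""
    return slope * x + intercept
```

Here the labels come for free from the historical series itself, which is why the supervised approach suits forecasting: calling make_lagged_pairs on the history, fitting the line, and passing the latest observed value to predict gives a one-step-ahead estimate. A real model would likely use more lags and additional features.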
During the literature review I found several concepts that might help my research, such as primary and secondary data types, virtual storage, and data blending. Collectively, these articles outline the critical role of data collection with subsequent validation before the data is placed in final storage and used to develop a machine learning model. In fact, the data quality process has become important before data is used in a prediction exercise, and it deserves sufficient attention. [4]
As far as I can see, I will be working with structured data, mostly in csv format and possibly in other formats such as xml and json. Files will be downloaded into a data lake, an unstructured data storage. Then I need to build a data pipeline to extract, prepare, and validate data for training an ML model with the selected training approach for each economic indicator. Since we will investigate historical and statistical economic data and need to build a predictive model on top of it, this is a regression task, and the model can be trained with the supervised learning approach.
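The extract, prepare, and validate stages of such a pipeline might look like the sketch below. It is an illustration under assumptions rather than the final design: the column names period and value, the YYYY-MM period format, and the handling of decimal commas are all hypothetical and would have to be adapted to each real source.

```python
import csv
import io
from datetime import datetime


def extract(csv_text: str) -> list[dict]:
    """Parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def prepare(rows, period_field="period", value_field="value"):
    """Convert raw strings into (datetime, float) records.

    Rows that cannot be parsed are returned separately so they can be inspected.
    """
    clean, rejected = [], []
    for row in rows:
        try:
            period = datetime.strptime(row[period_field].strip(), "%Y-%m")
            # Assumption: some sources write decimals with a comma.
            value = float(row[value_field].replace(",", ".").strip())
        except (KeyError, ValueError, AttributeError):
            rejected.append(row)
            continue
        clean.append((period, value))
    return clean, rejected


def validate(records):
    """Sort records by period and drop duplicate periods, keeping the first one seen."""
    seen = set()
    unique = []
    for period, value in sorted(records, key=lambda record: record[0]):
        if period not in seen:
            seen.add(period)
            unique.append((period, value))
    return unique
```

Keeping rejected rows instead of silently dropping them supports the data quality checks discussed above, since every discarded observation can be traced back to its source file.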
References
1. Altexsoft, Software resource & development engineering. Structured vs. Unstructured data. 14.12.2020. URL: https://www.altexsoft.com/blog/structured-unstructured-data/
2. Samarth Agarwal. Data Scientist’s toolkit: How to gather data from different sources. 19.07.2019. URL: https://towardsdatascience.com/data-scientists-toolkit-how-to-gather-data-from-different-sources-1b92067556b3
3. Altexsoft, Software resource & development engineering. ML description and difference with AI and Big Data. URL: https://www.altexsoft.com/whitepapers/machine-learning-bridging-between-business-and-data-science/
4. Eduard Hovy, Carnegie Mellon University. Data Alignment and Integration. January 2005. URL: https://www.researchgate.net/publication/220478552_Data_Alignment_and_Integration