Incomplete and Complete Data Processing on the Internet

By Syamantak Saha

IEEE Internet Policy Newsletter, December 2018

The Internet provides a vast amount of data that can be used to gather information. However, it is important to retrieve the required data effectively, without the distraction or confusion introduced by unwanted results. The data domain, collection methods, and similar factors determine whether the input data is complete or incomplete, and appropriate processing can then be applied according to data engineering principles. Inappropriate processing can cause data hazards, where there is a mismatch between the input values and the expected or obtained results. It is therefore important to note the differences between complete and incomplete datasets, so that appropriate processing can be undertaken.

Incomplete Data Characteristics

  1. Data entered as input is not taken to be of absolute value.
  2. The input data is assumed to be sparse or incomplete, not fully representing the required values.
    This can happen with poor transmission lines or with processes that are unable to fully capture the data inputs.
  3. Computing techniques such as artificial intelligence are then applied to process the incomplete data[1].
  4. Anticipating the missing data introduces probabilistic values into the input dataset.
  5. The results obtained from incomplete data processing therefore contain the probabilistic values that were introduced while completing the user-provided input.
  6. This means that users then have to scroll through results that include data they did not request, even when they provided complete data for the search.

Complete Data Characteristics

  1. Data entered is taken to be of absolute value.
  2. The data entered is assumed to fully consist of the required values.
  3. Data engineering techniques are then applied that process the data with full confidence and reliability.
  4. No probabilistic values are introduced into the complete data input values.
  5. The results obtained are complete, as per the particular request of the user.
  6. This means that users usually do not have to scroll through results that include data they did not request.

Characteristics                    Incomplete Data Inputs    Complete Data Inputs
Absolute Value                     No                        Yes
Fully Representational             No                        Yes
Applied Artificial Intelligence    Yes                       No
Data Anticipation                  Yes                       No
Probabilistic Values of Output     Yes                       No
Manual Filtering of Output         Yes                       No

               Table 1: Summary of Complete and Incomplete Data Input Characteristics

 

Both methods have their use for particular engineering problems. Incomplete data processing is used in health technologies[2][3][4][5][6], financial markets[7][8][9], and similar areas where all the input values are not readily available. In comparison, complete data processing is used widely in areas such as accounting systems[10][11], automobiles, and music recordings[12][13][14][15], where complete data is readily available for processing.

Incomplete Data Processing

In health technologies, when evaluating a person's condition for a particular illness, a risk-averse principle is applied to identify any positive findings. This means that the health data is analyzed from the perspective that the condition could be true; if the analysis shows a margin of possibility that the condition exists, a treatment is prescribed. In terms of data, the collected reports contain only a sample of the patient's biological values, and the report does not contain all the data in terms of metadata or the timeline for the data. Decisions are therefore based on incomplete datasets that are completed with some form of knowledge, mostly drawn from databases of previous patient conditions[2][3][4][5][6]. Heuristically, artificial intelligence is applied to complete the collected sample that is incomplete. Similarly, in financial markets, the decision to buy or sell a particular asset is derived from sampled market conditions such as the current asset price, interest rates, and so on. Again, this metadata is incomplete by definition, and to a considerable extent the values are as well[7][8][9]. To fill in the missing values, previously collected market knowledge is applied to this incomplete dataset to determine a buy or sell decision for the particular asset. This applied market knowledge is a manifestation of artificial intelligence used to complete the dataset.
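
As a rough illustration of this completion step, the sketch below fills missing fields in a patient record from the statistics of previously observed patients. It is a minimal sketch under the assumptions described above; the field names, prior values, and imputation rule are hypothetical and do not reflect any specific system from the cited works.

    # Hypothetical sketch: completing an incomplete patient record with
    # knowledge drawn from previously observed patients (simple imputation).
    # Field names and values are illustrative only.
    from statistics import mean

    # "Database" of previous patient conditions (prior knowledge).
    previous_patients = [
        {"heart_rate": 72, "body_temp_c": 36.8, "dizziness": 0},
        {"heart_rate": 95, "body_temp_c": 37.9, "dizziness": 1},
        {"heart_rate": 88, "body_temp_c": 37.2, "dizziness": 0},
    ]

    def complete_record(incomplete, priors):
        """Fill missing fields with the mean of previously observed values.
        The filled-in values are probabilistic estimates, not measurements."""
        completed = dict(incomplete)
        for field in priors[0]:
            if completed.get(field) is None:
                completed[field] = mean(p[field] for p in priors)
                # Mark the value as imputed so downstream users know it is
                # an estimate introduced during completion.
                completed[f"{field}_imputed"] = True
        return completed

    # Only the heart rate was captured; the rest of the record is missing.
    sample = {"heart_rate": 120, "body_temp_c": None, "dizziness": None}
    print(complete_record(sample, previous_patients))

The imputed flags make explicit that the completed values are probabilistic, which is the property summarized for incomplete data inputs in Table 1.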

Complete Data Processing

Accounting systems provide an example of processing complete datasets. Here, invoices, purchase orders, general ledger data, and accounts receivable and payable are taken to be complete in both their metadata and their values[10][11]. There is no requirement to apply any further market knowledge to complete the accounting values, so no artificial intelligence is applied to process this complete dataset. Similarly, in music recordings, the provided data is assumed to be complete, and proactive effort is made to preserve the original recording without introducing any noise[12][13][14][15]. If artificial intelligence were applied to the recorded values, it would corrupt the original recording and reduce its intended use.
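
A minimal sketch of the complete-data path is shown below, under the assumption that records are either processed exactly as provided or rejected; the invoice fields and values are hypothetical.

    # Hypothetical sketch: complete-data processing of invoice records.
    # Records are either processed exactly as provided or rejected;
    # no values are estimated or filled in.
    REQUIRED_FIELDS = ("invoice_id", "amount", "date")

    def process_invoices(invoices):
        total = 0.0
        rejected = []
        for inv in invoices:
            if any(inv.get(f) is None for f in REQUIRED_FIELDS):
                # Incomplete record: reject it rather than impute a value,
                # which would distort the accounting totals.
                rejected.append(inv)
                continue
            total += inv["amount"]  # value used exactly as entered
        return total, rejected

    invoices = [
        {"invoice_id": "A-001", "amount": 1200.00, "date": "2018-11-02"},
        {"invoice_id": "A-002", "amount": None, "date": "2018-11-15"},
    ]
    total, rejected = process_invoices(invoices)
    print(total, len(rejected))  # 1200.0 1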

Identification of Complete or Incomplete Data

It is therefore important to understand whether the data to be processed should be treated as incomplete or complete, depending on the domain of use. Applying incomplete-data processing to complete data would be destructive to the original intended data, reducing its value and application. Similarly, leaving an incomplete dataset unprocessed would provide results that are random and incomplete. A simple identification check is sketched below.
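
The sketch below illustrates one way such an identification step could work, under the assumption that completeness can be judged from required fields being present; the field list and routing labels are hypothetical.

    # Hypothetical sketch: deciding whether a dataset should be routed to
    # complete-data or incomplete-data processing before any values are touched.
    def is_complete(records, required_fields):
        """A dataset is treated as complete only if every record carries
        every required field with a non-missing value."""
        return all(
            rec.get(field) is not None
            for rec in records
            for field in required_fields
        )

    def route(records, required_fields):
        if is_complete(records, required_fields):
            return "complete-data processing (no imputation)"
        return "incomplete-data processing (completion with domain knowledge)"

    readings = [{"heart_rate": 120, "body_temp_c": None}]
    print(route(readings, ("heart_rate", "body_temp_c")))
    # -> incomplete-data processing (completion with domain knowledge)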

Contra Processing and Data Hazards

If complete-data processing is applied to an incomplete dataset, as can happen in the health domain, a patient can easily be over-prescribed or under-prescribed for a particular illness. For example, if the heart rate of a patient is measured after heavy exercise, it may indicate a hypertension condition that is not present when the patient is rested. Taking only the ECG value, although of primary importance, is incomplete, and the dataset should be completed with further values. If an additional value for the illness, such as body temperature, which can be collected nominally, or a symptom such as dizziness is added to the heart rate collected at a rested state, it provides a better basis for a decision on the illness and the treatment for the patient[6]. A proactive effort to collect further data and apply artificial intelligence is therefore used to complete the very sparse collected data for evaluating the illness.

In financial markets, the asset price is a core piece of data for deciding whether to buy or sell an asset. The asset price alone is incomplete data, because in itself it does not carry the other information required for a complete decision. For example, if the market is growing, a buy may be warranted even if the price is above expectations, and an estimate of how long the market growth will be sustained is needed to determine the quantity and the holding period for the asset[7]. These additional factors determine the final buy or sell decision. Such additional factors are a form of artificial intelligence, and financial firms have their own trade secrets in the market knowledge built into the final algorithm that completes the dataset for making the decision[8]. Without such applied artificial intelligence, financial firms would find it nearly impossible to operate on the asset price alone, and doing so would increase the firm's risk, with repercussions for customers in the form of higher failure rates and lower than expected returns on their investments[9].

Conversely, if incomplete-data processing is applied to complete data, irregular and incorrect data is obtained. For example, if the total invoices of a business are taken to be incomplete and further data is added through artificial intelligence, the result is undesirable because the data is no longer accurate for the particular requirement. Suppose the total invoices of the business are collected, but a market evaluation shows that total invoices for such a business are usually higher or lower by a particular percentage. Applying this as a value enhancement through artificial intelligence would amount to a fraudulent activity and make the data unusable for any business or regulatory purpose. Similarly, when a song is recorded using various instruments such as vocals, drums, piano, and guitar, the original recording is to be preserved as the best recording of the song. If song evaluations show that a 'drum intro' should appear at a particular point or that a piano should play for particular durations, and such artificial intelligence algorithms are applied to the data, they would almost certainly reduce the value of the original recording. In the domain of music recordings, such artificial intelligence is therefore not a warranted introduction and is taken to diminish the efforts of the musicians who produced the music.
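
To make the accounting hazard concrete, the sketch below contrasts the correct complete-data total with a "market-adjusted" total produced by treating the same data as incomplete. The invoice amounts and the adjustment factor are invented purely for illustration.

    # Hypothetical sketch of the contra-processing hazard: applying an
    # incomplete-data style "adjustment" to invoice data that is already complete.
    invoice_amounts = [1200.00, 450.50, 980.25]  # complete, as recorded

    # Correct complete-data processing: use the values exactly as entered.
    reported_total = sum(invoice_amounts)

    # Contra processing: a market evaluation (illustrative figure) suggests
    # businesses of this kind usually invoice about 8% more, so an
    # "AI adjustment" inflates the total. The result no longer reflects
    # the books and would be unusable for business or regulatory purposes.
    MARKET_ADJUSTMENT = 1.08
    adjusted_total = reported_total * MARKET_ADJUSTMENT

    print(f"reported: {reported_total:.2f}")   # 2630.75
    print(f"adjusted: {adjusted_total:.2f}")   # 2841.21 -- a data hazard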

Implications for the Internet

Internet searching and its uses require resolving the realms of incomplete and complete data processing. Often, when undesirable results are obtained[16], the cause is artificial intelligence processing of data that is largely disparate from the domain of the data and its market. A startling effect also arises when an additional value is added to the user's input data and produces an unexpected result. Human inattention, especially when using computers, gives artificial intelligence applications an opportunity to apply, profess, and propagate policies, market values, and practices that are not necessarily the user's need but rather an unwarranted imposition. Engineers certainly have to be aware of such risky areas of artificial intelligence and ensure that appropriate, domain-specific processing is applied to the particular complete and incomplete data intended to be used by society.

References:

[1] Charu C. Aggarwal ; Philip S. Yu; A Survey of Uncertain Data Algorithms and Applications; IEEE Transactions on Knowledge and Data Engineering; Year: 2009 , Volume: 21 , Issue: 5; Pages: 609 – 623.

[2] Scott A. Stevens ; Jesse Stimpson ; William D. Lakin ; Nimish J. Thakore ; Paul L. Penar ; A Model for Idiopathic Intracranial Hypertension and Associated Pathological ICP Wave-Forms; IEEE Transactions on Biomedical Engineering; Year: 2008 , Volume: 55 , Issue: 2; Pages: 388 – 398.

[3] Roger J. Bagshaw ; Arnost Fronek ; Lysle H. Peterson ; Harry F. Zinsser; Dispersion of Blood Pressure and Heart Rate in Essential Hypertension; IEEE Transactions on Biomedical Engineering; Year: 1975 , Volume: BME-22 , Issue: 6; Pages: 508 – 512.

[4] Mario Pascual Carrasco ; Carlos H. Salvador ; Pilar G. Sagredo ; Joaquín Márquez-Montes ; Miguel A. González de Mingo ; Juan A. Fragua ; Montserrat Carmona Rodríguez ; Luis M. García-Olmos ; Fernando García-López ; Adolfo Muñoz Carrero ; Jose L. Monteagudo; Impact of Patient–General Practitioner Short-Messages-Based Interaction on the Control of Hypertension in a Follow-up Service for Low-to-Medium Risk Hypertensive Patients: A Randomized Controlled Trial; IEEE Transactions on Information Technology in Biomedicine; Year: 2008 , Volume: 12 , Issue: 6; Pages: 780 – 791.

[5] R. Hornero ; M. Aboy ; D. Abasolo ; J. McNames ; B. Goldstein ; Interpretation of approximate entropy: analysis of intracranial pressure approximate entropy during acute intracranial hypertension; IEEE Transactions on Biomedical Engineering; Year: 2005 , Volume: 52 , Issue: 10; Pages: 1671 – 1680.

[6] Roger J. Bagshaw ; Arnost Fronek ; Lysle H. Peterson ; Harry F. Zinsser ; Dispersion of Blood Pressure and Heart Rate in Essential Hypertension; IEEE Transactions on Biomedical Engineering; Year: 1975 , Volume: BME-22 , Issue: 6; Pages: 508 – 512.

[7] Guangwei Shi ; Liying Ren ; Zhongchen Miao ; Jian Gao ; Yanzhe Che ; Jidong Lu ; Discovering the Trading Pattern of Financial Market Participants: Comparison of Two Co-Clustering Methods; IEEE Access; Year: 2018 , Volume: 6; Pages: 14431 – 14438.

[8] Konstantinos Drakakis; Application of signal processing to the analysis of financial data [In the Spotlight]; IEEE Signal Processing Magazine; Year: 2009 , Volume: 26 , Issue: 5; Pages: 158 – 160.

[9] Wei Cao ; Longbing Cao; Financial Crisis Forecasting via Coupled Market State Analysis; IEEE Intelligent Systems; Year: 2015 , Volume: 30 , Issue: 2; Pages: 18 – 25.

[10] Celia Desmond; Accounting for project management activities; IEEE Engineering Management Review; Year: 2014 , Volume: 42 , Issue: 3; Pages: 13 – 14.

[11] G.L. Geerts ; W.E. McCarthy; Expert opinion [accounting]; IEEE Intelligent Systems and their Applications; Year: 1999 , Volume: 14 , Issue: 4; Pages: 89 – 94.

[12] Nadine Kroher ; Emilia Gómez; Automatic Transcription of Flamenco Singing From Polyphonic Music Recordings; IEEE/ACM Transactions on Audio, Speech, and Language Processing; Year: 2016 , Volume: 24 , Issue: 5; Pages: 901 – 913.

[13] Juan P. Bello; Measuring Structural Similarity in Music; IEEE Transactions on Audio, Speech, and Language Processing; Year: 2011 , Volume: 19 , Issue: 7; Pages: 2013 – 2025.

[14] Peter Grosche ; Meinard Muller; Extracting Predominant Local Pulse Information From Music Recordings; IEEE Transactions on Audio, Speech, and Language Processing; Year: 2011 , Volume: 19 , Issue: 6; Pages: 1688 – 1701.

[15] Wei-Ho Tsai ; Hsin-Min Wang; Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals; IEEE Transactions on Audio, Speech, and Language Processing; Year: 2006 , Volume: 14 , Issue: 1; Pages: 330 – 341.

[16] H. Tirri; Search in vain, challenges for Internet search; Computer; Year: 2003 , Volume: 36 , Issue: 1; Pages: 115 – 116.


Syamantak Saha

Syamantak Saha has extensive experience as a data engineering consultant. He has worked on several successful industry projects, providing the latest data engineering benefits to clients. Syamantak has also actively participated in engineering research, with interests in areas such as big data, data lakes, and data on the Internet.

 

Editor:

Ali Kashif Bashir

Ali Kashif Bashir (M’15, SM’16) is an Associate Professor in the Faculty of Science and Technology, University of the Faroe Islands, Faroe Islands, Denmark. He received his Ph.D. degree in computer science and engineering from Korea University, South Korea. In the past, he held appointments with Osaka University, Japan; Nara National College of Technology, Japan; the National Fusion Research Institute, South Korea; Southern Power Company Ltd., South Korea; and the Seoul Metropolitan Government, South Korea. He is also attached to the Advanced Network Architecture Lab as a joint researcher. He supervises and co-supervises several graduate (MS and PhD) students. His research interests include cloud computing, NFV/SDN, network virtualization, network security, IoT, computer networks, RFID, sensor networks, wireless networks, and distributed computing. He serves as the Editor-in-Chief of the IEEE INTERNET TECHNOLOGY POLICY NEWSLETTER and the IEEE FUTURE DIRECTIONS NEWSLETTER. He is an Editorial Board Member of journals such as IEEE ACCESS, the Journal of Sensor Networks, and Data Communications. He has also served as guest editor on several special issues in journals of IEEE, Elsevier, and Springer. He is actively involved in organizing workshops and conferences. He has chaired several conference sessions, given several invited and keynote talks, and reviewed articles for journals such as the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, the IEEE Communications Magazine, the IEEE COMMUNICATION LETTERS, IEEE Internet of Things, and the IEICE Journals, and for conferences such as IEEE INFOCOM, IEEE ICC, IEEE GLOBECOM, and IEEE Cloud of Things.

 


