Wenzhou Business College Undergraduate Graduation Project (Thesis): Foreign Literature Translation

Graduation project (thesis) title:
Name:
Student ID:
Supervisor:
Class: 19 Computer Science (Undergraduate) *

Original title: EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem
Authors: Panagiotis K, Argyro M, Athanasios K
Source: Panagiotis K, Argyro M, Athanasios K, et al. EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem [J]. Information, 2023, 14(2): 93-93.

EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem

Abstract: Big Data is a phenomenon that affects today’s world, with new data being generated every second.
Today’s enterprises face major challenges from increasingly diverse data, as well as from indexing, searching, and analyzing such enormous amounts of data. In this context, several frameworks and libraries for processing and analyzing Big Data exist. Among those frameworks, Hadoop MapReduce, Mahout, Spark, and MLlib appear to be the most popular, although it is unclear which of them best suits and performs in various data processing and analysis scenarios. This paper proposes EverAnalyzer, a self-adjustable Big Data management platform built to fill this gap by exploiting all of these frameworks. The platform is able to collect data both in a streaming and in a batch manner, utilizing the metadata obtained from its users’ processing and analytical processes applied to the collected data. Based on this metadata, the platform recommends the optimum framework for the data processing/analytical activities that the users aim to execute. To verify the platform’s efficiency, numerous experiments were carried out using 30 diverse datasets related to various diseases. The results revealed that EverAnalyzer correctly suggested the optimum framework in 80% of the cases, indicating that the platform made the best selections in the majority of the experiments.

Keywords: Big Data; data management; data collection; data analysis; data processing; Hadoop; MapReduce; Spark; Mahout; MLlib

1. Introduction

Global internet consumption has increased due to the growth of the Internet of Things (IoT) and the extensive use of social media. As a result of this rise, vast amounts of data have accumulated, which in most cases are extremely difficult to handle. According to Statista [1], the total amount of data consumed globally increased to 64.2 Zettabytes in 2020 and 79 Zettabytes in 2021, and is expected to exceed 180 Zettabytes by 2025. At the same time, Forbes [2] estimates that more than 150 Zettabytes of real-time data will be required for analysis by 2025.
Companies dealing with structured data have different requirements than companies dealing with unstructured data, according to Forbes, which found that over 95% of organizations require assistance in managing unstructured datasets.

All this information is referred to as Big Data, which is defined as massive volumes of data collected from multiple sources and formats [3]. Many businesses gather and analyze data from various sources to make better business decisions regarding their customers, market demands, and trends. For these purposes, various Big Data processing and analysis technologies have been created to efficiently extract information from these large datasets in order to successfully evaluate the underlying data [4]. Among those tools, the ones built on the Apache Hadoop Ecosystem are the most widely used [5]. Hadoop has become one of the most well-known tools in Information Technology (IT) business and academic environments, due to its capacity to manage huge amounts of data.

However, as modern internet users generate massive amounts of unstructured data, the need for memory resources is increasing as well [6], with distributed data processing being a good answer to the demand for increased memory resources [7]. In this regard, two of the most widely used tools for distributed data processing are the open-source tools MapReduce [8] and Spark [9], which provide effective solutions for processing and analyzing massive amounts of data, while providing useful functions to developers, who can easily exploit them via Application Programming Interfaces (APIs) [10]. Both tools are based on the Hadoop Ecosystem, where MapReduce is used to process data in a processing cluster in parallel, whereas Spark is another solution that has been built for clustered data processing [11].
However, Spark’s major purpose is to provide a programming model that can be utilized in any form of Big Data application that is constrained by the MapReduce features, while remaining fault tolerant [12]. Spark is not only an alternative to MapReduce, but it also provides a variety of real-time data processing functionalities. The aforementioned tools serve as the basis for Mahout [13] and MLlib [14], which are used to perform Big Data analysis using Machine Learning (ML) algorithms [15].

The purpose of this research is to develop and deploy EverAnalyzer, a flexible Big Data management platform capable of automatically gathering, pre-processing, processing, and analyzing both real-time (i.e., streaming) and stored (i.e., batch) data. Most existing Big Data management platforms already support such a pipeline, but they rely on off-the-shelf technologies and tools. In addition, these platforms support tools that perform standalone tasks, such as individual data processing or individual data analysis tasks. Hence, those platforms exploit specific frameworks, each with its own set of benefits, shortcomings, and limitations. The solution to this problem is a system that can comprehend the advantages and disadvantages of the various tools used to manage diverse datasets for a processing or analytical activity, and that can identify the optimum tool per case so that actions are less time-consuming and more efficient. EverAnalyzer bridges exactly this gap: it automatically recognizes which of the underlying data processing (i.e., MapReduce or Spark) and data analysis (i.e., Mahout or MLlib) tools are most suitable for successfully and efficiently processing and analyzing the ingested data.
The system’s choice is influenced not only by the amount of data, but also by the execution speed of prior processing and analysis tasks that have been applied to relevant data scenarios. As a result, EverAnalyzer may be applied to a wide range of scenarios, better assisting users in both processing and analytical activities, hence decreasing their overall workload. To verify all of the above, the platform was evaluated through an experiment that assesses EverAnalyzer’s capability to provide empirical suggestions to its users about the best framework to be utilized for the operations that they wish to perform. Data was collected from thirty (30) distinct datasets related to various diseases and conditions in the healthcare sector. The data was pre-processed, processed, and analyzed, while EverAnalyzer provided a suggestion for the most suitable framework (i.e., MapReduce or Spark for processing tasks, and Mahout or MLlib for analysis tasks) based on the shortest execution time for the requested processing/analysis process. All of the platform’s suggestions were gathered and compared with the framework that actually had the best execution time between the two candidate tools, revealing that EverAnalyzer made a correct recommendation 80% of the time. Moreover, as the number of datasets increased, this percentage appeared to climb monotonically. This means that each performed processing/analysis task trains EverAnalyzer to produce better and more representative results. Hence, if the platform uses a larger number of datasets, the percentage of correct answers is expected to increase, raising the platform’s overall accuracy above 80%.

The remainder of this paper is organized as follows. Section 2 summarizes the literature review that was conducted to gather meaningful insights for the study, focusing on Big Data and its lifecycle, and in particular on the processing and analysis phases.
In Section 3, a thorough analysis of how the proposed platform (EverAnalyzer) is designed and built is presented, including the platform’s goals and users as well as its architecture. Section 4 presents the experimentation results generated by EverAnalyzer, and Section 5 interprets those results and relates them to the studied literature. Finally, Section 6 contains the study’s conclusions, limitations, next steps, and future research directions; it also describes future experiments that would be interesting to conduct using EverAnalyzer’s design and implementation guidelines.

2. Literature Review

Big Data is defined as large volumes of data collected from various sources and in various formats [3]. Such data have some specific characteristics (the Vs of the data), which primarily refer to data Volume (i.e., data size), Variety (i.e., data format), Velocity (i.e., data production rate), Veracity (i.e., degree of data authenticity), Validity (i.e., data validity), Volatility (i.e., time of data validation), and Value (i.e., data usefulness in terms of analysis) [3]. These characteristics indicate that Big Data is challenging to manage, but when it is properly managed, it may be highly valuable. For this purpose, companies can use Big Data to evaluate and extract important information about their products and customers. However, due to the wide range of their forms and sizes, analyzing them is sometimes a complicated and time-consuming task. At the same time, people are increasingly using the Internet to help them with their everyday activities and entertainment, which causes the amount of collected data to increase year after year.

This results in data that may be structured, semi-structured, or even unstructured, making them difficult to manage with traditional Relational Database Management Systems (RDBMS), which are expensive and time-consuming to implement [16].
Structured data are data for which both the information they contain and the way that information is organized are known. Semi-structured data, on the other hand, lack some specifications about the information they contain, whereas unstructured data convey no information about their structure. Large amounts of these data can be collected by mobile phones, sensors, Global Positioning System (GPS) signals, social media, and other sources that generate massive amounts of data every second [17]. Consequently, Big Data refers to either batch data deriving from ready-to-use datasets that require some processing or analytic activities (e.g., already stored data derived from external systems’ databases), or streaming data derived from live sources that are constantly streaming information (e.g., real-time data gathered from social media) [18].

As a result, managing Big Data throughout their lifecycle has become a very challenging task that never ceases to pique the interest of enterprises and researchers. More specifically, the utilization of Big Data is represented by a lifecycle that includes a plethora of phases, beginning with the collection of the data and concluding with their final destruction [19].
Figure 1 depicts all of these phases, namely: (i) collection, in which data are collected from various sources, most of the time in formats that are difficult to handle due to their unstructured nature; (ii) storage, in which the ingested data are stored in the appropriate database; (iii) processing, in which data are pre-processed into a standard structure to make them easier to manage in subsequent phases; (iv) analysis, in which various ML methods are used to produce meaningful results and insights from the stored data; (v) utilization, in which the extracted results and gained insights are put to use in a variety of real-life and testing scenarios; (vi) destruction, the final and most important phase of the entire lifecycle, since many sensitive data may be collected from various sources during the collection phase, requiring compliance with a strict protocol to ensure that their confidentiality, integrity, and availability are not compromised. It should be emphasized that the proposed platform focuses on the phases of collection, storage, processing, and analysis, which are further analyzed below.

2.1. Big Data Collection

Big Data collection is described as the process of gathering massive amounts of data in order to further analyze them and obtain useful results [20,21]. These data can be collected using traditional methods such as questionnaires and interviews; however, there is a plethora of more effective approaches. Web services, sensor-equipped devices such as mobile phones and tablets, and smart transportation cards are just a few examples [22]. All the data collected from these devices may be either batch, meaning that they are collected up to a predefined size and then stored all together to be analyzed later as a set of data, or streaming, referring to data that are analyzed while being collected.
The distinction between those two kinds of data is that streaming data processing is applied directly to the ingested data, whereas batch data processing collects and preprocesses a predetermined quantity of data [18]. Furthermore, if it is not possible to collect enough data for a processing/analytical activity, there are methods for creating synthetic data [23], which stand in for the real data that the required analysis would most likely use.

Various tools, such as Sebek [24], Hflow [25], Honeywall [26], Nepenthes [27], Kojoney [28], and Capture-HPC [29], have been built to successfully collect such varied types and formats of data. Kafka [30] and Flume [31] are two of the most widely used data collection tools. Whereas Kafka is a streaming data collection and processing tool, Flume is primarily used to manage infrastructures for collecting streaming data as batch data. Flafka is created by combining those two tools, providing the ability to save streaming data as batch data by exploiting both Kafka and Flume [32].

2.2. Big Data Storage

Big Data storage is described as the process of storing and managing large-scale datasets while maintaining data access reliability and availability [33,34]. Big Data storage has a significant impact on the infrastructure of the system that adopts it. On the one hand, the storage infrastructure must provide reliable space for storage services; on the other hand, it must also provide a dynamic access interface for querying and analyzing large amounts of data.

Because the volume of Big Data is continuously expanding, complex systems known as Database Management Systems (DBMS) are increasingly being employed to store and manage these data. Structured Query Language (SQL) systems and Non-SQL (NoSQL) systems are the two representative types of DBMSs [35].
NoSQL systems are preferable for storing and managing Big Data, since SQL systems require organized data to be efficient, whilst NoSQL systems are meant to be used for unstructured data. To better manage the variety of forms of the existing unstructured data, NoSQL DBMSs are classified into three separate core categories, namely: (i) key-value stores, which store data as a collection of key-value pairs in which a key serves as a unique identifier, with both keys and values ranging from simple objects to complex compound objects (e.g., Redis [36], Scalaris [37], Tokyo Tyrant [38], Riak [39]); (ii) document stores, which are databases for storing information in the form of documents (e.g., SimpleDB [40], CouchDB [41], MongoDB [42], Terrastore [43]); (iii) column stores, which use tables, rows, and columns, but in which, unlike a relational database, the names and format of the columns can vary from row to row in the same table (e.g., Bigtable [44], HBase [45], HyperTable [46], Cassandra [47]).

2.3. Big Data Processing

Big Data processing is a group of techniques for accessing large amounts of data in order to extract meaningful information for decision support and provision [48,49]. Big Data processing employs a range of methods, such as wordcount and string matching, which can be distributed across vast clusters of processing units [50]. Data processing algorithms typically have low algorithmic complexity, allowing them to perform quick computations. They are simple to implement and can interpret a variety of datasets, and, due to their high speed, they may be used on any dataset regardless of its size. However, directly obtained datasets (i.e., raw datasets) frequently cannot be fed straight into a data processing task, since in the case of Big Data such datasets do not comply with a specific structure, as they derive from a broad range of sources.
Thus, Big Data must first go through a data pre-processing phase to normalize the data structure before going through a data processing job. After the data structure is normalized, it is then simple to process the data using the preferred data processing algorithms.

At the same time, traditional programming paradigms are incapable of handling data effectively because it is often stored on thousands of commodity servers. As a result, new parallel programming methods are being deployed in datacenters to improve the performance of NoSQL databases [48]. MapReduce is a popular programming model for Big Data processing on large-scale commodity clusters, and it has evolved into an important component of the Hadoop ecosystem [48]. The main advantage of this programming model is its simplicity, which allows its users to easily exploit it for Big Data processing tasks [51]. Pig is an SQL-like environment that is used for performing processing tasks upon Big Data [52], whereas Hive is another example of such a tool that provides a better environment than MapReduce and simplifies code development, as programmers are not required to deal with the complexities of MapReduce coding [53]. Similarly, many solutions have been developed to address MapReduce’s gaps, such as delayed data loading and data reuse. Among those tools is Starfish, a Hadoop-based framework that aims to improve the performance of MapReduce jobs through the use of data lifecycle analytics; it is also a self-tuning system that adapts to users’ needs and systems’ workloads without requiring users to configure or change the underlying settings or parameters [54]. Spark is an alternative to MapReduce that aims to overcome disk I/O limitations and improve the performance of prior solutions. The ability to perform in-memory computations is the main feature that distinguishes Spark, since it enables data to be cached in memory, removing the disk overhead limitation of MapReduce for iterative tasks [55].
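
To make the MapReduce programming model discussed above more concrete, the following is a minimal single-process sketch of the wordcount task in plain Python. It only imitates the map, shuffle, and reduce steps; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions rather than names taken from any framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in one record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values of each key to obtain the final count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over a collection of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data processing"]))
```

In a real cluster the map and reduce functions would run in parallel on different nodes and the shuffle would move data across the network; the sketch keeps everything in one process purely to show the shape of the model.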
Other programming models similar to MapReduce include Dryad, which is a distributed execution engine for running Directed Acyclic Graph-based (DAG) Big Data applications. While MapReduce only allows for a single set of input and output data, Dryad allows users to use any number of inputs and outputs [56]. Pregel is another tool capable of processing large-scale graphs for a variety of purposes, including network graph analysis and social networking services [57]. Finally, data processing technologies are available for streaming data as well. These technologies provide processing workflows that operate on data as they are acquired from their source, removing the requirement to convert them to batch data [58]. Examples of such tools are Storm [59], Flink [60], Spark Streaming [61], Samza [62], Apex [63], and Google Cloud Dataflow [64], among others.

2.4. Big Data Analysis

Big Data analysis is defined as the procedure for acquiring data from diverse sources, processing them to extract relevant patterns and insights, and distributing the results to the appropriate stakeholders [65,66]. Data analysis is classified into four (4) discrete types, which refer to: (i) descriptive analytics, which respond to the question “What happened?” and mine information from raw data; (ii) diagnostic analytics, which report on the past while attempting to answer the question “Why did it happen?”; (iii) predictive analytics, which answer the future-related questions “What will happen?” and “Why will it happen?”;
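
Returning to the selection idea described in the Introduction, where the framework is suggested on the basis of the execution times of earlier tasks on comparable data, a rough sketch is given below. This is not EverAnalyzer’s actual implementation: the `RunRecord` fields, the `size_ratio` rule for deciding which past datasets count as comparable, and the `recommend` function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    """Metadata of one finished task (hypothetical fields)."""
    framework: str     # e.g. "MapReduce", "Spark", "Mahout", "MLlib"
    task: str          # e.g. "wordcount"
    dataset_mb: float  # size of the processed dataset
    seconds: float     # measured execution time


def recommend(history: List[RunRecord], task: str, dataset_mb: float,
              candidates: List[str], size_ratio: float = 2.0) -> Optional[str]:
    """Return the candidate with the lowest mean runtime on comparable past runs.

    A past run is treated as comparable when it executed the same task on a
    dataset within a factor of `size_ratio` of the requested size (an assumed rule).
    """
    averages: Dict[str, float] = {}
    for framework in candidates:
        times = [
            record.seconds for record in history
            if record.framework == framework
            and record.task == task
            and dataset_mb / size_ratio <= record.dataset_mb <= dataset_mb * size_ratio
        ]
        if times:
            averages[framework] = mean(times)
    if not averages:
        return None  # no comparable history yet, so no suggestion can be made
    return min(averages, key=averages.get)


if __name__ == "__main__":
    history = [
        RunRecord("MapReduce", "wordcount", 100, 40.0),
        RunRecord("Spark", "wordcount", 120, 15.0),
        RunRecord("Spark", "wordcount", 5000, 300.0),
    ]
    print(recommend(history, "wordcount", 110, ["MapReduce", "Spark"]))
```

Because every completed task appends a record to the history, a scheme of this kind has more comparable runs to draw on over time, which is consistent with the paper’s observation that the share of correct suggestions grew as more datasets were processed.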
