Big Data
Cross 11, Tapovan Enclave, Nalapani Road, Dehradun 248001
Email: info@iskd.in
Contact: +918979066357, +919027669947
Big Data Introduction
"Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.
Big data was originally associated with three key concepts: volume, variety, and velocity. Other concepts later attributed to big data are veracity (i.e., how much noise is in the data) and value.
Characteristics of big data
• Machine learning: big data often doesn't ask why and simply detects patterns.
• Digital footprint: big data is often a cost-free byproduct of digital interaction.
• Business intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.
• Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density, in order to reveal relationships and dependencies or to predict outcomes and behaviors.
Big data can be described by the following characteristics:
Volume
• The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
Variety
• The type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio and video; it also completes missing pieces through data fusion.
Velocity
• Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.
Veracity
• Veracity extends the definition of big data to cover data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of analysis.
Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. For example, to manage a factory one must consider both visible and invisible issues with various components. Information generation algorithms must detect and address invisible issues such as machine degradation and component wear on the factory floor.
Technologies
• Techniques for analyzing data, such as A/B testing, machine learning and natural language processing
• Big data technologies, like business intelligence, cloud computing and databases
• Visualization, such as charts, graphs and other displays of the data
• Multidimensional big data can also be represented as data cubes or, mathematically, tensors. Array database systems have set out to provide storage and high-level query support for this data type.
Additional technologies being applied to big data include efficient tensor-based computation (such as multilinear subspace learning), massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud and HPC-based infrastructure (applications, storage and computing resources) and the Internet. Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data.
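To make the data cube idea concrete, here is a minimal sketch in plain Python. The fact rows, the dimension names (region, product, quarter) and the `roll_up` helper are all hypothetical and chosen only for illustration; a real array database would provide this kind of aggregation through its own query language.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, units sold).
facts = [
    ("north", "widget", "Q1", 10),
    ("north", "widget", "Q2", 12),
    ("north", "gadget", "Q1", 5),
    ("south", "widget", "Q1", 7),
    ("south", "gadget", "Q2", 9),
]

# The cube maps a coordinate (region, product, quarter) to a measure.
cube = defaultdict(int)
for region, product, quarter, units in facts:
    cube[(region, product, quarter)] += units


def roll_up(cube, keep):
    """Sum out every dimension whose index is not listed in `keep`."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


by_region = roll_up(cube, keep=[0])             # collapse product and quarter
by_region_quarter = roll_up(cube, keep=[0, 2])  # collapse product only
print(by_region)
print(by_region_quarter)
```

Each roll-up is the same as summing a three-dimensional tensor along one or more axes, which is why cubes and tensors are mentioned together.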
Applications
• Big data has increased the demand for information management specialists so much that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.
• Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data, which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e., in the form of video and audio content).
• While many vendors offer off-the-shelf solutions for big data, experts recommend developing in-house solutions custom-tailored to the company's problem at hand if the company has sufficient technical capabilities.
Government
The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation, but does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome.
Manufacturing
Based on the TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for transparency in the manufacturing industry, which is the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires a vast amount of data and advanced prediction tools to systematically process data into useful information.
Healthcare
Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms and patient registries, and fragmented point solutions.
Media
To understand how the media utilizes big data, it is first necessary to provide some context on the mechanisms used in the media process. Nick Couldry and Joseph Turow have suggested that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows, and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset.
Insurance
Health insurance providers are collecting data on social "determinants of health" such as food and TV consumption, marital status, clothing size and purchasing habits, from which they make predictions on health costs in order to spot health issues in their clients. It is controversial whether these predictions are currently being used for pricing.
Internet of Things
Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device interconnectivity. Such mappings have been used by the media industry, companies and governments to more accurately target their audience and increase media efficiency. IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical, manufacturing and transportation contexts.
Case Studies
India
• Big data analysis was tried out for the BJP to win the 2014 Indian general election.
• The Indian government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.
Israel
• A big data application was designed by Agro Web Lab to aid irrigation regulation.
• Personalized diabetic treatments can be created through GlucoMe's big data solution.
United Kingdom
Examples of uses of big data in public services:
• Data on prescription drugs: by connecting the origin, location and time of each prescription, a research unit was able to demonstrate the considerable delay between the release of any given drug and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient.
• Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as "meals on wheels". The connection of data allowed the local authority to avoid any weather-related delay.
United States of America
• Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.
• The United States Federal Government owns five of the ten most powerful supercomputers in the world.
Sports
• Big data can be used to improve training and the understanding of competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics. Future performance of players could be predicted as well. Thus, players' value and salary are determined by data collected throughout the season.
• In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency. Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, race teams use big data to try to predict the time in which they will finish the race, based on simulations using data collected over the season.
Technology
• eBay.com uses two data warehouses at 7.5 PB and 40 PB as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
• Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
• Facebook handles 50 billion photos from its user base. As of June 2017, Facebook reached 2 billion monthly active users.
• Google was handling roughly 100 billion searches per month as of August 2012.
Sampling Big Data
An important research question about big data sets is whether you need to look at the full data to draw certain conclusions about its properties, or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. But statistical sampling enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. For example, there are about 600 million tweets produced every day. Is it necessary to look at all of them to determine the topics that are discussed during the day? Is it necessary to look at all the tweets to determine the sentiment on each of the topics?
In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage and controller data are available at short time intervals. To predict downtime it may not be necessary to look at all the data; a sample may be sufficient.
Big data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data. With large sets of data points, marketers are able to create and utilize more customized segments of consumers for more strategic targeting.
There has been some work done on sampling algorithms for big data. A theoretical formulation for sampling Twitter data has been developed.
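One common way to draw a uniform sample from a stream too large to hold in memory is reservoir sampling (Algorithm R). The slides do not name a specific algorithm, so this is a swapped-in illustration, not the Twitter formulation mentioned above; the `reservoir_sample` function and the synthetic "tweet" stream are assumptions made for the sketch.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Synthetic stand-in for a day's tweets; only one pass over the stream is needed.
tweets = (f"tweet-{n}" for n in range(100_000))
sample = reservoir_sample(tweets, k=1000)
print(len(sample))
```

The sample needs memory proportional to k only, regardless of how long the stream is, which is what makes the approach attractive for sensor or social media feeds.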
Big Data Ethics
Big data ethics, also known simply as data ethics, refers to systemising, defending, and recommending concepts of right and wrong conduct in relation to data, in particular personal data. Since the dawn of the Internet, the sheer quantity and quality of data have dramatically increased and continue to do so exponentially. Big data describes data so voluminous and complex that traditional data processing application software is inadequate to deal with it. Recent innovations in medical research and healthcare, such as high-throughput genome sequencing, high-resolution imaging, electronic medical patient records and a plethora of internet-connected health devices, have triggered a data deluge that will reach the exabyte range in the near future. Data ethics is of increasing relevance as the quantity of data increases, because of the scale of the impact.
Data ethics is concerned with the following principles:
1. Ownership - Individuals own their own data.
2. Transaction Transparency - If an individual's personal data is used, they should have transparent access to the algorithm design used to generate aggregate data sets.
3. Consent - If an individual or legal entity would like to use personal data, one needs informed and explicitly expressed consent from the owner of the data about what personal data moves to whom, when, and for what purpose.
4. Privacy - If data transactions occur, all reasonable effort needs to be made to preserve privacy.
5. Currency - Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
6. Openness - Aggregate data sets should be freely available.
Big Data Maturity Model
Big Data Maturity Models (BDMM) are the artifacts used to measure big data maturity. These models help organizations to create structure around their big data capabilities and to identify where to start. They provide tools that assist organizations to define goals around their big data program and to communicate their big data vision to the entire organization. BDMMs also provide a methodology to measure and monitor the state of a company’s big data capability, the effort required to complete its current stage or phase of maturity, and the effort required to progress to the next stage. Additionally, BDMMs measure and manage the speed of both the progress and adoption of big data programs in the organization.
The goals of BDMMs are:
• To provide a capability assessment tool that generates specific focus on big data in key organizational areas
• To help guide development milestones
• To avoid pitfalls in establishing and building big data capabilities
Data analysis
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.
Modeling and algorithms
• Mathematical formulas or models, called algorithms, may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).
• Inferential statistics includes techniques to measure relationships between particular variables. For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y).
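The advertising and sales example can be sketched with an ordinary least-squares line fit. The numbers below are made up purely for illustration, and `fit_line` is a hypothetical helper; the point is only to show the decomposition Data = Model + Error.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Illustrative, invented figures: advertising spend (X) and sales (Y).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(advertising, sales)
model = [intercept + slope * a for a in advertising]  # fitted values
error = [s - m for s, m in zip(sales, model)]         # residuals

# Data = Model + Error holds for every observation by construction.
for s, m, e in zip(sales, model, error):
    print(f"data={s:5.2f}  model={m:5.2f}  error={e:+.2f}")
```

A positive slope here says only that sales rise with advertising in this sample; on its own it indicates correlation, and establishing causation needs further evidence.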
Data Product
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.
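As a rough sketch of such a data product, the snippet below recommends items that were frequently bought together with what a customer already owns. The purchase history, the item names and both helper functions are hypothetical; production recommenders use far richer models.

```python
from collections import Counter
from itertools import combinations


def build_co_purchase_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def recommend(counts, owned, top_n=2):
    """Score items not yet owned by how often they co-occur with owned items."""
    scores = Counter()
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase history: one list of items per order.
history = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["stove", "fuel"],
    ["tent", "stove", "fuel"],
]

counts = build_co_purchase_counts(history)
print(recommend(counts, {"tent"}))
```

The output feeds back into the environment in the sense described above: what is recommended influences what is bought next, which in turn changes the purchase history.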
Data Curation
• Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data". In science, data curation may indicate the process of extracting important information from scientific texts, such as research articles by experts, and converting it into an electronic format, such as an entry in a biological database.
• In the modern era of big data, the curation of data has become more prominent, particularly for software processing high-volume and complex data systems. The term is also used in historical contexts and the humanities, where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation.
Data Defined Storage
• Data defined storage (also referred to as a data centric approach) is a marketing term for managing, protecting, and realizing value from data by uniting application, information and storage tiers. This is achieved through a process of unification, where users, applications and devices gain access to a repository of captured metadata that empowers organizations to access, query and manipulate the critical components of the data to transform it into information, while providing a flexible and scalable platform for storage of the underlying data.
Core technology
• Data defined storage focuses on metadata with an emphasis on the content, meaning and value of information over the media, type and location of data. Data centric management enables organizations to take a single, unified approach to managing data across large, distributed locations.
Data Lineage
• Data lineage includes the data's origins, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.
• It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis. "Lineage is a simple type of why provenance."
• Data lineage can be represented visually to discover the data flow/movement from its source to destination via various changes and hops on its way in the enterprise environment, how the data gets transformed along the way, how the representation and parameters change, and how the data splits or converges after each hop.
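Tracing an error back to its root cause amounts to walking a dependency graph upstream. The sketch below assumes lineage is recorded as a simple mapping from each dataset to its direct inputs; the dataset names and both functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage records: dataset -> datasets it was derived from.
lineage = {
    "sales_report": ["clean_orders", "customers"],
    "clean_orders": ["raw_orders"],
    "customers": ["crm_export"],
    "raw_orders": [],
    "crm_export": [],
}


def upstream(dataset, graph):
    """Every dataset that influences `dataset`, nearest hops first."""
    seen, order = set(), []
    queue = deque(graph.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # data that converges from several paths is listed once
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


def root_sources(dataset, graph):
    """Original sources: upstream datasets with no recorded inputs."""
    return [d for d in upstream(dataset, graph) if not graph.get(d)]


print(upstream("sales_report", lineage))
print(root_sources("sales_report", lineage))
```

If a figure in `sales_report` looks wrong, the upstream list gives the order in which to inspect intermediate datasets, and the root sources show where the data originally entered the system.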
Data Philanthropy
Data philanthropy describes a form of collaboration in which private sector companies share data for public benefit. Multiple uses of data philanthropy are being explored, including humanitarian, corporate, human rights, and academic uses. A large amount of data collected from the Internet comes from user-generated content. This includes blogs, posts on social networks, and information submitted in forms. Besides user-generated data, corporations are also currently mining data from consumers in order to understand customers, identify new markets, and make investment decisions.
Data Quality
Data quality refers to the condition of a set of values of qualitative or quantitative variables. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Alternatively, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as data volume increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. Data cleansing may be required in order to ensure data quality.
Data Quality Control
Data quality control is the process of controlling the usage of data for an application or a process. This process is performed both before and after a data quality assurance (QA) process, which consists of the discovery of data inconsistency and its correction.
Data Science
• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is closely related to data mining and big data: "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems".
• Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.
Datafication
Datafication is a modern technological trend turning many aspects of our life into data, which is subsequently transferred into information realised as a new form of value. Kenneth Cukier and Viktor Mayer-Schönberger introduced the term datafication to the broader lexicon in 2013. Up until this time, datafication had been associated with the analysis of representations of our lives captured through data, but not on the scale that we now see. This change was primarily due to the impact of big data and the computational opportunities it afforded.
Examples
Human Resources
• Data obtained from mobile phones, apps or social media usage is used to identify potential employees and their specific characteristics, such as risk-taking profile and personality. This data will replace personality tests. Rather than using traditional personality tests or exams that measure analytical thinking, using the data obtained through datafication will change existing exam providers. Also, new personality measures will be developed with this data.
Insurance and Banking
• Data is used to understand an individual’s risk profile and likelihood to pay a loan.
Customer Relationship Management
• Various industries are using datafication to understand their customers better and create appropriate triggers based on each customer’s personality and behaviour. This data is obtained from the language and tone a person uses in an email, phone call or social media.
Design Methods
Bottom-up design
In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. These data marts can then be integrated to create a comprehensive data warehouse. The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.
Top-down design
The top-down approach is designed using a normalized enterprise data model. "Atomic" data, that is, data at the greatest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.
Hybrid design
Data warehouses (DW) often resemble the hub and spokes architecture. Legacy systems feeding the warehouse often include customer relationship management and enterprise resource planning, generating large amounts of data. To consolidate these various data models, and facilitate the extract-transform-load process, data warehouses often make use of an operational data store, the information from which is parsed into the actual DW.
In-memory processing
In computer science, in-memory processing is an emerging technology for processing of data stored in an in-memory database. Older systems have been based on disk storage and relational databases using SQL query language, but these are increasingly regarded as inadequate to meet business intelligence (BI) needs. Because stored data is accessed much more quickly when it is placed in random-access memory (RAM) or flash memory, in-memory processing allows data to be analysed in real time, enabling faster reporting and decision-making in business.
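A small way to see the storage-location idea is Python's built-in `sqlite3` module, which can keep an entire database in RAM when opened with the special name `":memory:"`. SQLite is itself a relational SQL engine, so this sketch illustrates only the "data lives in memory" aspect, not a dedicated in-memory analytics platform; the table and figures are hypothetical.

```python
import sqlite3

# ":memory:" tells SQLite to keep the whole database in RAM; nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],  # hypothetical rows
)

# An aggregate of the kind a BI dashboard might request.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()
print(totals)
```

Because the data disappears when the connection closes, real systems pair in-memory processing with some form of persistence or reload step.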
Big Data IT Company
This is an alphabetical list of notable IT companies using the marketing term big data:
• Alpine Data Labs, an analytics interface working with Apache Hadoop and big data
• Azure Data Lake, a highly scalable data storage and analytics service hosted in Azure, Microsoft's public cloud
• Big Data Partnership, a professional services company based in London
• Big Data Scoring, a cloud-based service that lets consumer lenders improve loan quality and acceptance rates through the use of big data
• BigPanda, a technology company headquartered in Mountain View, California
• Compuverde, an IT company with a focus on big data storage
• CtrlShift, a Singapore-headquartered programmatic marketing company
• CVidya, a provider of big data analytics products for communications and digital service providers
• Cybatar
• Databricks, a company founded by the creators of Apache Spark
• Dataiku, a French computer software company
• DataStax
• Domo
• Fluentd
• Flytxt
• Greenplum
Small Data
• Small data is data that is "small" enough for human comprehension. It is data in a volume and format that makes it accessible, informative and actionable.
• The term "big data" is about machines and "small data" is about people. This is to say that eyewitness observations or five pieces of related data could be small data. Small data is what we used to think of as data. The only way to comprehend big data is to reduce the data into small, visually appealing objects representing various aspects of large data sets (such as histograms, charts, and scatter plots). Big data is all about finding correlations, but small data is all about finding the causation.
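Reducing a large data set to a small, readable object can be as simple as binning values into a histogram. The sketch below uses randomly generated readings as a stand-in for real data; the `histogram` helper and the bin width are assumptions chosen for illustration.

```python
import random
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Synthetic stand-in for a large data set: 100,000 integer readings in 0..99.
rng = random.Random(42)
readings = [rng.randint(0, 99) for _ in range(100_000)]

bin_width = 20
summary = histogram(readings, bin_width)

# Five rows are enough for a person to take in at a glance.
for start, count in summary.items():
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * (count // 1000)}")
```

A hundred thousand numbers become five counts, which is the sense in which big data is made comprehensible by turning it into small data.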
Statistics
• Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.
• When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole.
Machine Learning
• Machine learning models are statistical and probabilistic models that capture patterns in the data through the use of computational algorithms.
Statistics in society
• Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business. Statistical consultants can help organizations and companies that don't have in-house expertise relevant to their particular questions.
Surveillance Capitalism
• Surveillance capitalism has a number of meanings around the commodification of personal information. Since 2014, sociologist Shoshana Zuboff has used and popularized the term.
Background
• Economic pressures of capitalism are driving the intensification of connection and monitoring online, with spaces of social life becoming open to saturation by corporate actors, directed at the making of profit and/or the regulation of action. Relevantly, Turow writes that "centrality of corporate power is a direct reality at the very heart of the digital age". Capitalism has become focused on expanding the proportion of social life that is open to data collection and data processing. This may come with significant implications for vulnerability and control of society as well as for privacy.
Urban Informatics
• Urban informatics refers to the study of people creating, applying and using information and communication technology and data in the context of cities and urban environments. Urban informatics is a trans-disciplinary field of research and practice that draws on three broad domains: people, place and technology.
• People can refer to city residents, citizens and community groups from various socio-cultural backgrounds, as well as the social dimensions of non-profit organisations and businesses. The social research domains that urban informatics draws from include urban sociology, media studies, communication studies, cultural studies, city planning and others.
Further reading
Since Foth's 2009 "Handbook of Research on Urban Informatics", a number of books and special issues of academic journals have been published on the topic, which further demonstrate the increasing significance and notability of the field of urban informatics.
Very Large Database
• A very large database (originally written very large data base), or VLDB, is a database that contains a very large amount of data, so much that it can require specialized architectural, management, processing and maintenance methodologies.
Definition
• The vague adjectives very and large allow for a broad and subjective interpretation, but attempts at defining a metric and threshold have been made. Early metrics were the size of the database in a canonical form via database normalization, or the time for a full database operation like a backup. Technology improvements have continually changed what is considered very large.
There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. This absolute amount of data has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data. That said, VLDB issues may start to appear when 1 TB is approached, and are more than likely to have appeared as 30 TB or so is exceeded.
XLDB
XLDB (eXtremely Large Data Bases) is a yearly conference about data processing. The definition of extremely large refers to data sets that are too big in terms of volume (too much), and/or velocity (too fast), and/or variety (too many places, too many formats) to be handled using conventional solutions.
The main goals of this community include:
• Identify trends, commonalities and major roadblocks related to building extremely large databases
• Bridge the gap between users trying to build extremely large databases and database solution providers worldwide
• Facilitate development and growth of practical technologies for extremely large data stores