
What Is Big Data?

Craig C. Douglas, University of Wyoming


Presentation Transcript


  1. What Is Big Data? Craig C. Douglas, University of Wyoming

  2. What Is Big Data?... It Depends

  3. What Is Big Data?... It Depends • What if time counts? • Given a time period t, • How much data can be read and written? • This changes over time as technology changes. • What if the quantity of data counts? • How long does it take to read and write data? • This changes over time as technology changes. • Definition of Big Data is fluid, not static.

  4. Some Sources of Big Data • Interactions with dynamic databases • Internet data • City or regional transportation flow control • Environment and disaster management • Oil/gas fields or pipelines, seismic imaging • Credit cards and online businesses • Government or industry regulation/statistics • Dynamic data-driven apps

  5. Why is Big Data a Hot Topic? • Open positions in data analytics by 2020 (USA) • up to 200,000 open positions • might only be 140,000 open positions • The Bureau of Labor Statistics projects that 70% of all newly created jobs across all STEM fields during the 2010s, • across engineering, the physical sciences, the life sciences, and the social sciences, • will be in computer science

  6. Unprecedented Opportunities • Significant contributions to the development of these transformative technologies have been made from diverse fields including: • mathematics, • natural sciences • engineering • social sciences • arts and entertainment industries • business world

  7. Unprecedented Opportunities • Algorithm and software development have belonged to computer science for the past 50 years: • Computer science researchers have designed and implemented the algorithms and data structures, languages, models, tools, and abstractions that have enabled these transformational technology developments

  8. Quick summary • Simulation-oriented computational science is transformational science, but is only a niche in the grand scheme of things. • Big data computing capabilities must be broadly available in any institution that strives to compete in the coming decade. • If not, an institution will simply cease to be competitive, similar to not joining the ARPAnet or CSnet in the 1970s and 1980s.

  9. Similarities in Sentences in Big Files

  10. Big File Format • One line per sentence with no punctuation • Each word is separated by one blank • All lower case • Multiple languages and gibberish • Watch for an extra blank at end of some lines

  11. Goals • In the big file of sentences: • Eliminate similar sentences • Find similar sentences of some distance or less • Either goal is hard work if the file has enough sentences • Both goals are about equally hard • The methods in Chapter 3 of the data mining book by Ullman et al. are useful

  12. Goal 1 • Eliminate all duplicate lines (distance 0) • Eliminate all sentences of distance 1 • Two sentences S1 and S2 are within distance n if S1 can be transformed into S2 by adding, removing, or substituting at most n words. • What happens if you eliminate sentence Si because of sentence Si-j, but you later find a sentence Sk that has distance 0 or 1 from Si? • Need to define how you handle this case.
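The distance on this slide reads like a word-level edit (Levenshtein) distance. A minimal sketch under that assumption follows; the function names `word_edit_distance` and `within_distance` are illustrative and do not come from the slides.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words.

    Each add, remove, or substitute of a single word costs 1.
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # remove wa
                            curr[j - 1] + 1,      # add wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_distance(s1, s2, n):
    """True if s1 can be turned into s2 with at most n word edits."""
    a, b = s1.split(), s2.split()
    # Sentences whose word counts differ by more than n cannot be within n.
    if abs(len(a) - len(b)) > n:
        return False
    return word_edit_distance(a, b) <= n
```

Comparing every pair this way is quadratic in the number of sentences, so it is only practical for the small test files; the length check hints at how candidates could be pruned on larger inputs.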

  13. Goal 2 • List all sentences that have duplicates. • List all sentences that have other sentences at distance 1 • List the first one followed by all distance 0 or 1 sentences related to it • Can do as separate lists or just one • Should be sorted • Redo for distance n
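For the distance 0 part of this goal, one possible approach is to bucket identical lines in a hash table. The sketch below assumes the whole file fits in memory and reports 1-based line numbers; `duplicate_groups` is a hypothetical helper name, and distance 1 or n would need a further step.

```python
from collections import defaultdict


def duplicate_groups(lines):
    """Return sorted (sentence, line_numbers) pairs for repeated sentences."""
    groups = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        # Re-joining the split words drops a stray blank at the end of a line.
        key = " ".join(line.split())
        groups[key].append(lineno)
    return sorted((s, nums) for s, nums in groups.items() if len(nums) > 1)
```

Usage: `duplicate_groups(open(path, encoding="utf-8"))` would list each repeated sentence once, followed by every line on which it occurs.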

  14. Preprocessing • Read the whole file and build a dictionary that assigns each distinct word a natural number as its index: • Given sentence one here as the first one • 1 2 3 4 5 6 7 3 • Next sentence after sentence one • 8 2 9 2 3 • And so on • 10 11 12
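The preprocessing step can be sketched as a single pass that assigns indices in order of first appearance. This is an illustration only; `encode_sentences` is a made-up name, and numbering from 1 is an assumption taken from the example on the slide.

```python
def encode_sentences(lines):
    """Replace each word by a natural-number index.

    Returns (word_ids, encoded), where word_ids maps word -> index and
    encoded holds one list of indices per input line.
    """
    word_ids = {}
    encoded = []
    for line in lines:
        ids = []
        for word in line.split():  # split() ignores an extra trailing blank
            if word not in word_ids:
                word_ids[word] = len(word_ids) + 1  # indices start at 1
            ids.append(word_ids[word])
        encoded.append(ids)
    return word_ids, encoded


sentences = [
    "given sentence one here as the first one",
    "next sentence after sentence one",
    "and so on",
]
word_ids, encoded = encode_sentences(sentences)
```

Working on lists of small integers rather than strings makes later comparisons and hashing cheaper, which matters once the file reaches millions of sentences.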

  15. Implementation Suggestions • Use hash tables of considerable size • Hash table size should be a prime number • Build and debug your code with small files • Start with < 10 sentences • Next try 100, 1000, and 10,000 sentences • Then try 17,788,002 sentences • Consider using Hadoop (requires knowledge of Java, however) or MR-MPI (C/C++)

  16. Tricky Part • Build a code to do Goal 1 or 2. Notes: • Shingling and minhash do not work well for edit distance • Two approaches: • Try Jaccard similarity or distance methodology on sentences considered as sets of words • Modify index-based and length-based methods
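Two building blocks for the approaches named on this slide can be sketched as follows: Jaccard similarity on sentences treated as sets of words, and a length-based grouping that limits which sentences need to be compared at all. Both function names are hypothetical, and treating two empty sentences as identical is an assumption.

```python
from collections import defaultdict


def jaccard_similarity(s1, s2):
    """|A intersect B| / |A union B| for the word sets of two sentences."""
    a, b = set(s1.split()), set(s2.split())
    if not a and not b:
        return 1.0  # assumption: two empty sentences count as identical
    return len(a & b) / len(a | b)


def bucket_by_length(lines):
    """Group line positions (0-based) by their number of words.

    Sentences within word-edit distance n differ in length by at most n,
    so only buckets whose lengths differ by at most n need comparing.
    """
    buckets = defaultdict(list)
    for i, line in enumerate(lines):
        buckets[len(line.split())].append(i)
    return dict(buckets)
```

Jaccard distance is `1 - jaccard_similarity(s1, s2)`. Because sets ignore word order and repetition, a high similarity only marks a candidate pair; it does not replace the word-level distance check.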

  17. Generalizing • Substitute n for 1 • Not much extra work to do so • Instead of looking at sentences of word length difference 1, look at ones of difference up to n • Makes a much more useful program • Take arbitrary sentences • Convert to one per line, each word separated by one blank • Take lower and upper case into account and convert to all lower case as preprocessing
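The conversion of arbitrary text into the one-sentence-per-line format could look like the sketch below. The sentence boundary rule (split on `.`, `!`, or `?`) and the name `normalize` are assumptions made for illustration; real text with abbreviations or apostrophes would need a more careful splitter.

```python
import re


def normalize(text):
    """Turn free text into lower-case sentences with single blanks.

    Punctuation is dropped and each returned string is one sentence.
    """
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        # \w+ keeps letters, digits, and underscores, including non-ASCII.
        words = re.findall(r"\w+", raw.lower())
        if words:
            sentences.append(" ".join(words))
    return sentences
```

Usage: `"\n".join(normalize(text))` yields a file body with one sentence per line and no trailing blanks.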

  18. Some Interesting Problems • An Open Source, secure Hadoop replacement suitable for hospitals and medical records. • Must be HIPAA compliant. • Must scale well for very large databases. • Must have individual access capabilities. • Must not have complexity O(disk access) on a DFS. • Should use OpenMP and MPI. • Should use cache-aware hashing methods. • Will be useful well beyond medical records.

  19. Some Interesting Problems • Dynamic Data-Driven Application Systems and Big Data • A natural fit and there is no agreed upon software for DDDAS or DDDAS-BD or DBDDAS. DDDAS has been applied to many, many fields. • DDDAS researchers agree something should be produced: not considered an application and too applied to be considered networking research. • Need to find a niche or a program officer in a funding agency willing to think outside of the box. • Many Big Data issues long common to DDDAS.

  20. Some Interesting Problems • Sensors and telemetry • SensorML was supposed to provide a standard way of describing sensor data and be able to get the data and deliver it to applications. It went commercial ($$$...$$$) after the original PI retired. • A true Open Source, internationally recognized standard would benefit one area of Big Data and DDDAS.

  21. Some Interesting Problems • Reservoirs (oil, gas, water) • Dynamic reservoir meshing • Vertical wells with micro sensors provide updates to fracked reservoirs. • Speed up the meshing so that it can be included in reservoir simulator time (e.g., go from a year to a day). • Dynamically improve predictions. • Corporate oil/gas fields or pipelines (even small ones) produce excessive amounts of data • Open Source data mining tools for specific problems

  22. Some Interesting Problems • Audio and photographic data mining • World’s largest databases based on VoIP and phone monitoring by many governments (e.g., P.R. China, France, Germany, Kingdom of Saudi Arabia, United Kingdom, USA, …). • Keeps disk drive makers in business and lowers hard disk prices very significantly. • Another problem: Find all file duplicates in a file system efficiently. Similar to sentence problem earlier. • Has commercial (e.g., Bing, satellite transmission) and research ramifications that are not nefarious.
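The file duplicate problem mentioned on this slide can be approached much like the distance 0 sentence case: hash the contents and group equal hashes. The sketch below is one common way to do it, not a method taken from the slides; grouping by file size first and the choice of SHA-256 are assumptions, and `find_duplicate_files` is a hypothetical name.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Return sorted groups of paths under root with identical contents."""
    # Pass 1: files of different sizes cannot be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)  # read in 1 MiB chunks
            by_hash[digest.hexdigest()].append(path)

    return sorted(sorted(group) for group in by_hash.values() if len(group) > 1)
```

The size pass avoids reading most files at all, which is where the bulk of the disk time would otherwise go.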
