Transcription

“Civilization advances by extending thenumber of important operations which wecan perform without thinking about them”Alfred North Whitehead, Professor atHarvard, 1910s1

Hands-on H4MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. LlorenteSpring Term 20202

Before We StartWhere We AreComputing Foundations for Computational and Data ScienceHow to use modern computing platforms in solving scientific problemsIntro: Large-Scale Computational and Data ScienceA. Parallel Processing FundamentalsB. Parallel ComputingC. Parallel Data ProcessingC1. Batch Data ProcessingC2. Dataflow ProcessingC3. Stream Data ProcessingWrap-Up: Advanced TopicsLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente3

CS205: ContentsAPPLICATION SOFTWAREApplication ParallelismProgram DesignOpenACCOptimizationOpenMPMPISparkProgramming ModelSlurmPlatformBIG DATABIG COMPUTEApplication SoftwareMap-ReduceYarnArchitectureCloud ComputingComputing ClusterLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente4

Before We StartWhere We AreConceptsPlatformProgrammingWeek 9: Batch Data Processing MapReduce3/233/243/253/26Lecture C1Lab I8Hands-on H4Batch DataProcessingMapReduceHadoop ClusterMapReduceProgramming3/27(Quiz & Reading)Week 10: Dataflow Processing Spark3/303/314/14/2Lecture C2Lab I9Hands-on H5DataflowProcessingSpark SingleNodeSparkProgramming4/3(Quiz & Reading)Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente5

ContextMapReduce Programming ModelThe programmer essentially only specifies two (sequential) functionsSTEP 1. MAP: map(k1,v1) list(k2,v2) Inputs data record and outputs a set of intermediate key-value pairs, each oftype k2 and v2 Types can be simple or complex user-defined objects Each map call is fully independent (no execution ordering, sync or comm)STEP 2. SUFFLING: Internal grouping of all intermediate pairs with same keytogether and passes them to the workers executing reduceSTEP 3. REDUCE: reduce(k2,list(v2)) list(k3,v3) Combines information across records that share this same intermediate key Each reduce call is fully independentLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente6

ContextMapReduce Programming ModelLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente7

Hands-on ExamplesRequirementsBoth the mapper and the reducer should be python executable scripts thatread the input from stdin (line by line) and emit the output to stdout cat files mapper.py sort reducer.py1. Unix-like shell (Linux, Mac OS or Windows/Cygwin)2. Python installedLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente8

RoadmapMapReduce Design PatternsDesign PatternsSummarizationInverted IndexFilteringOther PatternsLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente9

Design PatternsWhat Are Design Patterns?ü Reusable solutions to problems (HWC!)ü Domain independentü Not a cookbookü Not a guideü Not a finished solutionLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente10

Design PatternsWhy Design Patterns?ü Makes the intent of model and platformeasier to understandü Provides a common language for solutionsü Be able to reuse codeü Describes known performance profiles andlimitations of solutionsLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente11

Design PatternsWhen Should I Use MapReduce?Query Index and Search: inverted index Filtering ClassificationAnalytics Summarization and statistics Sorting and merging Frequency distribution SQL-based queries: group-by, having, etc. Generation of graphics: histograms, scatter plots. . . large datasets in off-line mode for boosting otheron-line processesLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente12

Design PatternsMain Functions and PatternsMain Patterns1.Summarization2.Inverted Index3.FilteringLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente13

SummarizationCalculating Aggregate Statistical ValuesDescription A general pattern for calculating aggregate statistical values overyour dataIntent Group records together by a key field and calculate a numericalaggregate per group to get a top-level view of the larger data setExamples1. Word count2.Record count3.Min/Max/Count4.Average/Median/Standard deviation5. .Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente14

SummarizationWord CountFind the frequency of each word in text files Map: Process lines and generate as output word, 1 Reduce: Add all values for the same wordmapreduceinput: [line of text file]for each wordoutput: word, 1 input: [ word, 1 ]count for same wordoutput: word, sum Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente15

SummarizationWord Countmapper.pyreducer.pyLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente16

SummarizationRecord CountFind the frequency of each URL in web logs Map: Process web page access logs and generate URL, 1 asoutput Reduce: Add all values for the same URL64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] "GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1" 200 414064.242.88.10 - - [07/Mar/2004:16:39:24 -0800] "GET /twiki/bin/view/Main/TokyoOffice HTTP/1.1" 200 3853 mapreduceinput: [line of log file]input: [ URL, 1 ]for each line with a URLCount for same URLoutput: URL, 1 output: URL, # Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente17

SummarizationMax-MinGiven a list of tweets determine first and last time an user commentedand the number of times. Data is a set of lines username, date, text Peter [07/Mar/2020:16:39:24 -0800] “Stay at home”John [07/Mar/2020:16:39:25 -0800] “Me too” mapreduceinput: [ username, date,text ]for each lineoutput: username, date,1 input: [ username, date, 1 ]First, Last and Count forsame usernameoutput: username,first date, last date Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente18

SummarizationAverageFind average daily gains in stock for each company Data is a set of lines date, company, start price, end price This example is for company from 1/1/2000 – et,153.302917,159.870193,159.621811 mapreduceinput: [ date, company,start price, end price ]if date matchesoutput: [ company,end price-start price ]input: [ company,end price-start price ]Average for same companyoutput: company,average Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente19

Inverted IndexMapping Content to LocationDescription A general pattern for mapping content to locations such as wordsor numbers, to its locations in a database file or in a document or aset of documentsIntent Most of the text searching systems rely on inverted index to searchthe documents that contains a given word or a termLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente20

Inverted IndexWord to DocumentsFind what documents contain a specific word Map: Parse document and generate word, doc id pairs Reduce: For each word, sort the corresponding document IDsall id 432, id 76also id 432 mapreduceinput: [line from documentdoc id]for each wordoutput: word, doc id input: [ word, doc id ]concatenate for same wordoutput: word, [doc ids] Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente21

Hands-onWord to Documents – Inverted Indexü Implement word to documentsü Adapt mapper and reducer from wordcountü Pre-create a file with the file name as firstitem in each linehttps://goo.gl/dX1Kn7ü Extend to see the number of occurrences perfileLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente22

Inverted IndexReverse Web-link GraphFind where page links come from Map: Output target, source for each link to target in a page source Reduce: Concatenate the list of all source URLs associated with a targetURL sourcesXxxURL targetYyyzzzURL target, URL sourcesmapreduceinput: [line of HTML fileURL source]input: [ URL target,URL source ]for each URL targetconcatenate for sameURL targetoutput: URL target,URL source output: URL target,[URL sources] Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente23

FilteringFiltering Out RecordsDescription It evaluates each record separately and decides, based on somecondition, whether it should stay or goIntent Filter out records that are not of interest and keep ones that are.Examples1. Closer view of dataset2.Data cleansing3.Tracking a thread of events4.Simple random sampling5.Distributed Grep6.Removing low scoring dataset7.Log Analysis8.Data Querying and Validation9. Lecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente24

FilteringDistributed GrepSearch for words in a document Map: Generate a line if it matches a given pattern Reduce: Just copy the intermediate data to the outputmapreduceinput: [line of text file]if pattern matchesoutput: “”, line input: [ “”, line ]output: lineLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente25

Other PatternsOrganization, Join and Input/Outputü Summarization patterns: Get a top-level view by summarizing and grouping dataü Filtering patterns: View data subsets such as records generated from one userü Data organization patterns: Reorganize data to work with other systems, or to makeMapReduce analysis easierü Join patterns: Analyze different datasets together to discover interestingrelationshipsü Metapatterns: Piece together several patterns to solve multi-stage problems, or toperform several analytics in the same jobü Input and output patterns: Customize the way you use Hadoop to load or store dataLecture H4. MapReduce Design PatternsCS205: Computing Foundations for Computational ScienceDr. Ignacio M. Llorente26

Next Steps Get ready for next lecture:C2. Dataflow Processing (Tuesday 3/31)27

QuestionsMapReduce Design /home28

MapReduce Design Patterns CS205: Computing Foundations for Computational Science Dr. Ignacio M. Llorente 26 Other Patterns Organization, Join and Input/Output üSummarization patterns: Get a top-level view by summarizing and grouping data üFiltering patterns: