Transcription

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 4, No. 11, 2013

Inverted Indexing In Big Data Using Hadoop Multiple Node Cluster

Kaushik Velusamy, Dept. of CSE, Amrita University, Coimbatore, India
Nivetha Vijayaraju, Dept. of CSE, Amrita University, Coimbatore, India
Deepthi Venkitaramanan, Dept. of CSE, Amrita University, Coimbatore, India
Greeshma Suresh, Dept. of CSE, Amrita University, Coimbatore, India
Divya Madhu, Dept. of IT, Amrita University, Coimbatore, India

Abstract—Inverted indexing is an efficient, standard data structure, well suited for search operations over an exhaustive set of data. Such huge data sets are mostly unstructured and do not fit into traditional database categories. Large-scale processing of such data needs a distributed framework such as Hadoop, where computational resources can easily be shared and accessed. An implementation of a search engine in Hadoop over millions of Wikipedia documents using an inverted index data structure is carried out to make the search operation more effective. The inverted index data structure maps a word in a file, or a set of files, to its corresponding locations. A hash table is used in this data structure, which stores each word as an index and its corresponding locations as values, thereby providing easy lookup and retrieval of data and making it suitable for search operations.

Keywords—Hadoop; Big data; inverted indexing; data structure

I. INTRODUCTION

Wikipedia is an online encyclopaedia which contains over four million articles. In general, searching over such text-based documents involves document parsing, indexing, tokenisation, language recognition, format analysis and section recognition. Hence, a search engine for such large data, built on a single node with a single forward index over all the documents, will consume more time for searching. Moreover, the memory and processing capacity of a single node may not be sufficient for this operation. Hence, load balancing by distributing the documents over multiple nodes becomes necessary.

Google processes 20 PB of data every day using a programming model called MapReduce. Hadoop, a distributed framework that processes big data, is an implementation of MapReduce, and is therefore well suited to this operation, in which processing is carried out over millions of text documents.

An inverted index is used in almost all web and text retrieval engines today to execute a text query. On a user query, these search engines use the inverted index to return the documents matching the query, giving hyperlinks to the corresponding documents. As these indices contain the vocabulary of words in dictionary order, only the small set of documents containing the queried terms needs to be processed.

Here, the design of a search engine for the Wikipedia data set, using a compressed inverted index data structure over the Hadoop distributed framework, is proposed. This data set, containing more than four million files, needs an efficient search engine for quick access to data. No compromise must be made on the search results or on the response time, and care should be taken not to omit documents that contain words synonymous with the user query. Since accuracy and speed are of primary importance in search, our methods could be favoured in such cases.

II. LITERATURE SURVEY

For large-scale data-intensive applications like data mining and web indexing, MapReduce has become an important distributed processing model, and Hadoop, an open-source implementation of MapReduce, is widely used for short jobs requiring low response time [2].
Both the homogeneity and data locality assumptions are not satisfied in virtualized data centres, and [2] shows that ignoring the data locality issue in heterogeneous environments can noticeably reduce MapReduce performance. The authors also address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, their data placement scheme adaptively balances the amount of data stored in each node to improve data-processing performance. Experimental results on two real data-intensive applications show that their data placement strategy consistently improves MapReduce performance by rebalancing data across the nodes before a data-intensive application is run in a heterogeneous Hadoop cluster. The new mechanism distributes fragments of an input file to heterogeneous nodes based on their computing capacities, and the approach improves the performance of heterogeneous Hadoop clusters.
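As a rough picture of the capacity-proportional placement idea described in [2], the following Python sketch distributes file fragments in proportion to assumed per-node capacity scores. The node names and capacity figures are invented for illustration; this is not the authors' actual scheme.

    # Conceptual sketch: assign file fragments to nodes in proportion to an
    # assumed "computing capacity" score, so faster nodes hold more data.
    # Node names and capacities below are purely illustrative.

    def place_fragments(num_fragments, capacities):
        """Return {node: number_of_fragments} proportional to capacity."""
        total = sum(capacities.values())
        placement = {node: int(num_fragments * cap / total)
                     for node, cap in capacities.items()}
        # Hand out any fragments lost to rounding, largest-capacity nodes first.
        leftover = num_fragments - sum(placement.values())
        for node in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
            placement[node] += 1
        return placement

    if __name__ == "__main__":
        nodes = {"node-a": 4.0, "node-b": 2.0, "node-c": 1.0}   # relative speeds (assumed)
        print(place_fragments(700, nodes))   # {'node-a': 400, 'node-b': 200, 'node-c': 100}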

According to [1], a virtualized setup of a Hadoop cluster provides greater computing capacity with fewer resources, since a virtualized cluster requires fewer physical machines: the master node of the cluster is set up on a physical machine, and the slave nodes run on virtual machines (VMs). The virtualized Hadoop clusters are configured to use the capacity scheduler instead of the default FIFO scheduler. The capacity scheduler schedules tasks based on the availability of RAM and virtual memory (VMEM) in the slave nodes before allocating any job. Instead of queuing up the jobs, the tasks are efficiently allocated to the VMs based on the memory available. Various configuration parameters of Hadoop are analysed and the virtualized cluster is fine-tuned for best performance and maximum scalability. The results show that the approach gives a significant reduction in execution times, which in turn shows that virtualization helps in better utilization of the resources of the physical machines used. Given the relatively low power of the machines used in the real cluster, the results were fairly relevant, and the addition of more machines to the cluster leads to an even greater reduction in runtime.

According to [8], Hadoop, the emerging technology, made it feasible to process immense data sets in combination with virtualisation. A method to deploy a cloud stack with MapReduce and Hadoop in a virtualised environment was presented in that paper, along with a brief description of setting up a Hadoop experimental environment to capture the current status and trends of optimising Hadoop in virtualised environments. The advantages and disadvantages of the virtualised environment were discussed, ending with the benefits of one compromise over the other. Using a virtualised environment, Hadoop fully utilises the computing resources, becomes more reliable, saves power, and so on; on the other side, one has to accept the lower performance of virtual machines. Some methods to optimise Hadoop in virtual machines were then discussed.

III. PROBLEM STATEMENT

The result of any user's search query must be returned quickly and should not miss any relevant data related to the query. A search engine designed using a distributed framework like Hadoop together with an inverted index data structure is fast and returns all the relevant results. In order to do this, and to analyse the feasibility of deploying a search engine for Wikipedia, the various requirements and parameters to be considered must be well understood and analysed.

IV. PARAMETERS FOR PERFORMANCE METRICS

The performance of a search operation through an inverted index built over millions of Wikipedia documents, distributed over a multiple node Hadoop cluster on virtual nodes, can be measured using parameters such as response time, throughput, speed-up, latency hiding, computation time, communication time and, thereby, the computation-to-communication ratio. In terms of the search operation on this distributed and parallel platform, response time indicates the time taken for the first of the relevant wiki documents to appear when a query is made; a small timing sketch for this and for throughput is given below.
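The fragment below is a minimal timing sketch for response time and throughput (throughput is defined just after). The search() function here is a stand-in placeholder, not the engine described in this paper, and the figures it produces are purely illustrative.

    import time

    def search(query):
        """Placeholder for the real distributed search; sleeps to mimic work."""
        time.sleep(0.01)
        return ["doc-42"]          # pretend result list

    def measure(queries):
        start = time.perf_counter()
        first_response = None
        for q in queries:
            t0 = time.perf_counter()
            search(q)
            if first_response is None:
                first_response = time.perf_counter() - t0   # response time of first query
        elapsed = time.perf_counter() - start
        throughput = len(queries) / elapsed                  # queries completed per second
        return first_response, throughput

    if __name__ == "__main__":
        rt, tput = measure(["hadoop", "inverted index", "wikipedia"])
        print(f"response time: {rt:.3f} s, throughput: {tput:.1f} queries/s")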
Throughput can be defined as the number of transactions per second, that is, the maximum number of search queries that can be served per second. The speed-up factor refers to the time that can be saved because a fraction of the process can be parallelized. As the documents are distributed across multiple nodes, the percentage of the search operation that can be parallelized, and thereby the speed-up achieved, can be measured [6]:

Speed-up factor S = ts / tp

where ts is the time taken for serial execution of the process and tp is the time taken after parallelization. For example, a search that takes 40 s serially but 10 s after parallelization gives a speed-up factor of 4.

As more time is consumed in the start-up of communication between nodes, using this time effectively to complete as much computation as possible improves performance. This can be achieved via non-blocking send routines, thereby helping to achieve latency hiding. Sometimes even blocking send routines allow computations to take place until the expected messages reach the destination, also aiding latency hiding. The total processing time includes the computations and the communications carried out:

T_process = T_computation + T_communication

The computation time for the search operation can be calculated by counting the number of computations per process; computation involves locating the node that has the relevant documents [9]. Communication time depends on the size of the data transferred, the start-up time for each message, the number of messages in a process and the mode of data transfer. Communication in a multiple node cluster involves requesting a node for certain documents based on the query and the nodes responding with the requested documents:

T_communication = T_startup + w * T_data

where T_startup is the time needed to send a blank message, T_data is the time to send or receive a single data word, and w is the number of data words. The computation-to-communication ratio (T_computation / T_communication) throws light on how much time communication takes as the amount of data increases.

V. INVERTED INDEXING

Indexing refers to creating a link or a reference to a set of records so as to enable better identification or location. Forward indexing and inverted indexing are the two main types of indexing. When an element, say 97, is accessed through its index, say Arr[3] in Fig. 1, that is forward indexing. When the same element is searched for based on its occurrence, or its number of occurrences, that is inverted indexing.

Fig. 1. Illustration of Forward and Inverted Indexing
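The distinction in Fig. 1 can be made concrete with a small Python fragment (the array contents are arbitrary): a forward lookup maps a position to an element, while an inverted lookup maps an element back to the positions where it occurs.

    # Toy version of Fig. 1: forward lookup (index -> element) versus
    # inverted lookup (element -> list of positions where it occurs).
    arr = [12, 5, 40, 97, 33, 97]

    # Forward indexing: position 3 gives the element stored there.
    print(arr[3])                      # 97

    # Inverted indexing: build a map from each element to its positions.
    inverted = {}
    for pos, value in enumerate(arr):
        inverted.setdefault(value, []).append(pos)

    print(inverted[97])                # [3, 5] - where 97 occurs
    print(len(inverted[97]))           # 2 - how many times 97 occurs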

An inverted index for a document, or set of documents, contains a hash table with each word as its index and a postings list as the value of each index. A postings list consists of a document ID, the positions of the word in that document and the frequency of occurrence of the word in that document.

Fig. 2. Inverted Indexing - Working

If there are n documents to be indexed, then a unique document ID from 0 to n-1 is assigned to each document. The postings list for a term can be sorted based on various criteria. Though it is easy to sort it by document ID, other parameters are considered for search operations; sorting based on the frequency of a term in a document is more apt for search. At the end of this processing, the data structure returns the top k documents in the postings list, where k is the maximum number of results a search engine returns in a single stretch [7].

A. Algorithm

Algorithm InvertedIndex(int docID[n], string doc[n])
    M <- new HashMap
    count <- 0
    for each document with docID m from 0 to n-1 do
        for each term tm at position pos in the document with docID m do
            M{tm} <- M{tm} + {(pos, m)}          // append (position, docID) to the postings list of tm
            count(tm, m) <- count(tm, m) + 1     // frequency of tm in document m
    for each tm in M with docID m do
        sort(count(tm, m))                       // sort postings of tm by frequency

As explained in Fig. 2, the input to the indexing algorithm is the set of document IDs and the contents of all the documents. Each new term in a document becomes an index in the hash table. For each occurrence in a document, the document ID is added to the postings list of that term along with its position. After each occurrence of a term in a document, its corresponding frequency variable count is incremented. The postings list of each term is finally sorted based on the frequency of the word in each document.

Algorithm Search(HashMap M, string word)
    return M[word]

In the search part of an inverted index, the word queried by the user is passed as input along with the hash map that holds the set of all positions of each word in the documents. The hash map takes the word as its index and returns the value stored at that index.

VI. DATA SET - WIKI DUMPS

All the contents of Wikipedia are available in downloadable format as wiki dumps. These can be taken by users for archival or backup purposes, offline storage, educational purposes, republishing, etc. There are over four million files in Wikipedia, whose compressed form as wiki dumps is approximately 9.5 gigabytes; when extracted from the compressed form, it comes to around 44 gigabytes. Database backup dumps contain a complete copy of all Wikipedia documents as wikitext together with all of their metadata in XML, while static HTML dumps contain copies of all pages of the Wikipedia wikis in HTML form.

The contents of the dumps include page-to-page links, media metadata, titles, information about each page, log data, miscellaneous bits, etc. These are in the wrapper format described by the schema Export Format and are compressed in bzip2 and .7z formats. They are provided as dumps consisting of entire tables produced using mysqldump. Internal file system limits must be taken into account before extracting these files from the compressed format.

VII. HADOOP

The MapReduce method has emerged as a scalable model that is capable of processing petabytes of data. The fundamental concept of MapReduce is that, rather than working on one huge block of data with a single machine, Big Data is broken up into files that are further broken into blocks by Hadoop, and parallel processing and analysis is carried out.
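To make this concrete, here is a minimal single-process Python simulation of the map-and-reduce flow used later in Section IX: each block of documents is mapped to a partial inverted index, and the reduce step merges the partial indexes and sorts postings by term frequency. It is only a sketch of the idea, not Hadoop code, and the sample documents are made up.

    from collections import defaultdict

    def map_block(block):
        """block: list of (doc_id, text). Return term -> [(doc_id, [positions])]."""
        partial = defaultdict(list)
        for doc_id, text in block:
            positions = defaultdict(list)
            for pos, term in enumerate(text.lower().split()):
                positions[term].append(pos)
            for term, pos_list in positions.items():
                partial[term].append((doc_id, pos_list))
        return partial

    def reduce_partials(partials):
        """Merge partial indexes; sort each postings list by term frequency."""
        index = defaultdict(list)
        for partial in partials:
            for term, postings in partial.items():
                index[term].extend(postings)
        for term in index:
            index[term].sort(key=lambda p: len(p[1]), reverse=True)
        return index

    if __name__ == "__main__":
        docs = [(0, "hadoop stores big data"), (1, "inverted index for big big data")]
        blocks = [docs[:1], docs[1:]]          # pretend each block lives on a different node
        index = reduce_partials(map_block(b) for b in blocks)
        print(index["big"])                    # [(1, [3, 4]), (0, [2])] - doc 1 first, two occurrences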
Hadoop is a MapReduce framework that provides the HDFS (Hadoop Distributed File System) infrastructure [5]. HDFS was designed to operate and scale on commodity hardware; hardware failure is largely compensated for by replication of blocks of data on multiple nodes.

A. Hadoop Distributed File System (HDFS) Overview

HDFS is a distributed, user-level file system which stores, processes, retrieves and manages data in a Hadoop cluster. The HDFS infrastructure that Hadoop provides includes a dedicated master node called the Name Node, which runs a job tracker, stores metadata and controls the overall distributed process execution by checking whether all nodes are functioning properly through periodic heartbeats. The cluster also contains many other nodes called Data Nodes, each of which runs a task tracker and stores application data. An Ethernet network connects all the nodes. HDFS is implemented in Java and is platform independent. Files in HDFS are split into blocks, and each block is stored as an independent file in the local file system of a Data Node. Each block of an HDFS file is replicated at least three times on multiple Data Nodes; this replication of application data provides data durability [9].

The Name Node manages the namespace and the physical location of each file. File system operations are performed by an HDFS client by contacting the Name Node. The Name Node checks the client's access permissions and returns the list of Data Nodes hosting replicas of the blocks. Requests are then sent to the "closest" Data Node for a particular block, a socket connection is created between the client and that Data Node, and the data is transferred to the client application.
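The block and replication bookkeeping just described can be pictured with a short sketch. This is purely conceptual (real HDFS placement is done by the Name Node and also considers rack topology), and the block size, node names and file size used here are assumptions for the example.

    import itertools

    BLOCK_SIZE = 128 * 1024 * 1024      # an assumed block size; the HDFS block size is configurable
    REPLICATION = 3                     # HDFS replicates each block at least three times

    def split_into_blocks(file_size, block_size=BLOCK_SIZE):
        """Number of blocks a file of file_size bytes would occupy."""
        return (file_size + block_size - 1) // block_size

    def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
        """Assign each block to `replication` distinct Data Nodes (simple round-robin)."""
        placement = {}
        ring = itertools.cycle(data_nodes)
        for block_id in range(num_blocks):
            placement[block_id] = [next(ring) for _ in range(replication)]
        return placement

    if __name__ == "__main__":
        nodes = ["dn1", "dn2", "dn3", "dn4"]              # hypothetical Data Nodes
        blocks = split_into_blocks(44 * 1024**3)          # roughly the extracted wiki dump size
        print(blocks, "blocks; block 0 replicas on", place_replicas(blocks, nodes)[0])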

When a client application writes an HDFS file, it first splits the file into HDFS blocks, the Name Node provides the list of Data Nodes that will hold replicas of each block, and the data is written using multithreading [3]. If the client application is running on a Data Node, the first replica of the file is written into the local file system of that Data Node. If the client application is not running on a Data Node, a socket connection is created between the client and the first Data Node. The client splits the block into smaller packets and starts a pipeline: the client sends a packet to the first Data Node; the first Data Node, on receiving this packet, writes it to its local file system and also forwards it to the next Data Node. A Data Node can receive data from the previous node and at the same time forward data to the next node. When all nodes in the pipeline have written the block to their local file systems successfully, the block write is finished and the Data Nodes update the block's physical information at the Name Node. The architecture of the multiple node cluster implementation is shown in Fig. 3.

B. Working Process of Hadoop Architecture

Hadoop is designed to run on a large number of machines that do not share memory or disks. When data is loaded into Hadoop, the software splits the data into pieces and spreads it across the different servers, and Hadoop keeps track of where the data resides. Because there are replicas of each piece of data, data stored on a server that goes offline or dies can be automatically re-replicated from a known good copy.

In a Hadoop cluster, each server has two, four or eight CPUs, and each server operates on its own small piece of the data. Results are then delivered back through reduce operations: MapReduce maps the operation out to all of those servers and then reduces the results back into a single result set. Since Hadoop spreads out the data, it is possible to deal with very large amounts of data, and since all the processors work in parallel and are harnessed together, complicated computational questions can be answered. Node failures are automatically handled by the framework for both map and reduce functions.

VIII. ASSUMPTIONS AND GOALS

Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. Therefore, HDFS must provide high bandwidth and scale to hundreds of nodes. HDFS applications need a write-once, read-many access model for files: once a file is created and written, it is assumed that it will not be changed in future. This simplifies data coherency and yields high-throughput data access.

IX. PROPOSED SOLUTION

A Hadoop cluster is established with the Wikipedia files as input data, and inverted indexing is done by taking advantage of MapReduce.

In the map phase, the Wikipedia files are divided equally among the mappers and passed as inputs, and each Wikipedia file is given a unique document ID. Each mapper indexes each term in its files into the hash map, with the corresponding document ID and position in that document forming the postings list. When a mapper finds a term for the first time, it creates that term as an index and writes the corresponding postings list for it; when the term is found again, the new posting is appended to the previous list held under that term.
A. Map function pseudo-code

Algorithm Map(int docID[x], string doc[x])
    M <- new HashMap
    count <- 0
    for each document with docID m from 0 to x-1 do
        for each term tm at position pos in the document with docID m do
            M{tm} <- M{tm} + {(pos, m)}          // append (position, docID) to the postings list of tm
            count(tm, m) <- count(tm, m) + 1     // frequency of tm in document m
    emit(M, count(tm, m))

Fig. 3. Hadoop Multiple Node Cluster Architecture

In the above algorithm, x is the maximum number of documents processed within a mapper. The input file is read word by word and indexed accordingly, with its document ID and corresponding position, in a hash map.

The variable count keeps track of the frequency of a term within each document in that mapper. At the end, each mapper returns its hash map with the count value of each term in each document.

In the reduce phase, each reducer takes responsibility for a term or set of terms. These terms are given an index position in a global hash map where all the terms are stored as indices. When a reducer encounters its term from a mapper, it appends the postings list from that mapper to the term's value in this hash map. After appending the entire list for that term from all the mappers, the reducer sorts the postings list based on the count value of each document: the larger the value, the higher the preference. In the same way, all the terms in the whole document collection are indexed in the hash map in this reduce phase.

B. Reduce function pseudo-code

Algorithm Reduce(term tm, list of hash maps H[] from each mapper, count{tm, docID})
    G <- new HashMap                        // G is a common hash map for all reducers
    for each hash map H from the mappers do
        for each term tm in the document with docID m at position pos in H do
            G{tm} <- G{tm} + H{tm, pos, m}  // append the mapper's postings for tm to the global list
    sort(count(tm, m))                      // postings are sorted by the count value of each term in a document
    emit(G)

In the above pseudo-code, each reducer takes as its input all the hash maps of the various mappers and the count values of each term in each document. The reducer checks each hash map for its allotted term and, if it matches any mapper's index, appends that value to the global hash map. When all the values for a term have been appended, the postings are finally sorted based on the term's count value in each document.

C. Retrieval

The terms in the global hash map are divided among the mappers along with their corresponding postings lists. When the user queries a term, the Name Node sends the query to the corresponding Data Node. The value of the term is passed to the reducer as a complete list, and the reducer returns the first k values for that term to the user, where k is the maximum number of pages returned for a user query.

X. FUTURE WORKS

First, a distributed, multiple node Hadoop cluster has been built and the millions of wiki documents have been distributed over these nodes. A compressed inverted index containing indices for words in dictionary order is to be built over these documents. After building the inverted index, a distributed performance evaluation for searching documents by keyword is intended to be made. Further data analysis and text mining could be carried out based on the index; the results of text mining and data analysis would help in suggesting related pages, for example other documents where synonyms of the query are predominantly found. Indexing can be further partitioned into local index partitioning and global index partitioning. In term-based, or global, index partitioning, each node in the multiple node cluster stores postings lists only for a subset of the terms in the collection. In local index partitioning, each server builds a separate index for the files that it contains. When this is done, each server indexes only the documents that it contains, reducing the number of documents to thousands, which is far fewer than the number of indices that would have to be built if a single index were constructed over more than a million documents.
So, instead of building a single index over four million Wikipedia documents, a local index would be built over the documents on each node and a further index would be built over these indices, thereby speeding up search and compressing the indices. Further, indices built over articles (a, the, an) and other such common words would be deleted to improve accuracy.

XI. CONCLUSION

In this paper, a compressed inverted index data structure has been proposed that helps in searching for words in dictionary order, so that not all of the indices built for millions of documents need to be processed. In addition, basic factors in designing indices, such as merge factors, storage technique, index size, lookup speed, maintenance and fault tolerance, will also be taken into account. Building a local index for the files within each system will prove to be a good way to solve problems that could arise from parallelism, such as whether, when a file is added, index updating should happen before the search operation currently going on, or vice versa, since only a portion of the entire set of documents needs to be updated, making the index-merging process very simple. In addition to storing a token word, its document ID and the position in which it appears, we have suggested adding, for each token word, the document ID and its frequency, so as to rank the relevant documents. Our work has motivated several interesting open questions, such as which type of inverted index data structure would be most useful for text mining. Other ways to optimise search performance are being investigated and will be added to the suggested methods.

REFERENCES

[1] A. Raj, K. Kaur, U. Dutta, V. V. Sandeep and S. Rao, "Enhancement of Hadoop Clusters with Virtualization Using the Capacity Scheduler", Third International Conference on Services in Emerging Markets (ICSEM), Mysore, India, Dec. 2012, pp. 50-57. Print ISBN: 978-1-4673-5729-6. INSPEC Accession Number: 13343537.
[2] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, J. Majors, A. Manzanares and Xiao Qin, "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters", IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), Atlanta, GA, April 2010, pp. 1-9. Print ISBN: 978-1-4244-6533-0. INSPEC Accession Number: 11309800. DOI: 10.1109/IPDPSW.2010.5470880.
[3] A. Kala Karun and K. Chitharanjan, "A review on Hadoop — HDFS infrastructure extensions", IEEE Conference on Information & Communication Technologies (ICT), JeJu Island, April 2013, pp. 132-137. Print ISBN: 978-1-4673-5759-3.

[4] Richard McCreadie, Craig Macdonald and Iadh Ounis, "MapReduce indexing strategies: Studying scalability and efficiency", Information Processing and Management, vol. 48, no. 5, September 2012, pp. 873-888. Pergamon Press, Inc., Tarrytown, NY, USA. ISSN: 0306-4573.
[5] Apache Hadoop. Hadoop, HDFS, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig and ZooKeeper are trademarks of the Apache Software Foundation. http://www.hadoop.apache.org/ Last published: 10/16/2013 06:37:41. Copyright 2012, The Apache Software Foundation. Accessed 2nd October 2013.
[6] Barry Wilkinson and Michael Allen, "Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers", 2nd edition, March 14, 2004. ISBN-10: 0131405632, ISBN-13: 978-0131405639.
[7] Gal Lavee, Ronny Lempel, Edo Liberty and Oren Somekh, "Inverted index compression via online document routing", in WWW '11: Proceedings of the 20th International Conference on World Wide Web, ACM, New York, NY, USA, 2011. DOI: 10.1145/1963405.1963475.
[8] Guanghui Xu, Feng Xu and Hongxu Ma, "Deploying and researching Hadoop in virtual machines", IEEE International Conference on Automation and Logistics (ICAL), Zhengzhou, Aug. 2012, pp. 395-399. ISSN: 2161-8151. E-ISBN: 978-1-4673-0363-7. Print ISBN: 978-1-4673-0362-0.
[9] K. Shvachko, Hairong Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System", IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, May 2010, pp. 1-10. E-ISBN: 978-1-4244-7153-9. Print ISBN: 978-1-4244-7152-2. INSPEC Accession Number: 11536653.
[10] M. Ishii, Jungkyu Han and H. Makino, "Design and performance evaluation for Hadoop clusters on virtualized environment", International Conference on Information Networking (ICOIN), Bangkok, Jan. 2013, pp. 244-249. ISSN: 1976-7684. E-ISBN: 978-1-4673-5741-8. Print ISBN: 978-1-4673-5740-1.
