
Procedia Engineering 38 (2012) 3215 - 3221


Analyzing Distillation Process of Hidden Terms in Web

Documents for IR

M. Pradeepa a,*, Dr. C. Deisy b

a CSE Dept., Bharath Niketan Engineering College, Aundipatti, Theni (dt), Tamil Nadu, India. b CSE Dept., Thiagarajar College of Engineering, Madurai, Tamil Nadu, India.

Abstract

Previous work in web-based applications has addressed web content mining, pattern recognition, and similarity measures between web documents. This paper analyzes web documents in an enhanced way; delving into the distillation of web documents is the next step in hypertext mining. A sparse document carries very little data on the web and therefore faces problems such as different words with almost identical or similar meanings, and sparseness itself. These problems are central obstacles in natural language processing (NLP) and information retrieval (IR). Mining hidden terms discovers search-query topics from large external datasets (universal datasets), which helps handle unseen data in a better way. The goals of this web document mining are efficient information finding, filtering information based on the user query, and discovering more topic-focused keywords based on the rich source of global information in the universal dataset. The proposed Distillation model integrates a probabilistic generative model, the Gibbs sampling algorithm, and a deployment method. The model can be applied to different natural languages and data domains.

© 2012 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Noorul Islam Centre for Higher Education

Keywords - Web mining, hidden terms, sparse data, Latent Dirichlet Allocation (LDA), Gibbs sampling, clustering.

I. Introduction

Web content mining intends to discover useful information from web documents. Web content data are unstructured (free text) or semi-structured (HTML documents). Discovering the content of web documents is a critical part of information retrieval (IR). The main work of this paper is to identify similar-meaning words and repeated terms in web documents and present them in their most identifiable form, displayed in an easily understandable format for the web user. This draws more user attention during web content search, because the web page presents important terms discovered from the documents themselves rather than from the meta description tag. The central problem in this work is identifying words with similar meanings.

Synonyms are different words with almost identical or similar meanings. Hypernyms and hyponyms are words that refer to, respectively, a general category and a specific instance of that category; for example, fruit is a hypernym of apple, and apple is a hyponym of fruit. Homographs are words that share the same spelling, irrespective of their pronunciation, and homophones are words that share the same pronunciation, irrespective of their spelling. Homonyms are simultaneously homographs and homophones: words that share the same spelling and the same pronunciation but have different meanings. An example of a homonym pair is bear (animal) and bear (carry) [5].
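The lexical relations above can be made concrete with a tiny hand-built lexicon; the entries below are purely illustrative stand-ins for a real resource such as WordNet:

```python
# Toy lexicon illustrating the lexical relations discussed above.
# The entries are illustrative only, not drawn from a real resource.
HYPERNYMS = {
    "apple": "fruit",
    "mango": "fruit",
    "rose": "flower",
}

SYNONYMS = {
    "big": {"large", "huge"},
    "large": {"big", "huge"},
}

def is_hypernym(general, specific):
    """True if `general` is a hypernym of `specific`."""
    return HYPERNYMS.get(specific) == general

def are_synonyms(a, b):
    """True if the two words are listed as near-synonyms."""
    return b in SYNONYMS.get(a, set())

print(is_hypernym("fruit", "apple"))   # fruit is a hypernym of apple
print(are_synonyms("big", "large"))    # near-synonym pair
```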

* Corresponding author.

E-mail address: pradeenila@gmail.com (M. Pradeepa), cdcse@tce.edu (Dr. C. Deisy)

1877-7058 © 2012 Published by Elsevier Ltd. doi:10.1016/j.proeng.2012.06.372

Methods such as word clustering and document clustering have been used to enhance document representations. The well-known Latent Semantic Indexing (LSI) and probabilistic Latent Semantic Indexing (pLSI) techniques use a latent variable model, which can represent documents as combinations of terms. Existing systems use these techniques to analyze web documents, and most current research focuses on document clustering methods. These are based on the Vector Space Model (VSM), which is widely used for data representation in text classification and clustering. The VSM represents each document as a feature vector of the terms in the document, where each feature holds the frequency of a term. The relationship between documents is measured by similarity measures over these feature vectors (e.g., the cosine measure and the Jaccard measure). In text-mining techniques, the frequency of a term is calculated to estimate its importance in the document. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. The proposed model captures the semantic structure of each term in the document: three measures analyzing concepts at the sentence, document, and corpus levels are computed. We collected a very large universal dataset from different external sources and then built a model for the distillation of term inference, clustering, and enhanced processing of hidden terms in web content.
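The VSM similarity measures named above can be sketched directly; this is a minimal illustration with whitespace tokenization and made-up example sentences, not the paper's implementation:

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Term-frequency vector of a document (simple whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two term-frequency vectors."""
    common = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in common)
    norm1 = sqrt(sum(f * f for f in v1.values()))
    norm2 = sqrt(sum(f * f for f in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def jaccard(v1, v2):
    """Jaccard similarity on the term sets of two documents."""
    s1, s2 = set(v1), set(v2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

d1 = tf_vector("web mining discovers hidden terms")
d2 = tf_vector("mining hidden topics from web documents")
print(cosine(d1, d2))   # 3 shared terms out of 5- and 6-term documents
print(jaccard(d1, d2))  # 3 shared terms / 8 distinct terms = 0.375
```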

The rest of the paper is organized as follows: Section 2 presents the literature survey. Section 3 presents the concept-based mining model, which includes sentence-based, document-based, and combined concept analysis, and a concept-based similarity measure for information seeking. The analysis approach is presented in Section 4. The last section summarizes the conclusions and future work.

II. LITERATURE SURVEY

There has been a moderate amount of previous work directly relevant to the distillation of web documents. In most cases, hidden-term mining was not applied to web pages but to a different text domain. Most earlier systems were fairly simple, using a small number of features and a simple generative method. In this section, we describe this previous research in more detail.

The first related studies focused on topic finding [10][18]; text trend analysis [19] is also related to our method. We are given titles and short snippets rather than whole documents, and we train a regression model for the ranking of cluster names, which is closely related to the efficiency of users' browsing. The main work of this paper builds on Phan et al. 2011 [1], who developed a hidden-topic framework for short web documents. That framework addresses two challenges posed by such documents: (1) data sparseness and (2) synonyms/homonyms. As a result, short documents become less sparse and more topic-oriented.

Web-based metrics that compute the semantic similarity between words or terms have been compared. This work is performed automatically and does not require any human-annotated knowledge resources; context-based similarity metrics significantly outperform co-occurrence-based metrics, in terms of correlation with human judgment, for both tasks. In addition, unsupervised context-based similarity computation algorithms are shown to be competitive with state-of-the-art supervised semantic similarity algorithms that employ language-specific knowledge resources [17, 24]. Extracting a query-oriented snippet (or passage) and highlighting the relevant information in a long document can help reduce the result-navigation cost for end users. A language-model-based method accurately detects the most relevant passages of a given document; this passage retrieval searches relevant nodes to filter candidate passages and focuses on query-informed segmentation for snippet extraction [4].

Ontology terms developed through a social process are maintained and kept current by the Wikipedia community, represent a consensus view, and have meanings that can be understood simply by reading the associated Wikipedia page. Zareen Syed, Tim Finin, and Anupam Joshi [9] focused on using Wikipedia articles and the category and article link graphs to predict concepts common to a set of documents. They describe several algorithms to aggregate and refine results, including the use of spreading activation to select the most appropriate terms. While the Wikipedia category graph can be used to predict generalized concepts, the article link graph helps by predicting more specific concepts and concepts not in the category hierarchy.

The document collection of a search engine, even though it may seem to include many documents, is too sparse to answer a unique question: it holds only past information, which is not satisfactory for answering novel queries. To overcome this, a search engine should help the user create knowledge from sparse documents. A novel information retrieval method named combination retrieval is used for this purpose. The basic idea is that an appropriate combination of existing documents may create novel knowledge, even while each single document falls short of answering the novel query. Based on the principle that combining ideas triggers the creation of new ideas, a system obtains and presents an optimal combination of documents to the user, optimal in that the solution forms a document set that is the most readable (understandable) and reflects the user's context [3]. Phan, Nguyen, and Horiguchi 2008 [21] classify short and sparse text and web documents with hidden topics learned from a large dataset. Svitlana Volkova [6] evaluated a wide range of similarity measure techniques for web documents. Those techniques are:

i. LATENT SEMANTIC ANALYSIS (LSA)

Latent semantic analysis (LSA) is a technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to those documents and terms. LSA uses a term-document matrix that describes the occurrences of terms in documents; it is a sparse matrix whose rows and columns correspond to terms and documents, respectively. At a basic level, LSA can identify the relationship between words and their stem terms.
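The LSA idea can be sketched with a truncated SVD of a toy term-document matrix; the counts below are illustrative only, and numpy is assumed to be available:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Counts are illustrative only.
terms = ["web", "mining", "topic", "fruit", "apple"]
X = np.array([
    [2, 1, 0, 0],   # web
    [1, 2, 0, 0],   # mining
    [1, 1, 0, 0],   # topic
    [0, 0, 2, 1],   # fruit
    [0, 0, 1, 2],   # apple
], dtype=float)

# Truncated SVD: keep k latent concepts.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_space = U[:, :k] * s[:k]       # term coordinates in concept space
doc_space = Vt[:k, :].T * s[:k]     # document coordinates in concept space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Terms about the same latent concept end up close together.
print(cos(term_space[0], term_space[1]))  # web vs mining: high
print(cos(term_space[0], term_space[3]))  # web vs fruit: near zero
```

Since the two blocks of the toy matrix share no documents, the SVD separates them into distinct concept axes, which is the mechanism LSA relies on.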

ii. PROBABILISTIC LATENT SEMANTIC ANALYSIS (PLSA)

Probabilistic latent semantic analysis (pLSA) is a statistical technique for the analysis of two-mode and co-occurrence data. pLSA evolved from LSA, adding a sounder probabilistic model based on a mixture decomposition derived from a latent class model. The pLSA model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics, i.e., a point on the topic simplex. There is one such distribution for each document; the set of training documents thus defines an empirical distribution on the topic simplex. Problems of pLSA:

• Incompleteness: it provides no probabilistic model at the level of documents.

• The number of parameters in the model grows linearly with the size of the corpus, and it is not clear how to assign a probability to a document outside of the training data.

iii. LATENT DIRICHLET ALLOCATION (LDA) & pLSA

LDA posits that each word of both the observed and unseen documents is generated by a randomly chosen topic which is drawn from a distribution with a randomly chosen parameter. This parameter is sampled once per document from a smooth distribution on the topic simplex.

III. DISTILLATION MODEL

The proposed distillation model is an integration of Latent Dirichlet Allocation, Gibbs sampling, and pattern deploying with a relevance matching process to discover hidden terms from the universal dataset. We use the following notions of words, documents, and corpora:

• A word or term is an item from the vocabulary of the natural language.

• A document is a sequence of N words, denoted D = (w1, w2, ..., wN), where wn is the nth word in the sequence.

• A corpus is a collection of M documents, denoted C = {D1, D2, ..., DM}.

We have to find a distillation model of a corpus that identifies similar-meaning words in the web documents [8]. Latent variables are variables that are not directly observed but are instead inferred, through a mathematical model, from other variables that are observed and directly measured.

A. LATENT DIRICHLET ALLOCATION (LDA)

In LDA, each document may be viewed as mixture distributions of various topics. Figure. 1 illustrates the working process of the Latent Dirichlet Allocation (LDA).

Figure 1. Latent Dirichlet Allocation

α is the Dirichlet parameter of the per-document topic distributions.

β is the Dirichlet parameter of the per-topic word distribution.

θd is the topic distribution for document d.

φk is the word distribution for topic k.

zd,n is the topic for the nth word in document d, and

wd,n is the observed word itself.

The wd,n are the only manifest variables; as opposed to latent variables, manifest variables can be observed and directly measured. M denotes the total number of documents, K is the number of (hidden/latent) topics, N denotes the length of the document, and V is the dimension of the vocabulary. φ is a K×V Markov matrix, each row of which denotes the word distribution of one topic.

The generative process behind LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document i in a corpus D [8]:

1. Choose θi ~ Dir(α), where i ∈ {1, ..., M}

2. Choose φk ~ Dir(β), where k ∈ {1, ..., K}

3. For each of the words wi,j, where j ∈ {1, ..., Ni}:

(a) Choose a topic zi,j ~ Multinomial(θi)

(b) Choose a word wi,j ~ Multinomial(φzi,j)

From Figure 1, we state the full joint probability of the data in the LDA model as follows:

P(w, z, θ, φ | α, β) = Π(k=1..K) P(φk | β) · Π(i=1..M) P(θi | α) · Π(j=1..Ni) P(zi,j | θi) P(wi,j | φzi,j)

Here P(zi,j | θi) and P(wi,j | φzi,j) are multinomial distributions, and the other two distributions, P(θi | α) and P(φk | β), are Dirichlet [8].

B. GIBBS SAMPLING

Inference can be made by Gibbs sampling, one of the simplest Markov chain Monte Carlo algorithms. It generates a sequence of samples from the joint probability distribution of random variables. The idea is to estimate the joint distribution of the variables (the unknown parameters or latent variables) and the expected value of any one of them. The goal of Gibbs sampling is to find estimates for the parameters of interest in order to determine how well the observable data fit the model of interest, and also whether data independent of the observed data fit the model described by the observed data [7, 23]. Gibbs sampling needs a Gibbs sampler for estimating the best assignments of topics to words and documents in a corpus. The algorithm is introduced in Tom Griffiths' paper "Gibbs sampling in the generative model of Latent Dirichlet Allocation" (2002) [14].
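A collapsed Gibbs sampler for LDA can be sketched as follows; this is a minimal illustration on a toy corpus (hyperparameters and data are invented for the example), not the paper's implementation:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA (a minimal sketch).
    docs: list of lists of word ids in range(V)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    ndk = np.zeros((M, K))          # doc-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # total words per topic
    z = []                          # current topic assignment per word
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = int(rng.integers(K))            # random initial topic
            zd.append(k)
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = z[d][j]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional P(z = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][j] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # point estimate of the topic-word distributions
    phi = (nkw + beta) / (nk[:, None] + V * beta)
    return phi

# Two clearly separated "topics": words 0-2 vs words 3-5.
docs = [[0, 1, 2, 0, 1, 2]] * 3 + [[3, 4, 5, 3, 4, 5]] * 3
phi = gibbs_lda(docs, K=2, V=6)
print(phi.round(2))
```

Each sweep resamples every word's topic from its full conditional given all other assignments, which is exactly the collapsed Gibbs update for LDA.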

C. PATTERN DEPLOYING WITH RELEVANCE:

Several methods discover terms by using a weighting function that assigns a value to each term according to its frequency. The Pattern Deploying with Relevance approach was implemented and evaluated in [18, 22]. In this pattern-mining approach, each found sequential pattern is treated as a whole item without being broken into a set of individual terms. Each mined sequential pattern p is given a value based on the following weighting function:

weight(p) = |{d1 | d1 ∈ D+, p in d1}| / |{d2 | d2 ∈ D, p in d2}|    (2)

where d1 and d2 denote documents and D+ indicates the positive documents in D. Low pattern frequency occurs because it is difficult to match patterns in documents, particularly when the pattern is long. A proper pattern deploying method is therefore needed to solve the low-pattern-frequency problem; we propose a method to deploy discovered patterns with relevance (PDR).

We also need to determine the weight of each term in T when we use the discovered knowledge in a set of documents, denoted by η. The weighting scheme for a given term tj is denoted by the function:

w_η(tj) = Σ(p ∈ η, tj ∈ p) f(tj, p) / Σ(t ∈ p) f(t, p)    (3)

The Pattern Deploying with Relevance (PDR) function obtains a set of frequent sequential patterns {p1, p2, ..., pn} from each document, where p = {(t1, f1), (t2, f2), ..., (tm, fm)} and fi is the term frequency of term ti. The support of p in Ω can be described as

support(p) = supp_Ω(p) × I(p) / Σi supp_Ω(pi) × I(pi)

where support(p) indicates the importance of the pattern to the document that contains it and I(p) is a length reinforcement derived from len(p). The relationship between a pattern and the term space is described by β:

β(p) = {(t1, w1), (t2, w2), ..., (tm, wm)}

where wi is the term weight of ti. The probability of a term t can then be described by the following function:

pr_β(t) = Σ(p : t ∈ β(p)) support(p) × w(t, p)

A relevance function for a document d can be defined as follows:

relevance(d) = Σt pr_β(t) × τ(t, d), where τ(t, d) = 1 if t ∈ d, and 0 otherwise.

IV. THE OVERALL SYSTEM ARCHITECTURE

The entire system architecture for discovering hidden terms in a web document is shown in Figure 2. The architecture consists of the following steps:

(a) Picking a universal dataset from the various data sources.

(b) Carrying out the distillation of the hidden-term inference model using Latent Dirichlet Allocation (LDA) and Gibbs sampling, which evaluates the document relationships. Each word in a document is assumed to be generated by a hidden-term inference model. For example, an LDA model may contain topics that can be classified as "medicine" and "media"; the labeling is left to the viewer, because a topic that includes these words cannot be named automatically. A topic has probabilities of generating various words, such as doctor, patient, and drugs, which can be classified and interpreted by the viewer as "medicine"; obviously, medicine itself will have a high probability among these terms. The "media" topic likewise has probabilities of generating words such as television, newspapers, and journals.

(c) Building the clustering: clustering is the division of data into groups of similar objects. Clustering algorithms [11, 12] aim at dividing the set of objects into groups where objects in each cluster are similar to each other and as dissimilar as possible to objects from other clusters. The hierarchical clustering algorithm merges or divides existing groups, creating a hierarchical structure that reflects the order in which groups are merged or divided. The algorithm works in the following steps:

1. Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

2. Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster.

3. Update the proximity matrix to reflect this merge operation. If all patterns are in one cluster, stop. Otherwise, go to step 2.
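The steps above can be sketched as a single-link agglomerative pass; the points and cluster count are invented for illustration, and real systems would use an optimized library routine:

```python
import numpy as np

def agglomerative(points, target_clusters):
    """Single-link agglomerative clustering: start with each pattern
    as its own cluster and repeatedly merge the closest pair
    (a minimal sketch; distances are recomputed rather than cached)."""
    clusters = [[i] for i in range(len(points))]
    pts = np.asarray(points, dtype=float)
    while len(clusters) > target_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: closest pair of members
                d = min(np.linalg.norm(pts[i] - pts[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge; "update the proximity"
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
result = agglomerative(points, 2)
print(result)  # the two nearby pairs end up together
```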

Reasons for choosing this algorithm: a flexible level of granularity; ease of handling any form of similarity or distance; consequent applicability to any attribute type; and greater versatility [13, 15].

Figure 2. The overall system architecture

(a) Picking a proper universal dataset

(b) Carrying out distillation of hidden term inference model

(c) Development of the clustering

V. PRE-EXPERIMENT RESULT

This paper discovers hidden terms from web documents using the distillation model. The information seeking, or information retrieval, process obtains information on both the web users and the technological environment. In general, the user enters a query in the search box, and the web search engine displays search result elements such as the title, snippet (a description of the document), and URL link. The distillation model instead discovers the title, keywords, and URL result elements for the search engine, as illustrated in Figure 3. Its keywords are gathered based on the hidden terms from the web documents and the universal dataset. Based on the preliminary work on this model, we examine the resulting search information. Figure 4 shows the term-based keywords for crawling Wikipedia [20]. Universal datasets were collected from Wikipedia and from various other sources. We will improve this universal dataset by using different techniques to enhance the discovery of patterns in web documents.

Figure 3. The result elements of search query form

Arts: architecture, fine art, dancing, fashion, film, music ...
Business: advertising, e-commerce, finance, investment ...
Computers: hardware, software, database, multimedia ...
Education: course, graduate, professor, university ...
Engineering: automobile, telecommunication, civil eng. ...
Entertainment: book, music, movie, painting, photos ...
Health: diet, therapy, healthcare, treatment, nutrition ...
Mass-media: news, newspaper, journal, television ...
Politics: government, legislation, party, regime, military ...
Science: biology, physics, chemistry, ecology, laboratory ...
Sports: baseball, cricket, football, tennis, Olympic games ...
Misc.: association, development, environment ...

Figure 4. Term based keywords for crawling Wikipedia

VI. Conclusions

A distillation model is proposed to improve the discovery of hidden terms from web documents. It provides an appealing approach to dealing with data sparseness, which may enhance the performance of existing models. It focuses mainly on the data sparseness and synonym/homonym problems, and provides a way to build result-element keywords that are more topic-focused, based on the rich source of global information in the universal dataset. The proposed result elements of the search query form may lead to an enhanced distillation model for unseen data. Further work will focus on a full implementation of the distillation model, compared and tested against various approaches to improve existing performance; this will help us overcome difficulties such as noisy words and vocabulary mismatch, for better classification and clustering.

References

[1] Xuan-Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Le-Minh Nguyen, Susumu Horiguchi, and Quang-Thuy Ha. A Hidden Topic-based Framework towards building applications with Short Web documents. IEEE transactions on knowledge and data engineering, 2011.

[2] M. Sahami and T. Heilman. A Web-based kernel function for measuring the similarity of short text snippets. In WWW, 2006.

[3] Naohiro Matsumura, Yukio Ohsawa, Mitsuru Ishizuka. Combination retrieval for creating knowledge from sparse document-collection. Elsevier. Knowledge-based Systems 18 (2005) 327-333, May 2005.

[4] Qing Li, K. Selçuk Candan, Yan Qi. Extracting Relevant Snippets from Web Documents through Language Model based Text Segmentation. IEEE conference, pages 287-290, January 2008.

[5] http://en.wikipedia.org/wiki/Synonym

[6] Svitlana Volkova. Latent Dirichlet Allocation. Project report, svolkova.weebly.com.

[7] Eric C. Rouchka. A Brief Overview of Gibbs Sampling. IBC Statistics Study Group, May 20, 1997.

[8] David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003) 993-1022.

[9] Zareen Saba Syed, Tim Finin and Anupam Joshi .Wikipedia as an Ontology for Describing Documents. Association for the Advancement of Artificial Intelligence (www.aaai.org),2007.

[10] Liu B., Chin C. W., and Ng, H. T. Mining Topic-Specific Concepts and Definitions on the Web. In Proceedings of the Twelfth International World Wide Web Conference (WWW'03), Budapest, Hungary, 2003.

[11] David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, The MIT Press, 2001.

[12] Pavel Berkhin. Survey of Clustering Data Mining Techniques. Unpublished (see http://citeseer.nj.nec.com/berkhin02survey.html), 2002.

[13] Osama Abu Abbas. Comparison Between Data Clustering Algorithms. The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.

[14] Tom Griffiths. Gibbs sampling in the generative model of Latent Dirichlet Allocation. 2002.

[15] L. Baker and A. McCallum. Distributional clustering of words for text classification. In ACM SIGIR, 1998.

[16] D.Bollegala, Y. Matsuo, and M. Ishizuka. Measuring semantic similarity between words using Web search engines. In WWW, 2007.

[17] T. Hofmann. Latent semantic models for collaborative filtering. ACM TOIS, Vol. 22, No. 1, pp. 89-115, 2004.

[18] Lawrie D. and Croft W. B. Finding Topic Words for Hierarchical Summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), pages 349-357, 2001.

[19] Lent B., Agrawal R., and Srikant R. Discovering Trends in Text Databases. In Proceedings of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD'97), Newport Beach, California, August 1997.

[20] http://gibbslda.sourceforge.net/wikipedia-topics.txt.

[21] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In WWW, 2008.

[22] J.Cai, W. Lee, and Y.Teh. Improving WSD using topic features. In EMNLP-CoNLL, 2007.

[23] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE PAMI, Vol. 6, pp. 721-741, 1984.

[24] Elias Iosif and Alexandros Potamianos. Unsupervised Semantic Similarity Computation between Terms Using Web Documents. IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 11, pp. 1637-1647, November 2010.