
CONDENZA: A System for Extracting Abstract from a Given Source Document

Mgbeafulike I. J., Ejiofor C. I.
Article Date Published: 5 February 2018 | Page No.: 23526-23530


Abstract

Despite the increasing availability of documents in electronic form and of desktop publishing software, abstracts continue to be produced manually. The purpose of CONDENZA is to develop a system for abstract extraction from a given source document; CONDENZA describes a system based on automatic methods of obtaining abstracts. The rationale of abstracts is to facilitate quick and accurate identification of the topic of published papers; the idea is to save a prospective reader time and effort in finding useful information in a given article or report. The system generates a shorter version of a given text while attempting to preserve its meaning. This task is carried out using summarization techniques. CONDENZA implements a method that combines the Apriori algorithm for keyword frequency detection with a clustering-based approach for grouping similar sentences together. The results from the system show that our approach summarizes text documents efficiently by avoiding redundancy among the words in the document and ensuring the highest relevance to the input text. The guiding factor of our results is the ratio of input to output sentences after summarization.


1.0 Introduction

The automatic extraction of abstracts from a given source text document has been a neglected area of information science. Despite the increasing availability of documents in electronic form and of desktop publishing software, abstracts continue to be produced manually. It is therefore the purpose of this study to develop a system for abstract extraction from a given source document. An abstract is a brief overview or summary of the central subject matter of a given document. It is typically a very condensed summary of a study that highlights the major points and concisely describes the content and scope of the study. The challenges of manually reading and summarizing documents cannot be overemphasized. Documents are often handled in their thousands, especially in educational circles where academic materials have to be read and scanned through repeatedly in order to understand their context. Factors that make manual processing such a difficult ordeal include: (i) reading through a whole document and sorting out the essential points requires a great deal of time and effort; (ii) considerable manpower is required to read a document and separate out the important extracts efficiently, which can lead to high expenditure by the organization or body handling the processing of the documents. Automatic text summarization is the process of shortening a text document with software in order to create a summary containing the major points of the original document; it is a technique whereby a computer automatically creates an abstract or summary of one or more texts. According to Babar [1], a summary is a text that is produced from one or more texts, that conveys important information in the original text, and that is of a shorter form. The goal of automatic text summarization is to present the source text in a shorter version while preserving its semantics.

The most important advantage of using a summary is that it reduces the reading time. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax (Abderrafih [2]). There are broadly two types of extractive summarization task, depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of a collection (whether of documents, sets of images, videos, news stories, etc.). The second is query-relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query-relevant text summaries and generic machine-generated summaries, depending on what the user needs (Camargo et al. [3]). An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. This paper presents the current technologies and techniques, as well as the prevailing challenges, in automatic text summarization; consequently, we propose a model for improving text summarization using a method that combines the Apriori algorithm for keyword frequency detection with a clustering-based approach for grouping similar sentences together.

2.0 Related Work

Text summarization has been an area of interest for many years. The need for an automatic text summarizer has increased greatly due to the abundance of electronic documents. Mani et al. [4] define text summarization as the process of distilling the most important information from single or multiple documents to produce a condensed version for particular user(s) and task(s). Shen et al. [5] differentiate the two approaches to text summarization as abstraction-based and extraction-based. The abstraction-based approach understands the overall meaning of the document and generates a new text, whereas the extraction-based approach simply selects a subset of existing sentences in the original text to form the summary. Liang et al. [6] developed a BE (basic element)-based multi-document summarizer with query interpretation. The idea is to assign scores to BEs according to some algorithms, assign scores to sentences based on the scores of the BEs contained in the sentences, and then apply standard filtering and redundancy-removal techniques before generating summaries. The experimental results show that this approach was very effective. The Khresmoi text summarizer was developed at Dublin City University (DCU), Ireland, for the Khresmoi project. The aim of the summarizer is to provide a summarized view of medical documents for use in the Khresmoi system interface. To achieve this, the summarizer selects the most meaningful/interesting segments in a text for inclusion in the summary, using features to describe segments and weight the importance of segments in documents (Kelly et al. [7]). Baxendale [8] presented experimental data showing that the leading sentences of a document are more important, in terms of informative content or significance, than the ones at the end; hence the position of a sentence in a document forms an important selection criterion. Luhn [9] presented the idea that frequently occurring terms signify the overall content of the document. Brin and Page [10] used a PageRank-based score to rank sentences, giving more importance to sentences that both refer to other sentences and are referred to by them.
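As a concrete illustration of such PageRank-style sentence ranking (this is not the cited authors' implementation; the link weights below are invented), a minimal Python sketch runs power iteration over a small sentence-link matrix:

```python
def pagerank(adj, d=0.85, iters=50):
    """Power iteration over a weighted adjacency matrix (list of lists)."""
    n = len(adj)
    ranks = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            # weight of each inbound link j -> i, normalized by j's total outgoing weight
            incoming = sum(ranks[j] * adj[j][i] / sum(adj[j])
                           for j in range(n) if adj[j][i] and sum(adj[j]))
            new.append((1 - d) / n + d * incoming)
        ranks = new
    return ranks

# Three "sentences": sentence 0 is linked with both others, so it should rank highest.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
ranks = pagerank(adj)
```

In a summarizer, the adjacency weights would typically be sentence-similarity scores rather than the 0/1 links used here.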

3.0 Materials and Methods

The Apriori-based keyword frequency detection algorithm is as shown below:

  1. Get the items (words) to be counted.

  2. Set an arbitrary value s that will serve as the minimum frequency (support) threshold (in this example, s = 2).

  3. Start Pass 1 through the items.

  4. After Pass 1 is completed, check the count of each item.

  5. If the count of an item i is greater than or equal to s, i.e. count(item i) >= s, then item i is frequent. Save it for the next pass.

  6. After Pass 2 ends, check the count of each pair of items.

  7. If the count is greater than or equal to s, i.e. count(item i, item j) >= s, the pair is considered frequent.
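The two passes above can be sketched in Python (a minimal illustration, not the system's VB.NET implementation; the sample sentences are invented):

```python
from itertools import combinations

def frequent_items(words, s):
    """Pass 1: count single words; keep those with count >= s."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w for w, c in counts.items() if c >= s}

def frequent_pairs(sentences, frequent, s):
    """Pass 2: count pairs of frequent words co-occurring in a sentence."""
    pair_counts = {}
    for sent in sentences:
        present = sorted(set(sent) & frequent)
        for pair in combinations(present, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p for p, c in pair_counts.items() if c >= s}

sentences = [["text", "summary", "system"],
             ["summary", "text", "method"],
             ["system", "text", "summary"]]
words = [w for sent in sentences for w in sent]
s = 2
items = frequent_items(words, s)
pairs = frequent_pairs(sentences, items, s)
```

The candidate pairs in Pass 2 are drawn only from words that survived Pass 1, which is the Apriori pruning idea.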

The clustering (k-means) algorithm is as shown below:

Initialize m_i, i = 1,…,k, for example, to k random x^t
Repeat
    For all x^t in X
        b_i^t ← 1 if ||x^t − m_i|| = min_j ||x^t − m_j||
        b_i^t ← 0 otherwise
    For all m_i, i = 1,…,k
        m_i ← (Σ_t b_i^t x^t) / (Σ_t b_i^t)
Until the m_i converge
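A minimal Python sketch of this k-means procedure (an illustration under the assumption that points are tuples of floats, not the system's implementation):

```python
import random

def kmeans(X, k, iters=100):
    """Plain k-means over a list of points (tuples of floats)."""
    means = random.sample(X, k)                     # initialize m_i to k random x^t
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: b_i^t = 1 for the nearest mean m_i
        clusters = [[] for _ in range(k)]
        for x in X:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))
            clusters[nearest].append(x)
        # Update step: m_i <- mean of the points assigned to cluster i
        new_means = [tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else means[i]
                     for i, pts in enumerate(clusters)]
        if new_means == means:                      # until the m_i converge
            break
        means = new_means
    return means, clusters

random.seed(0)                                      # deterministic initialization
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
means, clusters = kmeans(X, 2)                      # two well-separated pairs
```

In the summarizer, each point would be a feature vector for a sentence, so that similar sentences fall into the same cluster.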

The algorithm for the abstraction is as shown below:

  1. Start

  2. Select Document

  3. Extract Sentences

  4. Set Abstract length

  5. Set an arbitrary value s that will indicate maximum frequency size

  6. Get frequency of word combinations in sentence

  7. Get number of sentences having the highest word combination frequencies

  8. Join sentences to produce the abstract document

  9. Display the abstract document

  10. End
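The abstraction steps above can be sketched in Python as a frequency-based sentence extractor (a simplified illustration: it scores sentences by how many frequent words they contain, which approximates steps 5-7; the regular expressions and the sample text are assumptions, not part of the system):

```python
import re
from collections import Counter

def extract_abstract(text, n_sentences=3, s=2):
    """Keep the n highest-scoring sentences, in document order.

    A sentence's score is the number of 'frequent' words it contains,
    where a word is frequent if it occurs at least s times in the text.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    frequent = {w for w, c in Counter(words).items() if c >= s}
    def score(sentence):
        return sum(1 for w in re.findall(r'[a-z]+', sentence.lower()) if w in frequent)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    top = sorted(ranked[:n_sentences])              # restore original document order
    return ' '.join(sentences[i] for i in top)

text = "Cats sleep a lot. Dogs bark loudly. Cats and dogs play. The sun is hot."
summary = extract_abstract(text, n_sentences=2, s=2)
```

Joining the selected sentences in their original order corresponds to step 8 of the algorithm.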

3.1 Proposed System

The proposed system is a single-document summarizer based on extractive techniques and is implemented on text documents. The proposed system consists of five (5) phases: (i) preprocessing of input text, (ii) document analysis, (iii) filtering and synthesis, (iv) abstract generation, and (v) abstract document. The proposed work is a sentence-extraction-based single-document summarization which creates a generic abstract of a given text document. This work uses a combination of the Apriori algorithm and clustering-based methods to improve the quality of the summary.

3.2 System Architecture

The system architecture is modeled in the diagram below.

Fig. 3.1 System Architecture

The proposed system for extracting an abstract from a given source document was implemented using the following tools. VB.NET was used to develop the user interface of the software, providing the interaction layer that enables users of the system to interact with it; it was also used to implement the processing logic of the model used to extract information from the documents. MS Access was used to create the database for storing information on the documents from which abstracts are to be extracted, as well as for storing the produced abstracts.

The main module of the proposed Document Abstract Extraction System is the abstract extraction module, which performs the extraction of the document abstract.

3.3 System Implementation

The tests carried out showed that abstracts could be extracted from the documents successfully, and that the number of sentences in the abstract could be controlled by the researcher, thus fulfilling the main objectives of this research. A sample output is shown below:

4.0 Result and Discussion

Fig. 4.1 Abstract Output Interface

The performance of the system is measured using the number of words in the source document, the number of words in the output summary file, and the number of words removed. Evaluation showed at least a 57% reduction in word count from the original input text to the output summary.
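The reduction metric described here is a simple percentage decrease; a minimal sketch (the 1000-word and 430-word figures are illustrative, not from the evaluation):

```python
def reduction(source_words, summary_words):
    """Percentage decrease in word count from source document to summary."""
    return 100 * (source_words - summary_words) / source_words

# e.g. a 1000-word document condensed to a 430-word summary:
ratio = reduction(1000, 430)   # -> 57.0
```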

5.0 Conclusion and Future Works

The project was aimed at providing a document extraction software system for the summarization of the contents of a text document. This was achieved through the use of the Apriori algorithm and a clustering algorithm, which were used to weigh the best combinations of words and sentences containing most of the vital concepts of the document. The results of this application could be enhanced by using spectral clustering, possibly alongside feature extraction, to provide higher-level clustering.

Future work is expected to continue in the following directions:

  1. The system can be implemented for multiple documents.

  2. The system may also consider using a genetic algorithm for a faster implementation.

Fig. 3.2 Text Input Interface


References

  1. Camargo J. E., González F. A. A Multi-class Kernel Alignment Method for Image Collection Summarization. 2009; pp. 545-552.
  2. Sanderson M. Review of: Advances in Automatic Text Summarization, Inderjeet Mani and Mark T. Maybury (editors), Cambridge, MA: The MIT Press, 1999, xv+434 pp; hardbound, ISBN 0-262-13359-8. June 2000; pp. 280-281.
  3. http://ljournal.ru/wp-content/uploads/2016/08/d-2016-154.pdf, 2016.
  4. Baxendale P. B. Machine-Made Index for Technical Literature—An Experiment. October 1958; pp. 354-361.
  5. Luhn H. P. The Automatic Creation of Literature Abstracts. April 1958; pp. 159-165.
  6. Brin S., Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. April 1998; pp. 107-117.


Copyrights & License

International Journal Of Engineering And Computer Science, 2018.
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Article Details


Issue: Vol 7 No 02 (2018)
Page No.: 23526-23530
Section: Articles

How to Cite

Mgbeafulike, I. J., & Ejiofor, C. I. (2018). CONDENZA: A System for Extracting Abstract from a Given Source Document. International Journal of Engineering and Computer Science, 7(02), 23526-23530. Retrieved from http://ijecs.in/index.php/ijecs/article/view/3949
