Opait Summarizer

Welcome!

If you find yourself on the Internet a lot, following an idea or hunting for information, then the tool you are about to discover can help you save a lot of time. With the click of a button, right from your browser, you can examine a concise summary of any article before you decide whether or not to explore it further.

Installation

To install the Summarizer, drag the "Summarize" button below and drop it where you want it on the Favorites bar or Bookmarks bar of your browser.

Summarize

A bookmark to access the Summarizer will be added to your browser, similar to the example below.

To completely uninstall the Summarizer, right-click its bookmark and select the Delete option.

Typical Use

In this segment, we will walk through a typical use of the Summarizer as part of a Google search. In this example, we googled for Information Overload and viewed one of the articles in the search results (you can view the article by clicking here). While the document was displayed, we clicked on the Summarize bookmark and received the following summary.

  • The header of the summary panel is the Title of the article as defined in the HTML file. The title bar is hyperlinked to the original article.
  • Following the title are the 5 sentences from the article that received the highest relevancy scores, presented in their reading order. The number of sentences and the overall size of the summary can be controlled using options that are described later in this document.
  • As a footnote to the summary, the 5 highest-ranking keywords or phrases that appear frequently in the article are listed.

Modes of Operation

Certain Internet sites have password restrictions or generate their contents entirely using JavaScript. The basic use model described in the previous segment may not work for such sites (the text retrieved from the URL will have little or no content).

In order to use the Summarizer with restricted sites, you will need to first select the portion of the article that you wish to summarize and then click on the Summarize bookmark to produce a summary. The convention is that if the Web page has any selection, then the Summarizer will only summarize the selected text. If there is no selection on the page, then the Summarizer will process the entire article.

Action | Result
View article, press Summarize | Summary of the entire article is returned
View article, select a segment, press Summarize | Summary of the selected segment is returned
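
The selection convention above can be sketched in a few lines. This is only an illustration, not the bookmarklet's actual code: the function name choose_source is hypothetical, and it is an assumption that the choice maps onto the SourceText and SourceUrl parameters described under Server Interface.

```python
from typing import Dict, Optional


def choose_source(page_url: str, selected_text: Optional[str]) -> Dict[str, str]:
    """Pick the source to summarize, following the selection convention.

    Hypothetical helper: if the page has a selection, only the selected
    text is summarized; otherwise the whole article at page_url is used.
    """
    # A selection that is empty or only whitespace counts as "no selection".
    if selected_text and selected_text.strip():
        return {"SourceText": selected_text.strip()}
    return {"SourceUrl": page_url}


print(choose_source("http://example.com/article", None))
print(choose_source("http://example.com/article", "Only this paragraph."))
```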

Source Document

When the Summarizer is activated from within the browser using a bookmark, the Source Document is automatically set to either the URL address of the article or a selected segment within the article. Alternatively, you can access the Summarizer directly from http://Go.ShowSummary.com and set the Source Document manually.



Use one of the following three methods to specify the source of the document that you wish to summarize:

  1. Copy and paste the URL address (starting with http:// or https://) of an existing Web page.
  2. Use the browse button to select a file from your computer. You can select a variety of document formats including Text, PDF, HTML and Office documents.
  3. Copy and paste or directly type the text of the document to be summarized.

Options

Click on the Options bar to toggle between showing and hiding the Options panel.



Note: Options will be saved for 7 days if cookies are enabled in your browser.

Summary length:
The size of the summary as a percentage of the original document.
Source language:
The language of the source document. The language information can often be detected from the HTML file or from an analysis of the Unicode character ranges. For some documents, it may be necessary to set the language explicitly. The following languages are currently supported:
Minimum sentences:
A lower limit on the number of sentences that can be returned as a summary. This limit ensures that you will receive at least a certain number of sentences when the sizes of the original documents vary a great deal. If you want to always receive a fixed number of sentences (regardless of the original sizes), then set the number of sentences here and set the Summary length above to 0%.
Maximum keywords:
The maximum number of important words or phrases that should be returned. Set this value to zero if you don't want any keywords.
Highlight:
If you check this box, then, instead of returning the summary sentences, the entire source document will be returned with the summary sentences highlighted. It may be useful to examine the context of each ranked sentence to ensure that the summary makes sense. Please note that most of the formatting in the original document is ignored.
Show scores:
If checked, a relevancy score is appended to each returned sentence or keyword. The higher the score, the more important the sentence or keyword.
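
The interaction between Summary length and Minimum sentences can be illustrated with a short sketch. It assumes the percentage is applied to the sentence count and rounded up; the Summarizer may instead measure size in words or characters, and the function name summary_sentence_count is hypothetical.

```python
import math


def summary_sentence_count(total_sentences: int, summary_percent: int,
                           min_sentences: int) -> int:
    """Estimate how many sentences a summary would contain.

    Assumption: the Summary length percentage applies to the number of
    sentences. The Minimum sentences option acts as a lower limit, and
    the result can never exceed the size of the document itself.
    """
    by_percent = math.ceil(total_sentences * summary_percent / 100)
    wanted = max(by_percent, min_sentences)
    return min(wanted, total_sentences)


# A long article: the percentage decides the size.
print(summary_sentence_count(200, 10, 5))
# Summary length set to 0%: the minimum gives a fixed number of sentences.
print(summary_sentence_count(200, 0, 5))
```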

How it Works

"Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document." [1] Opait Summarizer was developed to study the effects of multiple techniques from Natural Language Processing (NLP) on automatic summarization of text documents. The summarizer assigns relevancy scores to various text elements found within the source document and uses the elements with highest scores to generate the summary.

  1. The format of the source document is detected and an appropriate text extraction filter is used to parse the text content of the document.
  2. Block structure of the document (e.g. sections and paragraphs) is extracted.
  3. Certain meta-data such as document title or description are identified.
  4. Sentences are extracted using a language-aware parser.
  5. Each sentence is tokenized by parsing for words, identifying stop words and applying a stemming algorithm.
  6. Document centroid is computed using average term frequencies.
  7. A set of features (see table below) is evaluated and an overall relevancy score is assigned to each sentence.
  8. Normalized sentence scores are used to rank sentences for inclusion in the summary. Duplicate or very similar sentences are also detected.
  9. NGrams are extracted to identify key words and phrases. Stop words are included but not counted towards NGram lengths.
  10. Term frequencies and NGram lengths are used to rank keywords and phrases. High-ranking NGrams are included in the summary.
  11. An appropriate view of the summary (e.g. HTML fragment, XML, JSON) is constructed and returned to the caller.
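
Steps 4 and 5 above (sentence extraction and tokenization) can be approximated with the sketch below. It is deliberately simplified and makes several assumptions: the sentence splitter is a naive regular expression rather than a language-aware parser, the stop-word list is a tiny illustrative sample, and naive_stem is a crude stand-in that is not the Porter 2 algorithm.

```python
import re
from typing import List

# Tiny illustrative stop-word list; a real list would be much larger.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def naive_stem(word: str) -> str:
    """Stand-in for a real stemmer (NOT Porter 2): strip a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(sentence: str) -> List[str]:
    """Lowercase, extract words, drop stop words, then stem what remains."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


for sentence in split_sentences("The reader is skimming pages. Summaries save time!"):
    print(sentence, "->", tokenize(sentence))
```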

The following features are currently used to assign relevancy scores to sentences:

Feature | Multiplier | Description
Skimming | 1.0 | The first sentence in the document, sentences within the first paragraph, and the first sentence of each paragraph are weighted higher than other sentences. Subsequent sentences in a paragraph are assigned progressively smaller weights.
Term Frequency | 2.0 | A logarithmic form of Term Frequency * Inverse Document Frequency (TF*IDF) is used to assign the highest weights to terms that appear often, but neither too rarely nor too frequently. Stop words are removed and a stemming algorithm is applied if one is defined. The current version supports only the Porter 2 stemming algorithm for English text.
Title Overlap | 1.5 | If a title is detected within the document, the number of terms common to each sentence and the title is used to assign a weight to the sentence.
Description Overlap | 1.5 | If a tagged description is detected within the document, the number of terms common to each sentence and the description is used to assign a weight to the sentence.
Length | 0.5 | A function is used to demote sentences that are significantly shorter or longer than the mean length of all sentences in the article.
Readability | 0.8 | This feature applies only to English documents. It assigns a weight to a sentence based on the proportional appearance of so-called "Spache" words, which is a measure of how readable the content of the sentence is.
Graph Similarity | 2.0 | For smaller documents, a fully connected graph is constructed with sentences as nodes and the Cosine Similarity between sentences as edges. This graph exists in an N-dimensional space, where N is the number of unique terms in the source document. The mean similarity of a sentence to all other sentences is used to assign a weight to the sentence. This algorithm can become expensive for larger documents. If the number of sentences is larger than a threshold (currently set at 100), Graph Similarity is disabled in favor of a similarity measure between a sentence and the Centroid of the document (see next feature).
Centroid Similarity | 1.0 | A centroid vector of the document is calculated as the mean of term frequencies (more specifically, TF*IDF values). The cosine similarity of each sentence to the centroid vector is used as the weight of the sentence.
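
The Graph Similarity and Centroid Similarity features can be illustrated with the sketch below. It is an approximation under stated assumptions: sentence vectors are plain dictionaries of term weights (raw counts in the usage example, where the Summarizer uses TF*IDF values), and the names cosine, centroid and similarity_weights are hypothetical. The threshold of 100 sentences comes from the table above.

```python
import math
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]  # term -> weight (TF*IDF in the real feature)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: List[Vector]) -> Vector:
    """Mean term weight over all sentence vectors."""
    total: Dict[str, float] = defaultdict(float)
    for v in vectors:
        for term, weight in v.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}


def similarity_weights(vectors: List[Vector], graph_limit: int = 100) -> List[float]:
    """Graph similarity for small documents, centroid similarity otherwise."""
    n = len(vectors)
    if n == 0:
        return []
    if 1 < n <= graph_limit:
        # Mean cosine similarity of each sentence to all other sentences.
        return [
            sum(cosine(v, u) for j, u in enumerate(vectors) if j != i) / (n - 1)
            for i, v in enumerate(vectors)
        ]
    # Too many sentences (or only one): compare each sentence to the centroid.
    c = centroid(vectors)
    return [cosine(v, c) for v in vectors]


sentences = [{"overload": 1.0, "inform": 1.0}, {"inform": 2.0}, {"weather": 1.0}]
print(similarity_weights(sentences))
```

The pairwise version needs a number of comparisons that grows with the square of the sentence count, which is why a fallback to the centroid is attractive for long documents.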

The overall score is a linear combination of the weighted individual features and is normalized to fall into the [0, 100] range. The Multiplier column above shows the empirically chosen default booster for each feature. Any feature may be disabled by setting the corresponding Multiplier to zero.
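
As a rough illustration of the linear combination, the sketch below multiplies each feature value by its Multiplier and sums the results. The multipliers are the defaults from the table above. The normalization step is an assumption: the document does not say how scores are mapped to 0 through 100, so this sketch scales them so that the top sentence receives 100, and it assumes feature values are non-negative. The function name overall_scores is hypothetical.

```python
from typing import Dict, List

# Default multipliers, taken from the feature table.
MULTIPLIERS = {
    "Skimming": 1.0,
    "Term Frequency": 2.0,
    "Title Overlap": 1.5,
    "Description Overlap": 1.5,
    "Length": 0.5,
    "Readability": 0.8,
    "Graph Similarity": 2.0,
    "Centroid Similarity": 1.0,
}


def overall_scores(features_per_sentence: List[Dict[str, float]],
                   multipliers: Dict[str, float] = MULTIPLIERS) -> List[float]:
    """Combine per-sentence feature values into scores between 0 and 100."""
    raw = [
        sum(multipliers.get(name, 0.0) * value for name, value in feats.items())
        for feats in features_per_sentence
    ]
    # Assumed normalization: scale so that the best sentence scores 100.
    top = max(raw, default=0.0)
    if top <= 0.0:
        return [0.0 for _ in raw]
    return [100.0 * r / top for r in raw]


features = [
    {"Skimming": 1.0, "Term Frequency": 0.5},
    {"Skimming": 0.5, "Term Frequency": 0.25},
]
print(overall_scores(features))

# Disabling a feature: set its multiplier to zero.
no_skimming = dict(MULTIPLIERS, Skimming=0.0)
print(overall_scores(features, no_skimming))
```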


Displaying Features:

If you hold down the Alt key while clicking on the Summarize button, a detailed view of all the features that contributed to the ranking of the sentences in the summary will be displayed.


References:
  1. Wikipedia, "Automatic Summarization."
  2. Stanford University lecture notes, CS276B Web Search and Mining, Lecture 14: Text Mining II.
  3. J.Y. Yeh, H.R. Ke, and W.P. Yang, "iSpreadRank: Ranking sentences for extraction-based summarization using feature weight propagation in the sentence similarity network," Expert Systems with Applications, vol. 35, no. 3, pp. 1451-1462, 2008.
  4. Juan Manuel Torres Moreno, Automatic Text Summarization, Wiley, September 25, 2014.
  5. Dipanjan Das and André F.T. Martins, "A Survey on Automatic Text Summarization," Language Technologies Institute, Carnegie Mellon University, November 21, 2007.
  6. MEAD Documentation: a multi-document summarization toolkit written in Perl.
  7. G. Spache, "The SPACHE Readability Formula," 1953.
  8. G. Erkan and D. Radev, "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization," Journal of Artificial Intelligence Research, vol. 22, pp. 457-479, 2004.

Server Interface

The Web server at http://Go.ShowSummary.com will accept both GET and POST requests with the parameters listed below. We recommend POST requests for higher security and to overcome the size limit of GET requests (especially when passing selected text).

Parameter | Type | Description
SourceUrl | String | Address of the source document.
SourceText | String | Actual text of the source document.
Language | String | Language code of the document.
SummaryPercent | Integer | Summary length as percentage of the original.
MinSentences | Integer | Minimum number of sentences in the summary.
MaxKeywords | Integer | Maximum number of keywords to return.
Highlight | Boolean | Highlight summary sentences in context.
ShowScores | Boolean | Include relevancy scores.
ShowDebug | Boolean | Include raw scores attached to features that are used to rank sentences.
AutoRun | Boolean | Generate summary immediately (as in the bookmarklet).

Notes:
  • You may use either 1/0 or true/false for Boolean values.
  • Language codes should be in the standard two-letter format. For example, use "en" for English. If a language code is not specified, then the language will be "Auto-Detected".
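
A POST request with these parameters could be prepared as in the sketch below, which uses only the Python standard library. Some details are assumptions: that the server accepts a form-encoded body, the function name build_summary_request, and the default values shown, which are illustrative rather than the server's defaults. The sketch only builds the request and does not send it; the format of the response is not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://Go.ShowSummary.com"


def build_summary_request(source_url: str, summary_percent: int = 10,
                          min_sentences: int = 5, max_keywords: int = 5,
                          show_scores: bool = False) -> Request:
    """Prepare (but do not send) a POST request for the Summarizer server."""
    params = {
        "SourceUrl": source_url,
        "Language": "en",  # omit to let the server auto-detect the language
        "SummaryPercent": summary_percent,
        "MinSentences": min_sentences,
        "MaxKeywords": max_keywords,
        "ShowScores": "true" if show_scores else "false",
        "AutoRun": "true",  # generate the summary immediately
    }
    # Assumption: parameters are sent as a form-encoded request body.
    body = urlencode(params).encode("utf-8")
    return Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_summary_request("http://example.com/article", show_scores=True)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen would send it. To summarize selected text instead of a URL, SourceText would replace SourceUrl in the parameters.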

Thank you!

Please send us feedback on any aspects of this software and documentation, including:
  • Any problems or concerns.
  • Features and functionality that you would like to see in future releases.
  • Your specific use of the software and how we can make it more usable for you.
@ Contact us
© 2014 Opait