Download the PHP package marcelklehr/readability.php without Composer

On this page you can find all versions of the PHP package marcelklehr/readability.php. You can download and install these versions without Composer. Any dependencies are resolved automatically.


Information about the package readability.php

Readability.php

PHP port of Mozilla's Readability.js. Parses HTML text (usually news and other articles) and returns the title, author, main image and text content without nav bars, ads, footers, or anything that isn't the main body of the text. It analyzes each node, gives it a score, and determines what's relevant and what can be discarded.

The project aims to be a 1:1 port of Mozilla's version and to closely follow all changes introduced there, although there are some major structural differences. Most of the code is a 1:1 copy (even the comments were imported), but some functions and structures were adapted to better suit PHP.

Lead Developer: Andres Rey

Requirements

PHP 7.3+, ext-dom, ext-xml, and ext-mbstring. To install all these dependencies (in the rare case your system does not have them already), you could try something like this in *nix-like environments, adjusting the version in the package names to match your installed PHP:

$ sudo apt-get install php7.1-xml php7.1-mbstring
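
To confirm that the requirements listed above are met before installing, a small check along these lines may help. It is only a sketch: missingRequirements is a hypothetical helper name, and the version threshold is taken from the requirement stated above.

```php
<?php
/**
 * Hypothetical helper: lists which of the stated requirements are missing.
 */
function missingRequirements(): array
{
    $missing = [];

    // The text above states PHP 7.3+ as the minimum version.
    if (version_compare(PHP_VERSION, '7.3.0', '<')) {
        $missing[] = 'PHP 7.3+ (found ' . PHP_VERSION . ')';
    }

    // extension_loaded() expects the name without the "ext-" prefix.
    foreach (['dom', 'xml', 'mbstring'] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'ext-' . $extension;
        }
    }

    return $missing;
}

$missing = missingRequirements();
echo $missing === []
    ? "All requirements met\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```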

How to use it

First, require the library using Composer:

composer require marcelklehr/readability.php

Then create a Readability object, passing it a Configuration object, feed your HTML to the parse() function, and echo the Readability object.

Your script will output the parsed text or inform about any errors. You should always wrap the ->parse call in a try/catch block because if the HTML cannot be parsed correctly, a ParseException will be thrown.
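
The steps above might look roughly like the sketch below. Two things here are assumptions and not confirmed by this page: the namespace (carried over from the upstream project, so check the autoload section of this package's composer.json) and the helper name extractReadableHtml, which is hypothetical. Composer's autoloader is expected to be loaded already.

```php
<?php
// Assumes Composer's autoloader has been loaded beforehand, e.g.:
// require_once 'vendor/autoload.php';

// ASSUMPTION: namespace taken from the upstream project; this fork may differ.
use andreskrey\Readability\Configuration;
use andreskrey\Readability\ParseException;
use andreskrey\Readability\Readability;

/**
 * Hypothetical helper: returns the readable article, or an error message.
 */
function extractReadableHtml(string $html): string
{
    $readability = new Readability(new Configuration());

    try {
        $readability->parse($html);

        // Echoing the object (or casting it to a string) yields the parsed text.
        return (string) $readability;
    } catch (ParseException $e) {
        // Thrown when the HTML cannot be parsed correctly.
        return sprintf('Error processing text: %s', $e->getMessage());
    }
}
```

A call such as echo extractReadableHtml(file_get_contents('article.html')); would then print either the article or the error message; the file name is a placeholder.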

If you want finer control over the output, call the properties one by one and wrap them in your own HTML.

The available properties include the title, author, main image, and text content.

If you need to tweak the final HTML you can get the DOMDocument of the result by calling ->getDOMDocument().
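
A sketch of the property-by-property approach follows. The getter names (getTitle, getAuthor, getImage, getContent) and the namespace are assumptions based on the upstream project and should be verified against this package; renderArticle is a hypothetical helper.

```php
<?php
// Assumes Composer's autoloader has been loaded beforehand.
// ASSUMPTION: namespace and getter names follow the upstream project.
use andreskrey\Readability\Readability;

/**
 * Hypothetical helper: wraps the parsed properties in custom HTML.
 * Expects a Readability object on which parse() has already succeeded.
 */
function renderArticle(Readability $readability): string
{
    // Plain-text properties are escaped before being placed in markup.
    $title  = htmlspecialchars((string) $readability->getTitle());
    $author = htmlspecialchars((string) $readability->getAuthor());
    $image  = htmlspecialchars((string) $readability->getImage());

    $html = "<h1>{$title}</h1>";
    if ($author !== '') {
        $html .= "<p class=\"byline\">{$author}</p>";
    }
    if ($image !== '') {
        $html .= "<img src=\"{$image}\" alt=\"\">";
    }

    // The content is already HTML, so it is appended without escaping.
    return $html . $readability->getContent();
}
```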

Options

You can change the behaviour of Readability via the Configuration object. For example, if you want to fix relative URLs and declare the original URL, you can set both on the configuration.

You can also pass an array of configuration parameters to the constructor.

Then pass this Configuration object to Readability. When calling an option through its native setter, remember to prepend set to the option name.
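
Both configuration styles described above could be sketched as follows. The namespace, the setter names (setFixRelativeURLs, setOriginalURL) and the array keys are assumptions based on the upstream project; the two function names are hypothetical.

```php
<?php
// Assumes Composer's autoloader has been loaded beforehand.
// ASSUMPTION: namespace, setters and array keys follow the upstream project.
use andreskrey\Readability\Configuration;

/**
 * Hypothetical helper: configures Readability through native setters.
 */
function buildConfigurationWithSetters(string $articleUrl): Configuration
{
    $configuration = new Configuration();
    $configuration->setFixRelativeURLs(true);   // rewrite relative links
    $configuration->setOriginalURL($articleUrl); // base URL for that rewrite

    return $configuration;
}

/**
 * Hypothetical helper: the same settings passed as an array.
 * Note that the keys carry no "set" prefix.
 */
function buildConfigurationFromArray(string $articleUrl): Configuration
{
    return new Configuration([
        'fixRelativeURLs' => true,
        'originalURL'     => $articleUrl,
    ]);
}
```

Either result would then be handed to the constructor, as in new Readability($configuration).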

Debug log

Logging is optional and you will have to inject your own logger to save all the debugging messages. To do so, use a logger that implements the PSR-3 logging interface and pass it to the configuration object.

In the log you will find information about the parsed nodes, why they were removed, and why they were considered relevant to the final article.
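
As an illustration, injecting a Monolog logger might look like the sketch below. Monolog is just one PSR-3 implementation and must be installed separately; the setLogger setter and the namespace are assumptions based on the upstream project, and buildLoggingConfiguration is a hypothetical helper.

```php
<?php
// Assumes Composer's autoloader has been loaded and monolog/monolog is installed.
// ASSUMPTION: namespace and setLogger() follow the upstream project.
use andreskrey\Readability\Configuration;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

/**
 * Hypothetical helper: returns a configuration that writes debug
 * messages to the given log file.
 */
function buildLoggingConfiguration(string $logFile): Configuration
{
    // Monolog's Logger implements the PSR-3 LoggerInterface.
    $logger = new Logger('Readability');
    $logger->pushHandler(new StreamHandler($logFile));

    $configuration = new Configuration();
    $configuration->setLogger($logger);

    return $configuration;
}
```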

Limitations

Of course, the main limitation is PHP. Content that a website loads through lazy loading, AJAX, or any other JavaScript-driven call will be missing, because the scripts are never run, and the resulting text will be incorrect compared to the readability.js results. Every article you want to parse with readability.php needs to be complete, with all of its content already present in the HTML.

Known Issues

JavaScript spilling into the text body

DOMDocument has trouble parsing JavaScript that contains unescaped HTML inside strings. Consider a page with an unclosed div, a comment, and a script whose string literal contains a closing div tag.

If you remove the scripts from the HTML (as Readability does), you would expect to end up with just one div and one comment in the final HTML. The problem is that libxml treats the closing div tag inside the JavaScript string as an HTML tag, which closes the unclosed div and leaves the rest of the JavaScript as a string within a P tag. If you save that node, the leftover JavaScript ends up as visible text in the final HTML.

This is a libxml issue and not a Readability.php bug.

There's a workaround for this: using the summonCthulhu option. This will remove all script tags via regex, which is not ideal because you may end up summoning the lord of darkness.
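
The idea behind a regex-based workaround can be sketched as follows. This is an illustration only: stripScriptTags is a hypothetical helper, the pattern is not necessarily the one the library uses, and the sample input is made up to resemble the case described above.

```php
<?php
/**
 * Hypothetical helper: removes <script> elements with a regular expression
 * before the HTML ever reaches DOMDocument.
 */
function stripScriptTags(string $html): string
{
    // "i" makes the match case-insensitive, "s" lets "." match newlines,
    // and the non-greedy ".*?" removes each script block separately.
    $clean = preg_replace('/<script\b[^>]*>.*?<\/script\s*>/is', '', $html);

    // preg_replace() returns null on failure; fall back to the input then.
    return $clean ?? $html;
}

$html = <<<'HTML'
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
    var test = '</div>';
    // I should not appear in the result
</script>
HTML;

$clean = stripScriptTags($html);
echo $clean;
```

Because the script block is gone before parsing starts, libxml never sees the closing div tag inside the string. Regular expressions cannot handle every form of markup, which is why the option carries its warning.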

&nbsp; entities disappearing

&nbsp; entities are converted to spaces automatically by libxml and there's no way to disable it.

Self-closing tags rendering as fully expanded tags

Self-closing tags like <br /> are automatically expanded to <br></br>. There is no way to disable this in libxml.

Dependencies

Readability.php uses the PSR Log interface to define the allowed type of loggers. Monolog is only required on development installations (the --dev option during composer install).

How it works

Readability parses all the text with DOMDocument, scans the text nodes and gives them a score based on the number of words, the number of links and the type of element. Then it selects the highest scoring element and creates a new DOMDocument with all its siblings. Each sibling is scored to discard useless elements, like nav bars, empty nodes, etc.
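
To make the scoring idea concrete, here is a deliberately simplified sketch. It is not the library's algorithm: findTopCandidate is a hypothetical function, and the formula (words plus commas, scaled down by link density, credited to the parent element) is an assumption chosen for illustration.

```php
<?php
/**
 * Simplified, hypothetical scoring; NOT the library's actual algorithm.
 * Each <p> earns points for its words and commas, reduced by its link
 * density, and the points are credited to its parent element.
 */
function findTopCandidate(string $html): ?DOMElement
{
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $scores = [];     // node path of parent => accumulated score
    $candidates = []; // node path of parent => parent element

    foreach ($dom->getElementsByTagName('p') as $paragraph) {
        $text = trim($paragraph->textContent);
        $parent = $paragraph->parentNode;
        if ($text === '' || !($parent instanceof DOMElement)) {
            continue;
        }

        // Link density: share of the paragraph's text that sits inside links.
        $linkLength = 0;
        foreach ($paragraph->getElementsByTagName('a') as $link) {
            $linkLength += strlen(trim($link->textContent));
        }
        $linkDensity = min(1.0, $linkLength / strlen($text));

        $score = (str_word_count($text) + substr_count($text, ','))
            * (1 - $linkDensity);

        $key = $parent->getNodePath();
        $scores[$key] = ($scores[$key] ?? 0) + $score;
        $candidates[$key] = $parent;
    }

    // Pick the parent with the highest accumulated score.
    $best = null;
    foreach ($scores as $key => $score) {
        if ($best === null || $score > $scores[$best]) {
            $best = $key;
        }
    }

    return $best === null ? null : $candidates[$best];
}

$html = '<html><body>'
    . '<div id="nav"><p><a href="/a">Home</a> <a href="/b">News</a></p></div>'
    . '<div id="article"><p>First paragraph, with several words, and commas.</p>'
    . '<p>Second paragraph with even more words in it.</p></div>'
    . '</body></html>';

$top = findTopCandidate($html);
echo $top->getAttribute('id'), "\n"; // prints "article"
```

The navigation block scores close to zero because almost all of its text is link text, while the article block accumulates points from both paragraphs.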

Testing

Any version of PHP installed locally should be enough to develop new features and add new test cases.

Code porting

Up to date with readability.js as of 19 Nov 2018.

License

Based on Arc90's readability.js (1.7.1) script available at: http://code.google.com/p/arc90labs-readability

Copyright (c) 2010 Arc90 Inc

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

All versions of readability.php with dependencies

Requires:
  • php: >=7.0.0
  • ext-dom: *
  • ext-xml: *
  • ext-mbstring: *
  • psr/log: ^1.0