Download the PHP package crscheid/php-article-extractor without Composer

On this page you can find all versions of the php package crscheid/php-article-extractor. It is possible to download/install these versions without Composer. Possible dependencies are resolved automatically.

FAQ

After the download, include the autoloader once with require_once('vendor/autoload.php'); and then import the classes you need with use statements.

If you use only one package, a project is not needed. But if you use more than one package, it is not possible to import the classes with use statements without a project.

In general, it is recommended to always use a project to download your libraries, since an application normally needs more than one library.
Some PHP packages are not free to download and are therefore hosted in private repositories. In this case, credentials are needed to access them. If a package comes from a private repository, please enter the credentials in the auth.json textarea.

  • Some hosting environments cannot be accessed through a terminal or SSH, so Composer cannot be used there.
  • Composer can be complicated to use, especially for beginners.
  • Composer needs a lot of resources, which are sometimes not available on simple web hosting.
  • If you are using private repositories, you don't need to share your credentials. You can set up everything on our site and then provide a simple download link to your team members.
  • Simplify your Composer build process. Use our own command line tool to download the vendor folder as a binary. This makes your build process faster, and you don't need to expose your credentials for private repositories.

Information about the package php-article-extractor

PHP Article extractor

This is a web article parsing and language detection library for PHP. This library reads the article content from a web page, removing all HTML and providing just the raw text, suitable for text to speech or machine learning processes.

For a project I developed, I found many existing open source solutions to be good starting points, but each failed in its own way. This library aggregates three different approaches into a single solution and adds language detection.

How To Use

This library is distributed via packagist.org, so you can use Composer to retrieve it by requiring the package crscheid/php-article-extractor.

Calling via URL

This library will attempt to retrieve the HTML for you. You simply need to create an ArticleExtractor object and call the processURL function on it, passing in the desired URL.

The function processURL returns an array containing the title, text, and metadata associated with the request. If the text is null, parsing failed.

The field result_url will be different if the library followed redirects. This field represents the final page actually retrieved after redirects.
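As a minimal sketch of how the returned array might be checked, the helper below treats a null text as a failed parse and compares the requested URL with result_url to see whether redirects were followed. The helper name summarizeResult is hypothetical and not part of the library; only the field names title, text, and result_url come from the description above. The commented usage assumes the class is autoloaded under the namespace Cscheide\ArticleExtractor, which should be verified against the installed package.

```php
<?php
// Hypothetical helper (not part of the library): summarizes the array
// returned by processURL. Only the field names 'title', 'text', and
// 'result_url' are taken from the documentation above.
function summarizeResult(string $requestedUrl, array $result): string
{
    // A null (or missing) text indicates that parsing failed.
    if (!isset($result['text'])) {
        return 'Parsing failed for ' . $requestedUrl;
    }

    $summary = 'Parsed "' . ($result['title'] ?? '') . '" ('
        . strlen($result['text']) . ' bytes of text)';

    // result_url differs from the requested URL when redirects were followed.
    $finalUrl = $result['result_url'] ?? $requestedUrl;
    if ($finalUrl !== $requestedUrl) {
        $summary .= ', redirected to ' . $finalUrl;
    }

    return $summary;
}

// Assumed usage with the library (requires the package to be installed,
// and the namespace below is an assumption):
//   require_once 'vendor/autoload.php';
//   $extractor = new \Cscheide\ArticleExtractor\ArticleExtractor();
//   $url = 'https://example.com/some-article';
//   echo summarizeResult($url, $extractor->processURL($url)), PHP_EOL;
```

Keeping the check in a separate function means the same logic can be applied to arrays produced from raw HTML, where result_url is absent and the requested URL is used as the fallback.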

Calling with HTML

If you already have the HTML, you can pass it to the parseHTML function, which processes it through the same logic.

The function parseHTML returns an array containing the title, text, and metadata extracted from the HTML. If the text is null, parsing failed.

The field result_url is not included in this case, since the library does not retrieve the HTML itself.

You can also create the ArticleExtractor object by passing in a key for the language detection service as well as a custom User-Agent string. See the Options section below for more information.

Options

Language Detection Methods

Language detection is handled by either looking for language specifiers within the HTML meta data or by utilizing the Detect Language service.

If the language of the article can be detected, the language code in ISO 639-1 format and the detection method are returned in the fields language and language_method respectively. When detection succeeds, language_method is either html or service.

If language detection fails or is not available, both of these fields will be returned as null.
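The following sketch shows one way the two fields might be interpreted by calling code. The helper name describeLanguage is hypothetical; the field names language and language_method and the values html and service are the ones described above.

```php
<?php
// Hypothetical helper (not part of the library): describes the
// 'language' and 'language_method' fields of a result array.
function describeLanguage(array $result): string
{
    $language = $result['language'] ?? null;
    $method = $result['language_method'] ?? null;

    // Both fields are null when detection failed or was not available.
    if ($language === null || $method === null) {
        return 'language unknown';
    }

    // Per the description above, the method is either 'html' or 'service'.
    $source = $method === 'html'
        ? 'HTML metadata'
        : 'the Detect Language service';

    return $language . ' (detected via ' . $source . ')';
}
```

For example, an array with language set to en and language_method set to html yields "en (detected via HTML metadata)".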

Detect Language requires an API key, which you can sign up for. However, you can also use this library without it. In that case, if the HTML metadata does not contain information about the language of the article, language and language_method will be returned as null.

To use this library with the language detection service, create the ArticleExtractor object by passing in your Detect Language API key.

Setting User Agent

It is possible to set the User-Agent for outgoing requests. To do so, pass the desired User-Agent string as the second argument to the constructor.

Force Reading Method

It is possible to force the method by which reading is attempted: Readability, Goose, or Goose with our custom processing. This can come in handy when Readability or Goose has issues with particular websites.

To force the method, provide it as the third argument to the constructor. The four valid methods are readability, goose, goosecustom, and custom.
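When the method name comes from configuration or user input, it may help to validate it before passing it on. The sketch below checks a value against the four names listed above. The helper name normalizeMethod is hypothetical, and the commented constructor call assumes that null is accepted for the first two arguments and that a null method means no method is forced; neither is confirmed here.

```php
<?php
// Hypothetical helper (not part of the library): returns the normalized
// method name if it is one of the four documented values, otherwise null.
function normalizeMethod(?string $method): ?string
{
    // The four method names listed in the documentation above.
    $validMethods = ['readability', 'goose', 'goosecustom', 'custom'];

    if ($method === null) {
        return null;
    }

    $method = strtolower(trim($method));

    return in_array($method, $validMethods, true) ? $method : null;
}

// Assumed usage (namespace and null defaults are assumptions):
//   $extractor = new \Cscheide\ArticleExtractor\ArticleExtractor(
//       null,                      // no Detect Language key
//       null,                      // default User-Agent
//       normalizeMethod('Goose')   // forces 'goose'
//   );
```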

Output Format

As of version 1.0, the output format has been altered to provide newline breaks for headings. This is important especially for natural language processing applications in determining sentence boundaries. If this behavior is not desired, simply strip out the additional newlines where needed.

This change was made because, when heading and paragraph HTML elements are simply stripped out, there is often no separation between a heading and the following sentence.
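If the additional newlines are not wanted, they can be removed after extraction. The sketch below is one possible approach, using a hypothetical helper named flattenNewlines that collapses every run of line breaks, together with any surrounding whitespace, into a single space.

```php
<?php
// Hypothetical helper (not part of the library): replaces each run of
// line breaks, plus any whitespace around it, with a single space.
function flattenNewlines(string $text): string
{
    $flat = preg_replace('/\s*[\r\n]+\s*/', ' ', $text);

    // preg_replace returns null on error; fall back to the input then.
    return trim($flat ?? $text);
}
```

Note that flattening removes the heading boundaries again, so it is best applied only where sentence separation does not matter.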

Example of Output Format for Text Field

Running tests

Unit tests are included in this distribution and can be run utilizing PHPUnit after installing dependencies. The recommended approach is to use Docker for this purpose, so you then don't even need to have dependencies installed on your system.

Note: Please set the environment variable DETECT_LANGUAGE_KEY with your Detect Language key in order for language detection in unit tests to work properly.
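For illustration, PHP code can read this variable with getenv. The sketch below uses a hypothetical helper named detectLanguageKeyFromEnv and is not taken from the library's test suite; it returns null when the variable is unset or empty, so the caller can fall back to running without the service.

```php
<?php
// Hypothetical helper (not part of the library): returns the Detect
// Language key from the environment, or null when it is unset or empty.
function detectLanguageKeyFromEnv(): ?string
{
    $key = getenv('DETECT_LANGUAGE_KEY');

    // getenv returns false when the variable does not exist.
    return ($key === false || $key === '') ? null : $key;
}
```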

Installing Dependencies

Dependencies are downloaded using the Composer Docker image. Note the use of the --ignore-platform-reqs flag, since some of our dependencies do not yet support PHP 8.

Running Unit Tests

The tests are run with the PHPUnit dependency downloaded in the previous step, inside a PHP 7.4 command line environment.


All versions of php-article-extractor with dependencies

Requires:
  • php: ~7.2
  • scotteh/php-goose: ^1.1
  • thesoftwarefanatics/php-html-parser: ^1.8.0
  • detectlanguage/detectlanguage: 2.*
  • fivefilters/readability.php: 2.1.0
