Download the PHP package bigwhoop/sentence-breaker without Composer

This page lists all versions of the PHP package bigwhoop/sentence-breaker. These versions can be downloaded and installed without Composer; any dependencies are resolved automatically.

After the download, include the autoloader once with require_once('vendor/autoload.php');. After that, import the classes you need with use statements.


Information about the package sentence-breaker

sentence-breaker

Sentence boundary disambiguation (SBD) - or sentence breaking - library written in PHP.

Installation

composer require bigwhoop/sentence-breaker

Usage

<?php
require_once 'vendor/autoload.php';

use Bigwhoop\SentenceBreaker\SentenceBreaker;

$breaker = new SentenceBreaker();
$breaker->addAbbreviations(['Dr', 'Prof']);

// returns a generator, the text is parsed lazily
$sentences = $breaker->split("Hello Dr. Jones! How are you? I'm fine, thanks!");

// get first
$sentences->current(); // 'Hello Dr. Jones!'

// get all as array
iterator_to_array($sentences); // ['Hello Dr. Jones!', 'How are you?', "I'm fine, thanks!"]

Rules

By default, the rules/rules.ini file is loaded. Its format is a list of patterns, each followed by a probability:

TOKEN [... TOKEN] = PROBABILITY
T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75

The token enclosed in < and > is the one the pattern is applied to. The example pattern above would be applied to each T_PERIOD token found in the input data. The probability (here 75) defines how likely a sentence boundary is after this token.

So for this pattern to match, the input text would need to contain something along the lines of "This is Waldo. He likes dogs."
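
To make the rule format concrete, here is a minimal sketch of how one such line could be split into its parts. The function name parseRuleLine() and the returned array shape are assumptions made for this illustration; they are not part of the library, which may parse rules.ini differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): parse one rule line of the
 * form "TOKEN [... TOKEN] = PROBABILITY".
 *
 * @return array{tokens: list<string>, target: int, probability: int}
 */
function parseRuleLine(string $line): array
{
    // Split "pattern = probability" at the last "=".
    $pos = strrpos($line, '=');
    if ($pos === false) {
        throw new InvalidArgumentException("Missing '=' in rule: $line");
    }

    $probability = (int) trim(substr($line, $pos + 1));
    $parts = preg_split('/\s+/', trim(substr($line, 0, $pos)), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    $target = null;
    foreach ($parts as $i => $part) {
        // The token wrapped in < > is the one the pattern is applied to.
        if (preg_match('/^<(\w+)>$/', $part, $m) === 1) {
            $target = $i;
            $part = $m[1];
        }
        $tokens[] = $part;
    }

    if ($target === null) {
        throw new InvalidArgumentException("No <TOKEN> marked in rule: $line");
    }

    return ['tokens' => $tokens, 'target' => $target, 'probability' => $probability];
}

$rule = parseRuleLine('T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75');
```

For the example rule, the sketch yields four token names, marks index 1 (T_PERIOD) as the target, and stores 75 as the probability.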

The available tokens are:

| Token | Description | Example |
| --- | --- | --- |
| T_WORD | A non-capitalized word. | hello, world |
| T_CAPITALIZED_WORD | A capitalized word. | Hello, World |
| T_EOF | The end of the input. | - |
| T_PERIOD | A period. | . |
| T_EXCLAMATION_POINT | An exclamation point. | ! |
| T_QUESTION_MARK | A question mark. | ? |
| T_QUOTED_STR | A string enclosed in single or double quotes. | "Hello world!", 'Hello world...' |
| T_WHITESPACE | Whitespace characters like spaces, LF, CR. | - |
| T_ABBREVIATION | An abbreviation without the trailing period. | Dr, Prof |

TIP: You can add your own rules via $breaker->addRules().

Abbreviation Providers

Inside the data directory are flat files containing abbreviations (in English), collected from various sources. They can be loaded like this:

use Bigwhoop\SentenceBreaker\Abbreviations\FlatFileProvider;

// Load legal.txt and biz.txt
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['legal', 'biz']));

// Load all files
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['*']));

To make it fast and easy, all abbreviations are available in the all.txt file. You can load it like this:

$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['all']));

How does it work?

The input text is run through a lexer.

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens, i.e. meaningful character strings.

So for example He asked: "What's on TV?" On T.V.? I have no clue. Really! would result in the following sequence of tokens:

"He" "asked:" T_QUOTED_STR "On" "T.V" T_PERIOD T_QUESTION_MARK
"I" "have" "no" "clue" T_PERIOD "Really" T_EXCLAMATION_POINT
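
The lexing step can be illustrated with a deliberately simplified tokenizer. This is only a sketch under stated assumptions: toyLex() is a hypothetical function, it knows nothing about abbreviations (so it would split "T.V." differently from the sequence above), and it only recognizes double-quoted strings. The library's own lexer is more capable.

```php
<?php
declare(strict_types=1);

/**
 * Toy lexer (an illustration, not the library's lexer).
 *
 * @return list<array{type: string, value: string}>
 */
function toyLex(string $text): array
{
    // Alternatives are tried left to right, so quoted strings win over words.
    $pattern = '/(?<T_QUOTED_STR>"[^"]*")'
        . '|(?<T_WHITESPACE>\s+)'
        . '|(?<T_PERIOD>\.)'
        . '|(?<T_EXCLAMATION_POINT>!)'
        . '|(?<T_QUESTION_MARK>\?)'
        . '|(?<T_CAPITALIZED_WORD>\p{Lu}[^\s.!?"]*)'
        . '|(?<T_WORD>[^\s.!?"]+)/u';

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $match) {
        foreach ($match as $name => $value) {
            // Skip numeric keys; the first non-empty named group is the token type.
            if (is_string($name) && $value !== '') {
                $tokens[] = ['type' => $name, 'value' => $value];
                break;
            }
        }
    }
    $tokens[] = ['type' => 'T_EOF', 'value' => ''];

    return $tokens;
}

$types = array_column(toyLex('Hi Bob. Bye!'), 'type');
```

Each token keeps its original text in value, so a later step can glue the pieces back together without losing whitespace.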

This sequence of tokens is then run through a probability calculator, which computes for each token the probability that it marks a sentence boundary. The calculator uses rules that are matched against each token. For example, if a T_EXCLAMATION_POINT is followed by a capitalized string, the chance of it being a sentence boundary is 100%.
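
The matching step could look roughly like the following sketch. It assumes rules are given as arrays with tokens, target and probability keys, and that the highest probability among all matching rules wins; both are assumptions for this illustration, and the library's calculator may combine rules differently. The second rule in the example is made up for demonstration.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative rule matcher (not the library's calculator).
 *
 * @param list<string> $types token types in input order
 * @param list<array{tokens: list<string>, target: int, probability: int}> $rules
 * @return array<int, int> boundary probability (0-100) per token index
 */
function boundaryProbabilities(array $types, array $rules): array
{
    $probabilities = array_fill(0, count($types), 0);

    foreach ($types as $i => $type) {
        foreach ($rules as $rule) {
            // Align the rule so that its target token sits on position $i.
            $start = $i - $rule['target'];
            if ($start < 0) {
                continue;
            }
            $window = array_slice($types, $start, count($rule['tokens']));
            if ($window === $rule['tokens']) {
                // Assumption: the highest matching probability wins.
                $probabilities[$i] = max($probabilities[$i], $rule['probability']);
            }
        }
    }

    return $probabilities;
}

$types = [
    'T_CAPITALIZED_WORD', 'T_PERIOD', 'T_WHITESPACE',
    'T_CAPITALIZED_WORD', 'T_EXCLAMATION_POINT', 'T_EOF',
];
$rules = [
    ['tokens' => ['T_CAPITALIZED_WORD', 'T_PERIOD', 'T_WHITESPACE', 'T_CAPITALIZED_WORD'],
     'target' => 1, 'probability' => 75],
    // Hypothetical rule: an exclamation point right before the end of input.
    ['tokens' => ['T_EXCLAMATION_POINT', 'T_EOF'], 'target' => 0, 'probability' => 100],
];
$probabilities = boundaryProbabilities($types, $rules);
```

Tokens that no rule targets keep a probability of 0, so they never start a new sentence on their own.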

In the end, the tokens are re-assembled into sentences. The user can choose the threshold to apply when starting a new sentence. For example, a boundary may be accepted only if its probability is greater than or equal to 50%.
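
The re-assembly step with a threshold can be sketched as follows. assembleSentences() is a hypothetical function written for this illustration; it takes the token texts and one probability per token, and it is not the library's API.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative re-assembly (not the library's implementation).
 *
 * @param list<string> $values        token texts in input order
 * @param list<int>    $probabilities boundary probability (0-100) per token
 * @return list<string>
 */
function assembleSentences(array $values, array $probabilities, int $threshold = 50): array
{
    $sentences = [];
    $current = '';

    foreach ($values as $i => $value) {
        $current .= $value;
        // Close the sentence once the boundary probability reaches the threshold.
        if ($probabilities[$i] >= $threshold) {
            $sentences[] = trim($current);
            $current = '';
        }
    }

    // Keep any trailing text that never reached a boundary.
    if (trim($current) !== '') {
        $sentences[] = trim($current);
    }

    return $sentences;
}

$values = ['Hello', ' ', 'Waldo', '.', ' ', 'He', ' ', 'likes', ' ', 'dogs', '.'];
$probabilities = [0, 0, 0, 75, 0, 0, 0, 0, 0, 0, 100];

$loose = assembleSentences($values, $probabilities, 50);  // two sentences
$strict = assembleSentences($values, $probabilities, 80); // one sentence
```

With a threshold of 50 the period rated 75 ends a sentence; with a threshold of 80 it does not, so the whole text stays together.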


License

MIT. See LICENSE file.


All versions of sentence-breaker with dependencies

Requires: php ^8.0, ext-simplexml *