Package: arche-ingest
Short description: A set of sample ARCHE ingestion scripts
License: MIT
Homepage: https://github.com/acdh-oeaw/arche-ingest
A collection of ARCHE ingestion script templates
From the perspective of real-world data ingestions, the REST API provided by ARCHE is quite low-level. To make ingestions simpler, the arche-lib-ingest library has been developed. While it provides a convenient high-level data ingestion API, it is still only a library, which requires you to write your own ingestion script.
This repository aims at closing this gap - it provides a set of data ingestion scripts (built on top of arche-lib-ingest) which can be used by people with almost no programming skills.
Scripts provided
There are two script variants provided:
- Console scripts variant, where parameters are passed through the command line.
  The benefit of this variant is ease of use, especially in CI/CD workflows.
  - bin/arche-import-metadata - imports metadata from an RDF file
  - bin/arche-import-binary - (re)ingests a single resource's binary content (to be used when the file name and/or location changed)
  - bin/arche-delete-resource - removes a given repository resource (allows recursion, etc.)
  - bin/arche-delete-triples - removes metadata triples specified in the ttl file (but doesn't remove repository resources)
  - bin/arche-update-redmine - updates a Redmine issue describing the data curation/ingestion process (see the dedicated section at the bottom of the README)
- Template variant, where you adjust execution parameters and/or the way the script works by editing its content.
  The benefit of this variant is that it lets you treat the adjusted script as documentation of the ingestion process and/or adapt it to your particular needs.
  - add_metadata_sample.php - adds metadata triples specified in the ttl file, preserving all existing metadata of repository resources
  - delete_metadata_sample.php - removes metadata triples specified in the ttl file (but doesn't remove repository resources)
  - delete_resource_sample.php - removes a given repository resource (allows recursion, etc.)
  - import_binary_sample.php - imports binary data from the disk
  - import_metadata_sample.php - imports metadata from an RDF file
  - reimport_single_binary.php - reingests a single resource's binary content (to be used when the file name and/or location changed)
Installation & Usage
Runtime environment
The scripts require a PHP runtime with Composer. Alternatively, you can use the acdhch/arche-ingest Docker image (the {pathToDirectoryWithFilesToIngest} will be available at the /data location inside the Docker container):
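A sketch of the invocation, assuming the image's default entrypoint drops you into an interactive shell (the entrypoint details are an assumption):

```bash
# the data directory appears at /data inside the container
docker run --rm -ti -v {pathToDirectoryWithFilesToIngest}:/data acdhch/arche-ingest
```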
Console script variant

- Install with `composer require acdh-oeaw/arche-ingest`.
- Update regularly with `composer update`.
- Run with `{scriptName}`, e.g. as in the sketch below.
- To get the list of available parameters, run the script with its help switch (see the sketch below).
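A minimal sketch of this workflow; the run arguments and the help switch are assumptions, so check the parameter list reported by the scripts themselves:

```bash
# install the package (the scripts land in vendor/bin/)
composer require acdh-oeaw/arche-ingest
# update regularly
composer update
# run a script, e.g. (argument order is an assumption)
vendor/bin/arche-import-metadata metadata.ttl https://arche.acdh.oeaw.ac.at/api login password
# list the available parameters (help switch assumed)
vendor/bin/arche-import-metadata --help
```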
Running inside GitHub Actions
Do not store your ARCHE credentials in the workflow configuration file. Use repository secrets instead (see example below).
A fragment of your workflow's yaml config may look like this:
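A sketch of such a fragment; the secret names, the installation step and the script arguments are assumptions:

```yaml
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: composer require acdh-oeaw/arche-ingest
      - name: import metadata
        run: vendor/bin/arche-import-metadata metadata.ttl https://arche.acdh.oeaw.ac.at/api "$ARCHE_LOGIN" "$ARCHE_PASSWORD"
        env:
          # repository secrets - never hardcode credentials in the workflow file
          ARCHE_LOGIN: ${{ secrets.ARCHE_LOGIN }}
          ARCHE_PASSWORD: ${{ secrets.ARCHE_PASSWORD }}
```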
Running on ACDH Cluster
First, get the arche-ingestion workload console as described here.
Then:

- Run `screen -S mySessionName`.
- Go to your ingestion directory.
- Run scripts using `{scriptName}` (see the sketch below).
- If the script will take long to run, you may safely quit the console with `CTRL+a` followed by `d` and then `exit`.
- To get back to the script log, log into repo-ingestion@hephaistos again and run `screen -r mySessionName`.
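A sketch of the whole sequence; the ingestion directory and the script arguments are placeholders:

```bash
screen -S mySessionName                # start a named screen session
cd /path/to/your/ingestion/directory   # placeholder path
vendor/bin/arche-import-metadata metadata.ttl https://arche.acdh.oeaw.ac.at/api login password
# detach with CTRL+a followed by d, then exit the console;
# after logging in again, reattach with:
screen -r mySessionName
```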
Template variant

- Clone this repository.
- Run `composer update`.
- Adjust the script of your choice.
  - Available parameters are provided at the beginning of the script.
  - Don't adjust anything below the line marked in the script unless you consider yourself a programmer and would like to change the way a script works.
- Run the script with `php {scriptName}`.
- You can consider reading input from a file and/or saving output to a log file, e.g. as in the sketch below (see the section below for hints on the input file format).
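A sketch of these steps; the file names are examples only:

```bash
git clone https://github.com/acdh-oeaw/arche-ingest.git
cd arche-ingest
composer update
# after adjusting the script of your choice, run it:
php import_metadata_sample.php
# reading input from a file and appending output to a log file:
php import_metadata_sample.php < input.txt >> ingestion.log 2>&1
```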
Long runs
If you are performing time-consuming operations, e.g. a large data ingestion, you may want to run scripts in a way that they won't stop when you turn your computer off. You can use `nohup` or `screen` for that:

- `nohup`
  - Run as in the sketch below.
  - If you want to run template script variants that way, you have to prepare the input data file first; its content must match whatever the script reads from standard input.
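A sketch of such nohup invocations; the script names, arguments and file names are examples:

```bash
# console script variant - keeps running after you log out
nohup vendor/bin/arche-import-metadata metadata.ttl https://arche.acdh.oeaw.ac.at/api login password > ingestion.log 2>&1 &
# template script variant reading from a prepared input data file
nohup php import_metadata_sample.php < input.txt > ingestion.log 2>&1 &
```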
- `screen`
  - Start a `screen` session with `screen -S mySessionName`.
  - Then run your commands as usual.
  - Hit `CTRL+a` followed by `d` to leave the `screen` session.
  - You can get back to the `screen` session with `screen -r mySessionName`.
Reporting errors
Create a subtask of the Redmine issue #17641.
- Provide information on the exact location of the ingestion script (including the script file itself) and any other information which may be required to replicate the problem.
- Assign Mateusz and Norbert as watchers.
Using arche-update-redmine in a GitHub workflow
The basic idea is to execute data processing steps in the following way:

- note down the step name, so it can be read in case of a failure
- perform the step
- call arche-update-redmine

and have a separate on-failure job step which makes an arche-update-redmine call noting the failure.
Remarks:
- As a good practice, we should include the GitHub job URL in the Redmine issue note. For that we set up a dedicated environment variable.
- It goes without saying that Redmine access credentials are stored as a repository secret.
- The way you store the main Redmine issue ID doesn't matter, as it's not a secret. Do it any way you want (here we just hardcode it in the workflow using an environment variable).
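A sketch of this pattern; the step names, environment variable names and arche-update-redmine parameters are assumptions (check the script's own parameter list):

```yaml
env:
  REDMINE_ISSUE_ID: 17641   # main Redmine issue ID, hardcoded as an environment variable
jobs:
  ingest:
    runs-on: ubuntu-latest
    env:
      # GitHub job URL to be included in the Redmine issue note
      JOB_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
    steps:
      - name: note down the step name
        run: echo "STEP_NAME=metadata import" >> "$GITHUB_ENV"
      - name: perform the step
        run: vendor/bin/arche-import-metadata metadata.ttl # further parameters omitted
      - name: report progress to Redmine
        run: vendor/bin/arche-update-redmine # parameters omitted - see the script's help
        env:
          REDMINE_TOKEN: ${{ secrets.REDMINE_TOKEN }}
      - name: report the failure to Redmine
        if: failure()
        run: vendor/bin/arche-update-redmine # parameters omitted - see the script's help
        env:
          REDMINE_TOKEN: ${{ secrets.REDMINE_TOKEN }}
```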