depa.store

This service provides access to document artifacts stored in Depa.Tech. Use this service, for example, to download PDF files, XML files and individual page files.

General

The Depa.Tech Content Delivery Service (CDS) is a REST service that provides access to information stored in the Depa.Tech Document Store. Depending on the publishing office, we store a variety of artifacts, some directly from the office itself, others generated during the import process.
It is possible to download either single artifacts, all artifacts of a single document or all artifacts of a list of documents. Multiple artifacts/documents are downloaded as ZIP archives.

Information Required

The only information required to access the CDS is a valid Depa.Tech document ID (such as DE.000112014003467.A5). Accessing the endpoint of a document directly will return a list, in JSON format, of all artifacts we currently hold for that particular document.

Authentication

To use the CDS, authentication is required, i.e. customers need an account for the depatech-proxy. Please send an email to marc.haus@mtc.berlin if you need an account.

Naming Conventions

To allow for uniform access to the various publishing offices that Depa.Tech supports, a naming convention has been employed for common artifacts. For example, the PDF is always called DOCUMENT.PDF, the XML DOCUMENT.XML, and the individual page files PAGEnnnn. This conforms to the standard DEPAROM naming conventions. See http://deparom.de for more information about DEPAROM. A DEPAROM client is not required to use CDS.
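
As a small illustration of the page naming, the sketch below builds a PAGEnnnn artifact name from a page number. It assumes that nnnn is a four-digit, zero-padded, 1-based page number, which is inferred from names such as PAGE0001 and is not stated explicitly here. The helper name page_artifact_name is hypothetical.

```python
def page_artifact_name(page_number: int) -> str:
    """Return the artifact name for a page, e.g. PAGE0001 for page 1."""
    if page_number < 1:
        raise ValueError("page numbers are assumed to start at 1")
    # Assumption: nnnn is the page number, zero-padded to four digits.
    return f"PAGE{page_number:04d}"


print(page_artifact_name(1))
print(page_artifact_name(12))
```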

Artifacts Stored in CDS

Currently, Depa.Tech holds the following artifacts:

  • Office XML
    The XML source from the publishing office, containing at minimum the bibliographic data. For some offices, full text is available. See table below.
  • Office PDF
    The PDF file from the publishing office.
  • PAGEnnnn files
    Each page as single page TIFF in CCITT format (single bit, black and white) at 300 DPI. Generally rendered from the Office PDF.
  • Embedded Images
    Images and drawings in TIFF format. Naming convention depends on publishing office. Used in combination with the XML.
  • MTC JSON
    The document XML in JSON format. The artifact name is mtc.json.
  • MTC Simple JSON
    A simplified JSON format that complies more closely with DEPAROM. For most uses, the MTC JSON format is more appropriate. The artifact name is mtc.simple.json.
  • MTC Artifacts JSON
    A file in JSON format containing a list of all available artifacts. The file is called artifacts.json. It is similar in usage to what is returned by the List Artifacts endpoint. This file also contains structural information about which document sections are on which pages.
  • MTC Sources JSON
    A file in JSON format with information about which sources were used for this document. There are one or more such files, named mtc_source.json, or mtc_source_<source>.json if a document has more than one source.

REST API

General Error Responses

The REST API returns the following general error codes. Other responses are listed in the tables further down.

| HTTP Code | Reason | Comments |
|---|---|---|
| 404 | Document Not Found | Returned if a document is requested that is not in the store or if the document ID is malformed. Also returned for non-existent artifacts. |
| 500 | Internal Server Error | This error code indicates that something went wrong during the request. If errors persist, please contact MTC. |

Service Endpoints

| Action | Verb | Path | Body | Response | Comment |
|---|---|---|---|---|---|
| List Artifacts | GET | /cds/:docid | | Returns a JSON response. 200 OK | :docid is a Depa.Tech document ID. Lists the available artifacts of a document. |
| Download Artifact | GET | /cds/:docid/:artifact | | Response depends on artifact MIME type. 200 OK | Downloads the artifact directly. :artifact is the name of the artifact as listed in the artifacts.json file or the List Artifacts endpoint. |
| Download all Artifacts | GET | /cds/zip/:docid | | A ZIP stream containing all artifacts of a document. 200 OK | Downloads all available artifacts of a document as a ZIP file. The filename is generated and starts with cds- |
| Bulk Download | POST | /cds/zip | Request must contain a JSON body describing the documents to download. | Response is a ZIP stream. 200 OK | Downloads all artifacts of all documents requested as a single ZIP file. The artifacts of each document are stored in a separate folder. If a document is not available, then that document will be missing from the ZIP archive. There will be no error code in this case. |
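
The GET paths above can be assembled as in the following sketch. The base URL is a placeholder, since the real host name is not given here, and the function names are hypothetical. Authentication is left out because the scheme used by the depatech-proxy is not described in this document.

```python
from urllib.parse import quote

# Hypothetical base URL; the real host name is not given in this document.
BASE_URL = "https://cds.example.invalid"


def list_artifacts_url(docid: str, base: str = BASE_URL) -> str:
    """URL for GET /cds/:docid (List Artifacts)."""
    return f"{base}/cds/{quote(docid, safe='')}"


def artifact_url(docid: str, artifact: str, base: str = BASE_URL) -> str:
    """URL for GET /cds/:docid/:artifact (Download Artifact)."""
    return f"{base}/cds/{quote(docid, safe='')}/{quote(artifact, safe='')}"


def zip_url(docid: str, base: str = BASE_URL) -> str:
    """URL for GET /cds/zip/:docid (Download all Artifacts)."""
    return f"{base}/cds/zip/{quote(docid, safe='')}"


docid = "DE.000112014003467.A5"
print(list_artifacts_url(docid))
print(artifact_url(docid, "DOCUMENT.PDF"))
print(zip_url(docid))
```

The resulting URLs can be passed to any HTTP client once credentials have been configured for it.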

JSON Formats

Response from List Artifacts

| Field | Format | Usage / Comments |
|---|---|---|
| artifacts | JSON array of strings | Contains file names within the requested directory |
| container | String | Actual container name in the store |
| docid | String | docid without revision suffix |

Example


{
    "container": "DE.000112014003467.A5",
    "docid": "DE.000112014003467.A5",
    "artifacts": [
        "artifacts.json",
        "DOCUMENT.PDF",
        "DOCUMENT.XML",
        "mtc.json",
        "mtc.simple.json",
        "mtc_source.json",
        "PAGE0001"
    ]
}
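
A response of this shape could be processed as in the sketch below, which picks out the page files and checks for the common artifacts. The function name summarize_listing and the keys of the summary it returns are hypothetical. The sample data is shortened from the example above.

```python
import json

# Shortened copy of the List Artifacts example response.
response_text = """
{
    "container": "DE.000112014003467.A5",
    "docid": "DE.000112014003467.A5",
    "artifacts": ["artifacts.json", "DOCUMENT.PDF", "DOCUMENT.XML", "PAGE0001"]
}
"""


def summarize_listing(listing: dict) -> dict:
    """Summarize which common artifacts a document offers."""
    names = listing["artifacts"]
    return {
        "docid": listing["docid"],
        "has_pdf": "DOCUMENT.PDF" in names,
        "has_xml": "DOCUMENT.XML" in names,
        # Page files follow the PAGEnnnn naming convention.
        "pages": sorted(n for n in names if n.startswith("PAGE")),
    }


summary = summarize_listing(json.loads(response_text))
print(summary)
```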


		

Request JSON for Bulk Download

| Field | Format | Usage / Comments |
|---|---|---|
| docids | JSON array of strings | List of docids. All documents will be added to the ZIP file. |

Example


{
    "docids": [
        "EP.000000002678869.A2",
        "DE.000202011110740.U1"
    ]
}
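
The sketch below builds a request body in the shape of the example above and, because unavailable documents are silently left out of the archive, checks a returned ZIP for missing documents. It assumes that each document's folder inside the ZIP is named after its docid, which is not stated in this document. The function names are hypothetical.

```python
import io
import json
import zipfile


def bulk_body(docids) -> bytes:
    """Serialize the JSON body for POST /cds/zip."""
    return json.dumps({"docids": list(docids)}).encode("utf-8")


def missing_docids(requested, zip_bytes: bytes) -> list:
    """Return the requested docids that have no entries in the ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: the top-level folder of each entry is the docid.
        folders = {name.split("/", 1)[0] for name in archive.namelist()}
    return [docid for docid in requested if docid not in folders]


requested = ["EP.000000002678869.A2", "DE.000202011110740.U1"]
print(bulk_body(requested).decode("utf-8"))

# Stand-in for a server response: an in-memory ZIP holding one document only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("EP.000000002678869.A2/DOCUMENT.PDF", b"placeholder")
print(missing_docids(requested, buffer.getvalue()))
```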


		

MTC Artifacts JSON

| Field | Format | Usage / Comments |
|---|---|---|
| id | JSON String | Depa.Tech ID of the corresponding document. |
| artifacts | List of JSON Strings | List of available artifacts, as returned by the List Artifacts endpoint. |
| sections | List of JSON Dictionaries | In each dictionary, the field section holds the section name and the fields start and end denote the start and end page numbers of that section. |

Depa.Tech supports the following sections:

  • Title
    The title page(s) of the document

  • Abstract
    The pages that contain the abstract. Usually the same as Title. Sometimes missing, depending on the data source.

  • Drawing
    The pages that contain drawings. Is not always present, for example if the document has no drawings.

  • Claim
    The pages that contain the claims. Is not always present.

  • Description
    The pages that contain the description. Is not always present.

The presence of a Claim or Description section does not necessarily mean that the XML document is full text.

Example


{
    "id": "DE.000112014003467.A5",
    "artifacts": [
        "DOCUMENT.PDF",
        "DOCUMENT.XML",
        "mtc.json",
        "mtc.simple.json",
        "mtc_source.json",
        "PAGE0001",
        "artifacts.json"
    ],
    "sections": [
        {
            "section": "Title",
            "start": 1,
            "end": 1
        }
    ]
}
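
One possible use of the sections list is to find the page files that belong to a section, as sketched below. The sketch assumes that start and end are 1-based and inclusive, and that page files are named with a four-digit, zero-padded page number. Both assumptions are inferred from the example and are not stated here. The function name pages_for_section is hypothetical.

```python
# Shortened copy of the artifacts.json example.
artifacts_doc = {
    "id": "DE.000112014003467.A5",
    "artifacts": ["DOCUMENT.PDF", "PAGE0001", "artifacts.json"],
    "sections": [{"section": "Title", "start": 1, "end": 1}],
}


def pages_for_section(doc: dict, section_name: str) -> list:
    """Return the PAGEnnnn artifact names covered by a section."""
    names = []
    for entry in doc.get("sections", []):
        if entry["section"] != section_name:
            continue
        # Assumption: start and end are 1-based, inclusive page numbers.
        for page in range(entry["start"], entry["end"] + 1):
            names.append(f"PAGE{page:04d}")
    return names


print(pages_for_section(artifacts_doc, "Title"))
print(pages_for_section(artifacts_doc, "Claim"))
```

A section that is not listed simply yields an empty list, which matches the note that sections are not always present.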


		

MTC Sources JSON

| Field | Format | Usage / Comments |
|---|---|---|
| sourceUri | JSON String | Location of the source during input |
| container | JSON String | Name of the store container |
| entries | List of JSON Strings | List of files used during import |
| inputTime | Timestamp | Time document was imported |
| host | JSON String | Name of host document was imported on |
| user | JSON String | Name of user that imported the document |
| sourceName | JSON String | Name of the source in Depa.Tech speak |

Example


{
    "sourceUri": "file:/usr/local/mtc/depatech/iso/DPMAdatenabgabe/2016/2016015/ST36/DEA2016015/",
    "container": "DE.000112014003467.A5",
    "entries": [
        "DE112014003467A5.xml"
    ],
    "inputTime": 1476179332932,
    "host": "depatech01.mtc.berlin",
    "ip": "192.168.102.189",
    "user": "depatech",
    "sourceName": "DEA2016015"
}
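
The inputTime value in the example looks like a count of milliseconds since the Unix epoch. This is an assumption based on its magnitude, as the timestamp unit is not specified here. Under that assumption, it could be converted to a UTC datetime as follows. The function name input_time_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def input_time_utc(source_doc: dict) -> datetime:
    """Convert inputTime to an aware UTC datetime."""
    # Assumption: inputTime is milliseconds since 1970-01-01T00:00:00Z.
    return EPOCH + timedelta(milliseconds=source_doc["inputTime"])


source_doc = {"inputTime": 1476179332932, "sourceName": "DEA2016015"}
print(input_time_utc(source_doc).isoformat())
```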