Extractor API

The Extractor API defines the interface that is used to extract all the entities of the CP. All extractors should implement this interface, whose methods are described below:

  • The init method is invoked to initialize the connection to the specified data source. It also loads all the entity configurations from the metadata files in the CP folder. The endpoint argument provides the information needed to connect to the data source.
  • The checkConnection method is called to check whether the connection to the data source can be established or not.
  • The extract method does the actual extraction work for all the entities. The entities can be extracted in sequence or in parallel, depending on the implementation of the extractor. The status is returned once the task finishes or fails.
  • The abort method aborts an extraction that is in progress. It returns true if the execution was aborted; otherwise, it returns false.
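
The four methods described above can be sketched as a Java interface. This is only an illustration: the parameter types, return types, the ExtractStatus enum, the "url" endpoint key, and the InMemoryExtractor class are assumptions made for this example and are not taken from the actual IExtractor definition.

```java
import java.util.Map;

// Hypothetical status type; the real framework may report status differently.
enum ExtractStatus { FINISHED, FAILED }

// Hypothetical stand-in for the framework's ExtractorException.
class ExtractorException extends Exception {
    ExtractorException(String message) { super(message); }
}

// Sketch of the contract described above; all signatures are assumptions.
interface IExtractor {
    void init(Map<String, String> endpoint) throws ExtractorException;
    boolean checkConnection() throws ExtractorException;
    ExtractStatus extract() throws ExtractorException;
    boolean abort();
}

// Minimal illustrative implementation that keeps everything in memory.
class InMemoryExtractor implements IExtractor {
    private Map<String, String> endpoint;
    private volatile boolean running = false;

    @Override
    public void init(Map<String, String> endpoint) {
        // Remember the connection information for later calls.
        this.endpoint = endpoint;
    }

    @Override
    public boolean checkConnection() {
        // "url" is an assumed key name used only in this sketch.
        return endpoint != null && endpoint.containsKey("url");
    }

    @Override
    public ExtractStatus extract() {
        running = true;
        try {
            return checkConnection() ? ExtractStatus.FINISHED : ExtractStatus.FAILED;
        } finally {
            running = false;
        }
    }

    @Override
    public boolean abort() {
        // Returns true only if an extraction was in progress when abort was called.
        boolean wasRunning = running;
        running = false;
        return wasRunning;
    }
}
```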

The BaseExtractor class implements the IExtractor interface and handles the tasks that are common to all extractors, such as initializing the extractor, loading the extraction model and source models, loading the settings for each extractor, and handling status persistence. It is therefore recommended that you base customized extractors on BaseExtractor.

Generally, all extractors that extend BaseExtractor should override the following methods:

1. getPlatformVersion()
2. checkConnection() throws ExtractorException
3. doExtract(String batchId, Map<String,String> lastModifiedMap, List<DcsEntity> entities) throws ExtractorException, InterruptedException

The getPlatformVersion method indicates which platform version the current extractor is built for. The version number should have 3 parts: <major version>, <minor version> and <patch version>. For ITBA 10.10, the platform version is 10.10.0.
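
A quick way to check that a version string has three parts is shown below. This is a minimal sketch that assumes the parts are non-negative integers separated by dots, as in 10.10.0; the PlatformVersions class is a hypothetical helper and not part of the framework.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the DCS framework.
class PlatformVersions {
    // Three groups of digits separated by dots: <major>.<minor>.<patch>
    private static final Pattern THREE_PART = Pattern.compile("\\d+\\.\\d+\\.\\d+");

    static boolean isValid(String version) {
        return version != null && THREE_PART.matcher(version).matches();
    }
}
```

With this helper, "10.10.0" is accepted, while "10.10" is rejected because the patch version is missing.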

The checkConnection and the doExtract methods are the most important methods that you must implement:

  • The checkConnection method is used to test the connection when a new data source is added.
  • The doExtract method extracts the data when you click Start ETL or when ETL is triggered by the scheduler.
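
A skeleton of a customized extractor that overrides the three methods might look like the following. The BaseExtractor, DcsEntity, and ExtractorException classes here are simplified stand-ins so that the example compiles on its own; the real framework classes have more members. The return types (boolean for checkConnection, void for doExtract), the constructor argument, and the SampleExtractor name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; the DCS framework supplies the real types.
class ExtractorException extends Exception {
    ExtractorException(String message) { super(message); }
}

class DcsEntity {
    private final String name;
    DcsEntity(String name) { this.name = name; }
    String getName() { return name; }
}

abstract class BaseExtractor {
    public abstract String getPlatformVersion();
    public abstract boolean checkConnection() throws ExtractorException;
    public abstract void doExtract(String batchId,
                                   Map<String, String> lastModifiedMap,
                                   List<DcsEntity> entities)
            throws ExtractorException, InterruptedException;
}

// A customized extractor that overrides the three methods.
class SampleExtractor extends BaseExtractor {
    // Illustrative flag that simulates whether the data source can be reached.
    private final boolean sourceReachable;
    private final List<String> extractedNames = new ArrayList<>();

    SampleExtractor(boolean sourceReachable) {
        this.sourceReachable = sourceReachable;
    }

    @Override
    public String getPlatformVersion() {
        return "10.10.0"; // <major>.<minor>.<patch>
    }

    @Override
    public boolean checkConnection() throws ExtractorException {
        if (!sourceReachable) {
            throw new ExtractorException("Data source is not reachable");
        }
        return true;
    }

    @Override
    public void doExtract(String batchId,
                          Map<String, String> lastModifiedMap,
                          List<DcsEntity> entities)
            throws ExtractorException, InterruptedException {
        // lastModifiedMap is not used in this sketch.
        for (DcsEntity entity : entities) {
            // Stop promptly if the executing thread has been interrupted.
            if (Thread.interrupted()) {
                throw new InterruptedException("Extraction interrupted in batch " + batchId);
            }
            // A real extractor would read data for the entity here.
            extractedNames.add(entity.getName());
        }
    }

    List<String> getExtractedNames() {
        return extractedNames;
    }
}
```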

Every connection check or data extraction runs as a separate process and cannot affect other processes. This means that fields cannot be shared between different batches of extractor execution. For example, suppose that you want to count the number of executions in your extractor, so you define a non-static field named count and increase it in the doExtract method. You cannot get the correct result, because a new extractor class instance is created for each extraction and the count value is always 0 at the beginning.
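
The counting pitfall can be reproduced with a small stand-alone sketch. CountingExtractor and BatchRunner are hypothetical classes written for this example; BatchRunner only imitates the behavior described above by creating a new instance for each batch.

```java
// Hypothetical extractor that tries to count its own executions.
class CountingExtractor {
    private int count = 0; // instance field: starts at 0 for every new instance

    int doExtract() {
        count++;
        return count;
    }
}

// Imitates a framework that creates a new extractor instance per batch.
class BatchRunner {
    static int runBatch() {
        CountingExtractor extractor = new CountingExtractor();
        return extractor.doExtract();
    }

    public static void main(String[] args) {
        for (int batch = 1; batch <= 3; batch++) {
            // The counter never grows beyond 1 because each batch gets a fresh instance.
            System.out.println("batch " + batch + " count = " + runBatch());
        }
    }
}
```

Every call to runBatch returns 1, no matter how many batches have run before. Any state that must survive between batches would therefore have to be kept outside the extractor instance.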

The extractor class is dynamically loaded by the DCS framework. Therefore, you can simply replace the extractor JAR file, and your changes take effect in the next batch execution.
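
How the DCS framework loads the class is not described here. As a general illustration of the technique, the following sketch uses the standard java.net.URLClassLoader to load a class from a JAR file with a new class loader for each run, so that a replaced JAR file is picked up the next time. The ExtractorLoader class and its withFreshInstance method are hypothetical names, and this is not the framework's actual implementation.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.function.Function;

// Hypothetical helper that illustrates per-run class loading.
class ExtractorLoader {
    // Loads className from jarFile with a new class loader, creates an instance
    // through its no-argument constructor, passes it to action, and then closes
    // the loader.
    static <T> T withFreshInstance(Path jarFile, String className, Function<Object, T> action)
            throws Exception {
        URL[] urls = { jarFile.toUri().toURL() };
        try (URLClassLoader loader =
                     new URLClassLoader(urls, ExtractorLoader.class.getClassLoader())) {
            Class<?> cls = Class.forName(className, true, loader);
            Object instance = cls.getDeclaredConstructor().newInstance();
            return action.apply(instance);
        }
    }
}
```

Because a new loader is created on every call, a class that exists only in the JAR file is read again from the current JAR file contents each time. Classes that are also visible to the parent class loader are resolved there first and are not reloaded.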