TC6.9 - ETL processes for multi-protocol and multi-format data ingestion; see GitHub for the library

Test Case Title

TC6.9 - ETL processes for multi-protocol and multi-format data ingestion; see GitHub for the library

Goal

Access a large library of ETL processes for smart city on GitHub; connection with the dashboards via HTTPS/HTTP and LDAP.

Verify that the ETL can manage multiple protocols and multiple formats.

Prerequisites

A PC or mobile device with a web browser.

Expected successful result

A list of the ETLs available on the GitHub portal.

Steps

 

See Supported Protocols for the full list of supported protocols.

Please note that to correctly perform this Test Case you need access to the ETL Virtual Machine, as described below. To obtain access to a Virtual Machine for performing the ETLs, please contact snap4city@disit.org.

 

Users roles for which this test case is available: public user (all people)

Example 1: Access to a large library of ETL processes for smart city from GitHub


Fig. – Disit GitHub home page.

 

Fig. – Disit GitHub search for ‘snap4city’.
 


Example 2: Verify that the ETL can manage multiple protocols and multiple formats

 
To verify that the ETLs can manage multiple protocols and multiple formats, it is useful to see the table in TC6.5-ETL-heterogeneous, which gives a detailed list of the Snap4City ETLs with the different formats and protocols each one manages. Moreover, it is possible to open some ETLs (hereafter only some of them, but the same check can be done on all the available ETLs) in order to verify this feature.
 
ETL1: Helsinki_youth_subsidies (Protocol: HTTP, input file format: XLS, output file format: CSV)

  • A set of sample ETLs is already present in the VM; for example, we can choose ‘Helsinki_youth_subsidies’, which can: i) download a dataset; ii) transform it; iii) save the new data in a file (in the VM file system). To verify the ETL functionalities, follow these steps:
    • Click on the menu, select ‘File > Open’ and open the file ‘Ubuntu/Desktop/snap4cityETL/Helsinki_youth_subsidies_XLS/Ingestion/Maink.kjb’
    • The main job appears in the Spoon interface.


Fig.: Spoon Interface: open the ‘Helsinki_youth_subsidies_XLS’ ETL.


Fig.: Spoon Interface:  the ‘Helsinki_youth_subsidies_XLS’ ETL opened.

  • To run the ETL process, click on the main ‘Run’ button; a popup opens with a set of default values (including the URL from which to download the dataset). Then click the second ‘Run’ button.


Fig.: Spoon Interface: launch an ETL (‘Run button’).

 

  • Once the process is launched, a set of windows opens. A relevant one is the ‘Logging’ window, from which it is possible to read the logs. When the logging window reports the sentence ‘Spoon – job has ended’, the job has ended with no errors.


Fig.:  Spoon Interface: ETL executes, logging details.

  • Open the ‘HTTP’ step to see how the connection with the web server is realized:
    • Right-click the ‘HTTP’ icon and select ‘edit’.

Fig.:  ETL execution: edit the HTTP step.

  • The URL field contains the direct URL used to download the dataset (or a parameter that can come from the job settings, from the database, from a configuration file, etc.)
  • The Target file field contains the file path to which the data coming from the web server are written


Fig.:  ETL execution: HTTP step details.
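The behaviour of the HTTP step (a URL in, a target file out) can be sketched in plain Python. This is an illustrative stand-in for the Kettle step, not its actual implementation; the `url` and `target_file` arguments correspond to the two fields shown in the step dialog.

```python
# Illustrative stand-in for the Spoon 'HTTP' step: download a URL and
# write the response body to a target file. Not the real Kettle code;
# url and target_file mirror the two fields in the step dialog.
import urllib.request

def http_step(url: str, target_file: str) -> None:
    """Fetch `url` and save the raw bytes to `target_file`."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(target_file, "wb") as f:
        f.write(data)
```

As in the real step, the URL could instead come from a job parameter; here it is simply a function argument.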

  • What actions does this process perform, and how can they be verified? Some blocks can be expanded with a double left click or a single right click of the mouse.
    • Activities:
      • block ‘Start’: starts the process (see its details with: ‘right click > edit’),
      • block ‘Create original data folder’:
        • if it does not exist, creates a new folder for the downloaded file: ‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Original_data’
      • block ‘HTTP’: makes the HTTP REST request (to ‘http://www.nk.hel.fi/avoindata/avoin_data-avustukset.xls’) and saves the XLS file (named ‘avoin_data-avustukset.xls’) in the folder created in the previous block
      • block ‘File Exists’: checks whether the downloaded file is present
      • block ‘Create Elaborated data folder’: creates ‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data’
      • block ‘DUMMY’: does nothing
      • block ‘process data’: reads the downloaded file, performs the data analysis, and saves the result in a new file: ‘/home/ubuntu/Desktop/ETL_Output/Helsinki_youth_subsidies_XLS/Elaborated_data/Result_data.csv’
      • block ‘Success’: ends the process.
    • How to verify:
      • Browse to the folders in the file system and verify the presence of the files; you can also delete them and launch the ETL process again.


Fig.: How to verify the Helsinki_youth_subsidies_XLS Results.
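The sequence of blocks listed above (create folders, download, check, process, save) can be mirrored by a short Python sketch. The folder paths below are shortened placeholders (the real job writes under ‘/home/ubuntu/Desktop/ETL_Output/…’ in the VM), and the ‘process data’ step is reduced to a dummy transformation, since the real job does its analysis with Kettle steps.

```python
# Sketch of the job's control flow: Start -> create folders -> HTTP ->
# File Exists -> process data -> Success. Paths are placeholders, not
# the VM's real ETL_Output folders.
import os
import urllib.request

ORIGINAL = "/tmp/ETL_Output/Original_data"      # 'Create original data folder'
ELABORATED = "/tmp/ETL_Output/Elaborated_data"  # 'Create Elaborated data folder'

def run_job(url: str) -> str:
    os.makedirs(ORIGINAL, exist_ok=True)
    raw = os.path.join(ORIGINAL, os.path.basename(url))
    urllib.request.urlretrieve(url, raw)        # 'HTTP' block
    if not os.path.exists(raw):                 # 'File Exists' block
        raise RuntimeError("download failed")
    os.makedirs(ELABORATED, exist_ok=True)
    out = os.path.join(ELABORATED, "Result_data.csv")
    # dummy 'process data': the real job parses the XLS and analyses it
    with open(raw, "rb") as src, open(out, "w") as dst:
        dst.write("bytes,%d\n" % len(src.read()))
    return out                                  # 'Success'
```

Deleting the output files and calling `run_job` again reproduces the verification suggested above.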

 

ETL2: Florence_School_Canteen_FTP (Protocol: FTP, input file format: shapefile)

  • Follow the same steps as in the previous ‘Helsinki_youth_subsidies_XLS’ case
  • Open the main job:
    • An FTP step is present; in this case the data come from the Disit FTP server and are downloaded to the virtual machine



Fig.: Details on the FTP step.
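By analogy with the HTTP case, the FTP step can be sketched with Python's standard `ftplib`. The host, credentials and paths are placeholders, not the real Disit FTP details; they stand for the values shown in the step dialog.

```python
# Hedged sketch of the 'FTP' step: log in to an FTP server and download
# one remote file to the local (VM) file system. All connection details
# are placeholders to be replaced with the values in the step dialog.
from ftplib import FTP

def ftp_step(host: str, user: str, password: str,
             remote_path: str, local_path: str) -> None:
    """Retrieve `remote_path` from the server into `local_path`."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "wb") as f:
            ftp.retrbinary("RETR " + remote_path, f.write)
```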

ETL3: Florence_firstAid_accesses_HTML (Protocol: HTTP, input file format: HTML, output file format: CSV; another format used in the ETL to transform data: XML)

  • Follow the same steps as in the previous ‘Helsinki_youth_subsidies_XLS’ case and open the main job (‘Florence_firstAid_accesses_HTML/Ingestion/main.kjb’)
  • In the ‘Download_data’ transformation there are:
    • an HTTP step, downloading the data from a web server


Fig.: Details on HTTP step.
 

  • a set of transformations on the strings
  • a block saving the data in an XML file


Fig.: data saved in an xml file.

 

  • In the ‘process data’ transformation there are:
    • a set of blocks working on the strings, starting from the XML file saved in the previous transformation
    • a final step saving the transformed data in a CSV output file
       

      Fig.: transformations on data saved in a CSV output file.
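The two-stage flow just described (strings staged as an XML file, then rewritten as CSV) can be illustrated with the standard library. The tag and column names below are invented for the example; they are not the real ETL's field names.

```python
# Illustrative two-stage flow: stage extracted (label, value) string
# pairs in an XML file, then read the XML back and emit CSV rows.
# Tag/column names are invented for the example.
import csv
import xml.etree.ElementTree as ET

def html_rows_to_xml(rows, xml_path):
    """Stage a list of (label, value) pairs as a simple XML file."""
    root = ET.Element("rows")
    for label, value in rows:
        row = ET.SubElement(root, "row")
        row.set("label", label)
        row.set("value", value)
    ET.ElementTree(root).write(xml_path)

def xml_to_csv(xml_path, csv_path):
    """Read the staged XML back and emit one CSV line per row."""
    tree = ET.parse(xml_path)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "value"])
        for row in tree.getroot():
            writer.writerow([row.get("label"), row.get("value")])
```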

ETL4: via_francigena_farmhouse_GeoJson (Protocol: HTTP, input file format: JSON, output file format: CSV)
Follow the same steps as in the previous ‘Helsinki_youth_subsidies_XLS’ case and open the main job (‘via_francigena_farmhouse_GeoJson/Ingestion/main.kjb’)

  • In the ‘Download_data’ transformation there is an HTTP step, downloading the data from a web server

 

  • In the ‘process data’ transformation there are:
    • a set of blocks working on the strings, starting from the downloaded JSON file
    • a final step saving the transformed data in a CSV output file


Fig.: transformations on data saved in a CSV output file.
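The GeoJSON-to-CSV conversion can likewise be sketched as flattening each feature into one CSV row. The `name` property is an assumed field used for illustration, not necessarily a field of the real dataset.

```python
# Hypothetical sketch of the 'process data' idea for a GeoJSON input:
# each feature becomes one CSV row with its point coordinates and a
# 'name' property (an assumed field, for illustration only).
import csv
import json

def geojson_to_csv(geojson_path: str, csv_path: str) -> None:
    with open(geojson_path) as f:
        data = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "longitude", "latitude"])
        for feature in data["features"]:
            lon, lat = feature["geometry"]["coordinates"]
            writer.writerow([feature["properties"].get("name", ""), lon, lat])
```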