ETL processes for massive Data Ingestion and Transformation

 

 

ETL_name, file format

Type

Protocol/ Standard

Data License

Link on https://www.snap4city.org/download

Florence_firstAid_accesses_HTML

HTML

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_firstAid_accesses_HTML/

Florence_Parking_JSON (static & realTime)

Json

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_Parking_JSON/

Florence_Weather_XML

XML

HTTP

Arpat, Tuscany region

https://www.snap4city.org/download/snap4cityETL/Florence_Weather_XML/

Florence_Pharmacies_CSV

Csv

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_Pharmacies_CSV/

Helsinki_youth_subsidies_XLS

XLS

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Helsinki_youth_subsidies_XLS/

Electric_vehicle_charging_kmz

Kmz

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Electric_vehicle_charging_kmz/

Bike_Sharing_Areas_Shp

Shape

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/Bike_Sharing_Areas_Shp/

Tpl_bus_gtfs

gtfs

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/TPL_bus_gtfs/

Smartbench

OneM2M

M2M

Under Comune di Firenze authorization

https://www.snap4city.org/download/snap4cityETL/Smartbench_M2M/

LinkedData

N3

HTTP

Open data

https://www.snap4city.org/download/snap4cityETL/LinkedData/

Florence_School_canteen

Shape

FTP

Open data

https://www.snap4city.org/download/snap4cityETL/Florence_School_Canteen_FTP/

Tuscany_parking

Datex II

SOAP,

 Rest API

Under MIIC authorization

https://www.snap4city.org/download/snap4cityETL/Tuscany_parking_Datex%20II/

via_francigena_farmhouse_GeoJson

GeoJson

Rest API, JDBC

Under Regione Toscana authorization

https://www.snap4city.org/download/snap4cityETL/via_francigena_farmhouse_GeoJson/

sigFOX

SigFOX

https

DISIT lab

https://www.snap4city.org/download/snap4cityETL/SigfoxSM/

From_KM4cityKB_to_Datagate

From HBASE to csv

HTTP

DISIT lab

https://www.snap4city.org/download/snap4cityETL/from_KM4cityKB_to_Datagate/

 

ETL

Source:

ETL Description

Florence_firstAid_accesses_HTML

 

http://www.asf.toscana.it/estar/accessi-internet.php

This ETL (static)

  • Read data coming from a web page (html)
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the interesting information

Florence_Parking_JSON (static & realTime)

http://opendata.comune.fi.it/od/ParkInfo_Firenze_SMN.json

This ETL is composed of two phases:

STATIC phase:

  • Read data coming from a web server in a JSON format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the STATIC data (Phone number, fax, address, city, country, zipcode, Opening hours, parking fee, available places, latitude, longitude)

REAL TIME phase:

  • Read data coming from a web server in a JSON format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the REAL TIME data (free spot, update date)

 

Florence_Weather_XML

Arpat, Tuscany region

This ETL (REAL TIME):

  • Read data coming from a web server in an xml format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from html to a csv file with only the REAL TIME data (sunrise, sunset, sun_altitude, min_temperature, max_temperature, etc.) and add some information (date)

Florence_Pharmacies_CSV

http://opendata.comune.fi.it/od/turno_farmacie.csv

This ETL (static)

  • Read data coming from a web server in a csv format
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file with only some relevant data (e.g. Pharmacy name, street, civic number, opening times, etc.)

Helsinki_youth_subsidies_XLS

http://nk.hel.fi/avoindata/avoin_data-avustukset.xls

This ETL (static)

  • Read data coming from a web server in a XLS format
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file containing (e.g. Beneficiary name, Office, task, etc.)

Electric_vehicle_charging_kmz

http://datigis.comune.fi.it/kml/ColonnineRicarica.kmz

This ETL is composed of two phases:

REAL TIME phase:

  • Read data coming from a web server in a kmz format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from kmz to a csv file with only the STATIC data (charging station identifier)

REAL TIME phase:

  • Read data coming from a web server in a kmz format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from kmz to a csv file with only the REAL TIME data (charging station status)

Bike_Sharing_Areas_Shp

http://datigis.comune.fi.it/shp/bike_sharing.zip

This ETL (static)

  • Read data coming from a web server in a Shape (SHP) format
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file containing static data (e.g. identifier, address, station name, etc.)

Tpl_bus_gtfs

http://dati.toscana.it/dataset/8bb8f8fe-fe7d-41d0-90dc-49f2456180d1/resource/954f2767-b803-4a41-baaa-37ccc8beb163/download/amvbus.gtfs

This ETL (periodic)

  • Read data coming from a web server in a GTFS format
  • Write the (unzipped) data as they are downloaded in the VM file system (gtfs standard files: agency.txt, calendar_dates.txt, routes.txt, shapes.txt, stops.txt, stop_times.txt, trips.txt)

Smartbench

https://icon-lab.tim.it/onem2m/firenze/smartbench/bench1/la

This ETL (REAL TIME):

  • Read data coming from a web server in a json format
  • Write the data as they are downloaded in the VM file system
  • Transform the data from json to a csv file with only the REAL TIME data (creation time, temperature, humidity, pressure, transits, sittings, pollution, etc.)

LinkedData

http://linkeddata.comune.fi.it:8080/all

This ETL (static)

  • Read data coming from a web server in rdf
  • Write the data as they are downloaded in the VM file system (triples in rdf)

Florence_School_canteen

Disit FTP

This ETL (static)

  • Read data coming from a web page in shape
  • Write the data as they are downloaded in the VM file system (unzipped: dbf, fix, prj, qix, shp, shx files)

Tuscany_parking

http://www501.regione.toscana.it/osservatoriotrasporti/

This ETL (static)

  • Read data coming from a web server in shape
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write them in an n3 file (triples connecting the dataset to the KM4city Ontology)

via_francigena_farmhouse_GeoJson

http://www306.regione.toscana.it/

This ETL (static)

  • Read data coming from a web page (geojson)
  • Write the data as they are downloaded in the VM file system
  • Transform the data and write the results on a csv file containing static data (e.g. identifier, name, province, municipality, type, coordinates, etc.)

sigFOX

https://backend.sigfox.com/api/devices/

This ETL (REAL TIME):

  • Read data coming from a web server in a json format
  • Write the data as they are downloaded in the VM file system (data, device id, temperature, coordinates, etc.)

From_KM4cityKB_to_Datagate

Disit knowledge BASE

This ETL (REAL TIME):

  • Make a query on the Disit database (HBase)
  • Download a set of data and: i) write them in the Virtual Machine file system; ii) use the DataGate API and publish the data as a new new file in a dataset on the Datagate Portal