Talend & Pentaho Data Integration

Posts

Showing posts from 2013

Pentaho Instaview Demo

November 21, 2013

Pentaho Instaview is a new Enterprise Edition perspective in Pentaho Data Integration version 4.4 that allows you to quickly connect to a data source, define a slice of data to work with, and immediately visualize that data in Pentaho Analyzer. Check out the demo here: http://www.pentaho.com/resources/videos/97/pentaho-instaview-demo

Difference between Jaspersoft Studio & iReport

November 18, 2013

From the FAQ on http://www.jaspersoft.com:

Why is Jaspersoft Doing this? For years our community of developers asked us to support the Eclipse platform due to its popularity and capabilities. This feedback made the decision to build an Eclipse-based report designer easy. Jaspersoft users will benefit from the rich capabilities of the Eclipse platform and Eclipse developers will benefit from a complete open source BI stack to build and deploy their reports. We also aim to create a report design environment that is both powerful and intuitive so that it appeals to both the advanced and the first-time report developer.

So I think this is simply a fork whose goal is to provide the designer as an Eclipse-based application. They also provide the designer as a plugin for Eclipse:

Which Eclipse releases does Jaspersoft Studio work with? The plugin version of Jaspersoft Studio can be installed on Eclipse IDE 3.5 or later. The compatible Eclipse releases are Indigo,

Execution failed : Failed to generate code

October 25, 2013

This error can occur when Talend starts: the studio fails to boot properly and displays the message above.

Cause: the workspace contains a job that was not saved when the Talend application shut down unexpectedly.

Resolution:
1. Create a new folder for another workspace, and select it as the workspace on the start screen instead of the current one.
2. Import your jobs from the old workspace using the 'Import items' option.

Different Match Models in tMap with example

October 08, 2013

Unique match (also called last match): If several matches are found in the Inner Join, only the last row matching the explicit join as well as the filter is added to the output flow.

First match: If several matches are found in the Inner Join, only the first row matching the explicit join as well as the filter is added to the output flow.

All matches: If several matches are found in the Inner Join, all rows matching the explicit join as well as the filter are added to the output flow.

Example:

deptid  sales
1       1000
2       2000
3       3000
2       5000
3       6000

Outputs:

1. Unique match

deptid  sales
1       1000
2       5000
3       6000

2. All matches

deptid  sales
1       1000
2       2000
3       3000
2       5000
3       6000

3. First match

deptid  sales
1       1000
2       2000
3       3000
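The three match models can be sketched outside Talend in a few lines of Python. This is only an illustration of the selection rule, not how tMap is implemented; the function names and the idea of a main flow carrying the keys 1, 2, 3 are assumptions made for the example.

```python
# Lookup rows from the example above, in arrival order: (deptid, sales).
LOOKUP_ROWS = [(1, 1000), (2, 2000), (3, 3000), (2, 5000), (3, 6000)]


def match(lookup_rows, key, model):
    """Return the lookup rows selected for `key` under the given match model."""
    candidates = [row for row in lookup_rows if row[0] == key]
    if not candidates:
        return []
    if model == "unique":  # also called last match: the last candidate wins
        return [candidates[-1]]
    if model == "first":   # the first candidate wins
        return [candidates[0]]
    if model == "all":     # every candidate is kept
        return candidates
    raise ValueError("unknown match model: " + model)


def join(main_keys, lookup_rows, model):
    """Join a (hypothetical) main flow of keys against the lookup rows."""
    output = []
    for key in main_keys:
        output.extend(match(lookup_rows, key, model))
    return output
```

Calling join([1, 2, 3], LOOKUP_ROWS, "unique") keeps one row per deptid, while "all" keeps all five rows. In this sketch the "all" output is grouped by the main-flow key, so its row order differs from the listing above.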

Jaspersoft 5.1 features

October 01, 2013

Improved Language Support - The web user interface is now available in Brazilian Portuguese.

New Pro Maps, Charts, and Widgets functionality - These elements can now be rendered as HTML5, allowing you to view Fusion content rendered as HTML5. The charts, maps, and widgets are greatly improved, and provide a better visual experience.

Improved Chart Export - The chart rendering export engine has been re-written to eliminate issues with Flash objects not showing properly in export formats such as PDF. Exported charts are created as images during export, which provides visual consistency when viewing charts on various devices. Note that Jaspersoft sometimes recommends using PhantomJS as the report renderer, which can improve output or performance in some circumstances. For more information, see the JasperReports Server Administrator Guide.

Ad Hoc Chart Formatting - The Ad Hoc Editor now provides basic chart formatting options to help you make your charts more readable, such as x-a

Introducing Pentaho 5.0

October 01, 2013

Pentaho 5.0 includes more than 250 new features and improvements. Highlights include:

Simplified analytics and user experience
  New Pentaho User Console and streamlined user interface
  Re-designed experience for administrators
  Industry leading operational reporting for MongoDB

Enterprise-ready big data integration
  Over 100 new features in Pentaho Data Integration
  New functionality to help IT manage huge data volumes efficiently

Simplified Embedded Analytics
  New REST services for third-party applications

New capabilities to blend big data “at the source”
  Architected data blending for complete and accurate analytics
  Deliver “analytics ready” blended data to any user

Running out of memory with tAggregateRow component

October 01, 2013

There can be situations where you have to aggregate millions of records, and tAggregateRow runs out of memory. In that case you have two options:
1. Increase the Java heap space available to the job.
2. Sort the incoming rows using tSortRow and then use tAggregateSortedRow, which is optimized for input that is already sorted and so saves processing time.
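The difference between the two approaches can be illustrated with a small Python sketch. It shows the general idea only (a hash-based aggregation that keeps every group in memory versus a streaming aggregation over sorted input) and makes no claim about how the Talend components are implemented internally; the function names are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_hashed(rows):
    """Sum values per key; every distinct key stays in memory until the end."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def aggregate_sorted(sorted_rows):
    """Sum values per key for input already sorted by key.

    Only the current group is held at any time, and each total is
    yielded as soon as its group ends.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)
```

With rows = [(2, 2000), (1, 1000), (2, 5000)], aggregate_hashed(rows) builds the whole result dictionary first, while aggregate_sorted(sorted(rows)) emits one total per key as it goes. The sort itself still has a cost, which is the trade-off behind option 2.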

Null pointer exception in tMap

August 14, 2013

You might have faced a common error in Talend, the NullPointerException in tMap:

"Exception in component tMap_1 java.lang.NullPointerException"

Solutions:
1) Make sure your columns are nullable when you define the schema.
2) The NullPointerException indicates that there are null values in the lookup table, so you need to handle null values for the corresponding columns in the expression field of tMap, for example:

!Relational.ISNULL(row13.abc) ? row13.xyz : null

Generating Date Dimension

August 13, 2013

Create a new job called date_dim. First we will define a variable for start date, because we will use this job in various projects and we might require a different start date each time: Click on the Context tab and then on the + button to add a new context variable. Give it the name myStartDate of type Date and define a value for it.

Next add a tRowGenerator component to the design area and double click on it to activate the settings dialog. The idea is to create X amount of rows: The first row will hold our start date and each subsequent row will increment the date by one day.

1. Click the + button to add a new column. Name it date and set the type to Date.
2. Click in the Environment variables cell on the right hand side and then you will see the parameters displayed in the Function parameters tab on the bottom left hand side.
3. Define the number of rows that should be generated in Number of Rows for RowGenerator.
4. In t

Difference between 'Insert or Update' and 'Update or Insert'

July 30, 2013

Insert or Update: First tries to insert a record, but if a record with a matching primary key already exists, instead updates that record. Update or Insert: First tries to update a record with a matching primary key, but if none already exists, instead inserts the record. From a results point of view, there are no differences between the two, nor are there significant performance differences. In general, choose the action that matches what you expect to be more common: Insert or Update if you think there are more inserts than updates, Update or Insert if you think there are more updates than inserts.
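The two orderings can be sketched with Python's built-in sqlite3 module to show that they end in the same table state. This is an illustration under assumptions, not what Talend's database output components actually generate; the items table and both function names are invented for the example.

```python
import sqlite3


def insert_or_update(conn, key, value):
    """Try the INSERT first; fall back to UPDATE on a primary-key clash."""
    try:
        conn.execute("INSERT INTO items (id, value) VALUES (?, ?)", (key, value))
    except sqlite3.IntegrityError:
        conn.execute("UPDATE items SET value = ? WHERE id = ?", (value, key))


def update_or_insert(conn, key, value):
    """Try the UPDATE first; INSERT only if no row was touched."""
    cursor = conn.execute("UPDATE items SET value = ? WHERE id = ?", (value, key))
    if cursor.rowcount == 0:
        conn.execute("INSERT INTO items (id, value) VALUES (?, ?)", (key, value))


def load(action, rows):
    """Apply `action` to each row in a fresh in-memory table; return the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
    for key, value in rows:
        action(conn, key, value)
    conn.commit()
    result = conn.execute("SELECT id, value FROM items ORDER BY id").fetchall()
    conn.close()
    return result
```

For the same input, load(insert_or_update, rows) and load(update_or_insert, rows) return identical contents. What differs is which statement is attempted first, and therefore which case costs a second statement.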

Executing multiple commands in tSystem component

July 30, 2013

On Windows

On Windows, you can execute multiple system commands at one time in a tSystem component by connecting the commands with the & symbol:

cmd /c mkdir dir1 & mkdir dir2

On Linux

On Linux, you can execute multiple system commands at one time in a tSystem component by connecting the commands with the ; symbol:

touch file1.txt ; touch file2.txt
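If the command list is assembled programmatically, the two formats can be produced by a small helper. The sketch below is in Python purely for illustration; build_tsystem_command is a hypothetical name, and it only builds the string in the formats shown above, it does not run anything.

```python
def build_tsystem_command(commands, windows):
    """Join several commands into one command line.

    Windows: prefix with "cmd /c" and separate the commands with "&".
    Linux:   separate the commands with ";".
    """
    if windows:
        return "cmd /c " + " & ".join(commands)
    return " ; ".join(commands)
```

For example, build_tsystem_command(["mkdir dir1", "mkdir dir2"], windows=True) gives the Windows line above. Whether the separator is honoured at run time depends on the command being interpreted by a shell.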

Version Control with Talend Open Studio

July 26, 2013

Talend lets you apply version control to your jobs. By default, a job has version 0.1 attached to it when it is first created. To change the version or create a new version of the job, follow these steps:
1. Right click the job in the repository.
2. Select 'Open another version'.
3. In the new window, increment the version of the job using 'm' for a minor or 'M' for a major version.
4. Check the 'Create new version & open it ?' check box.
5. The new version of the job is opened.
6. If you want to reopen a previous version of the job, follow the same steps but do not check 'Create new version & open it ?'.

Scheduling in PDI

July 17, 2013

Once you're finished designing your PDI jobs and transformations, you can arrange to run them at certain time intervals through the DI Server, or through your own scheduling mechanism (such as cron on Linux, and the Task Scheduler or the at command on Windows). The methods of operation for scheduling and scripting are different; scheduling through the DI Server is done through the Spoon graphical interface, whereas scripting using your own scheduler or executor is done by calling the pan or kitchen commands. This section explains all of the details for scripting and scheduling PDI content.

You can schedule your jobs through:
1. Data Integration (DI) Server
2. Manual scripting through pan or kitchen commands

1. DI Server

This method is done through the Spoon graphical interface and is only available for an Enterprise repository. After you design your job, the steps are as follows:
1. Open a job or transformation, then go to the

34 Subsystems of ETL

June 26, 2013

In this post and the series that follows, I will explore the 34 subsystems of ETL Data Integration as defined by the Kimball Group. I introduce the subsystems in this post, and then I will discuss how each fits (or does not fit) into Talend & PDI.

The subsystem concept is a best-practice initiative formulated by the Kimball Group to help organizations design effective and efficient Data Integration environments for Data Warehousing using the Dimensional Model. The Kimball Group categorizes the subsystems into 4 distinct groups: Data Extraction, Cleansing and Conforming Tasks, Data Delivery, and Management.

Data Extraction

1. Data Profiling
Talend: Talend has a separate tool for data profiling & data quality called 'Talend Open Studio for Data Quality'.
Pentaho: The 'DataCleaner' plugin is available for download for this purpose.

2. Change Data Capture (CDC)
Talend: Talend has a built-in trigger-based CDC feature which can be applied easily. (En

Talend Certified Consultant

June 14, 2013

Talend - Writing multiple SQL queries in tOracleRow component

March 21, 2013

Typically a row component in Talend or JasperETL is used to execute only one SQL statement. There might be a situation where you want to write multiple SQL statements inside a single component. To achieve this, you simply embed your queries in a single BEGIN ... END; block.

Example:

"begin
  Insert into ERRORS(Errorcode,error_category,errordescription) values(1,'ERR2','FATAL ERROR');
  Insert into ERRORS(Errorcode,error_category,errordescription) values(2,'ERR2','WARNING');
  Insert into ERRORS(Errorcode,error_category,errordescription) values(3,'ERR2','CRITICAL ERROR');
  commit;
end;"
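When the statements come from a list rather than being typed by hand, the wrapping can be automated. The following Python sketch only builds the text of such a block; wrap_in_plsql_block is a hypothetical helper, not part of Talend or of any Oracle driver, and it assumes the statements contain no semicolons of their own apart from an optional trailing one.

```python
def wrap_in_plsql_block(statements, commit=True):
    """Wrap SQL statements in a single anonymous begin ... end; block.

    Each statement is terminated with exactly one semicolon, and an
    optional commit is appended before the block is closed.
    """
    body = [s.strip().rstrip(";") + ";" for s in statements if s.strip()]
    if commit:
        body.append("commit;")
    indented = "\n".join("  " + line for line in body)
    return "begin\n" + indented + "\nend;"
```

The returned string can then be used as the query text of the row component, in the same shape as the example above.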

Talend: Error Recovery

January 29, 2013

Error recovery is a feature of the Enterprise edition of Talend, called Talend Integration Suite (TIS). It is a mechanism that allows the user to restart a job from a particular point if an error occurs while the job is executing.

Steps to activate error recovery:
1. While designing the job, define one or several OnSubjobOk trigger connections as “checkpoints”. Make sure you have a remote repository, which this functionality requires.

Accessing Error Recovery Management from the Job Conductor page:
On the toolbar, click Recover last execution to display the Error Recovery Management page. This page has two horizontal parts: the upper part shows the Task execution monitoring list, and the lower part shows the Execution Info and Recovery checkpoints tabs. See the following sections for a detailed description of the views associated with these tabs.