[Logo] MyDWBI---致力于打造最专业的中文数据仓库,商务智能社区
  [Groups]首页  [Blog] 博客   [Search] 搜索   [Recent Topics] 最新主题   [Hottest Topics] 热门主题   [Hottest Download] 热门下载   [Members]  会员列表  
[Register] 会员注册 / 
[Login] 登入 
REAL OPEN CASE STUDY: remove duplicates when extract data from huge flat files and relational DB  XML
论坛首页 » Informatica
前往:   
发表人 内容
accel99cn

初级会员

注册时间: 2008-07-06 11:46:25
文章: 9
离线

Hi INFA Buddies,

I have a real life application case to implement, hope all you Informatica guys partipate in the solution/mapping design:

Scenario:
we need to merge customer data from 3 application sources to the master customer database:
2 huge flat files, about 10G size each (one is downstreamed from a financial application on mainframe computer daily to the integration/INFA server,Unix OS; another one is downloaded from another application), the 3rd customer data records reside in the table in relational database for another application.
there are duplicate records in the all sources. that means, duplicates exist in-between both flat files, and also duplicates exist in-between the either of the flat files and customer table in RDBMS.

Objective:
Design a optimal solution/mapping to extract and merge the customer data from the 3 sources, transform and loading the data to a customer master table with best performance while remove the duplicates in-between the souces.

Note:
The issues deserve attention is, these 3 customer data are all quite huge.

Welcome all buddies join the solution design, thanks all INFA guys to contribute your talent!

Ike
accel99cn

初级会员

注册时间: 2008-07-06 11:46:25
文章: 9
离线

I have an initial solution, hope some one could correct and optimize the whole mapping design, especial whenit come to the optimal performance when extract data from these huge files:



1) extraction of data



Based on performance consideration, the Flat file is a huge file, I use external loader( such as SQL*LDR of Oracle) to bulk load staging DB into staging table(CUSTOMER_LoadedfromFlatFile). This script can be setup as a command task in the workflow.



2) Mapping building:



a.
Use a SQTrans to retrieve the two table in staging DB, with a SQL query Override of Union like this:



select * from

(
(

SELECT CUSTOMER_LoadedfromFlatFile.TERRITORY_ID as TERRITORYID , CUSTOMER_LoadedfromFlatFile.CUSTOMER_ID as CUSTOMERID, CUSTOMER_LoadedfromFlatFile.ACCOUNT_NUMBER as ACCOUNT_NUMBER, CUSTOMER_LoadedfromFlatFile.CUSTOMER_TYPE as CUSTOMER_TYPE,

CUSTOMER_LoadedfromFlatFile.MODIFIED_DATE as MODIFIED_DATE

FROM

CUSTOMER_LoadedfromFlatFile

)

Union

( SELECT CUSTOMER.TERRITORY_ID as TERRITORY_ID , CUSTOMER.CUSTOMER_ID as CUSTOMER_ID, CUSTOMER.ACCOUNT_NUMBER as ACCOUNT_NUMBER, CUSTOMER.CUSTOMER_TYPE as CUSTOMER_TYPE,

CUSTOMER.MODIFIED_DATE as MODIFIED_DATE

FROM

CUSTOMER

)
) order by TERRITORYID , CUSTOMERID


Settings for good query performance in the SQTrans:

Sorted inputs: 2

Create SQL commands in Pre SQL: create B-tree index on the key fields group.

Post SQL: truncate the table and drop the index for next loading.



b.
Use a Aggregator TRANS connected to the above SQTrans to remove duplicate record.
Setting: Group by key field, set aggregate function to First() or no setting of aggregate function ; check sorted input; set proper data cache and index size to 2 times of standard calculation in informatica manual if allowed.



c.
for SCD type handling of data change: Use Dynamic cached lookup


Aggregator TRANS --> lookup target table with Dynamic cache enable --> Use a filter trans to filter out unchanged rows (setting NewLookupRow != 0) --> Use Router trans to route the updated rows and new rows --> Use 2 Update strategy trans( one to update and one to insert the target table.



Key setting:

lookup: Dynamic cache enable, associated lookup target surrogate Key port with Sequence-ID;



Router_Insert_Update:

InsertGroup: NewLookupRow = 1

UpdateGroup: NewLookupRow = 2

Upd_Insert Expression: DD_INSERT

Upd_update Expression: DD_UPDATE





Session setting:



Session edit -- > Property: set Treat source row as DATA DRIVEN



Session edit -- > mapping --> Target: Check these attibutes: SELECT, Update as Update, No Select Delete.



Set partition point at SQ Trans and Agregater.



--- End of Solution ----



Is there Any one contribute to the design for the real life scenario?
Make each great day, buddies!

Ike

Music
吉他手


中级会员
[Avatar]

注册时间: 2008-08-13 04:02:03
文章: 28
离线

学习!

音乐就是我的生命.
killandylove

论坛CTO
[Avatar]

注册时间: 2008-04-23 08:51:09
文章: 58
来自: 不透露
离线

accel99cn wrote:I have an initial solution, hope some one could correct and optimize the whole mapping design, especial whenit come to the optimal performance when extract data from these huge files:



1) extraction of data



Based on performance consideration, the Flat file is a huge file, I use external loader( such as SQL*LDR of Oracle) to bulk load staging DB into staging table(CUSTOMER_LoadedfromFlatFile). This script can be setup as a command task in the workflow.



2) Mapping building:



a.
Use a SQTrans to retrieve the two table in staging DB, with a SQL query Override of Union like this:



select * from

(
(

SELECT CUSTOMER_LoadedfromFlatFile.TERRITORY_ID as TERRITORYID , CUSTOMER_LoadedfromFlatFile.CUSTOMER_ID as CUSTOMERID, CUSTOMER_LoadedfromFlatFile.ACCOUNT_NUMBER as ACCOUNT_NUMBER, CUSTOMER_LoadedfromFlatFile.CUSTOMER_TYPE as CUSTOMER_TYPE,

CUSTOMER_LoadedfromFlatFile.MODIFIED_DATE as MODIFIED_DATE

FROM

CUSTOMER_LoadedfromFlatFile

)

Union

( SELECT CUSTOMER.TERRITORY_ID as TERRITORY_ID , CUSTOMER.CUSTOMER_ID as CUSTOMER_ID, CUSTOMER.ACCOUNT_NUMBER as ACCOUNT_NUMBER, CUSTOMER.CUSTOMER_TYPE as CUSTOMER_TYPE,

CUSTOMER.MODIFIED_DATE as MODIFIED_DATE

FROM

CUSTOMER

)
) order by TERRITORYID , CUSTOMERID


Settings for good query performance in the SQTrans:

Sorted inputs: 2

Create SQL commands in Pre SQL: create B-tree index on the key fields group.

Post SQL: truncate the table and drop the index for next loading.



b.
Use a Aggregator TRANS connected to the above SQTrans to remove duplicate record.
Setting: Group by key field, set aggregate function to First() or no setting of aggregate function ; check sorted input; set proper data cache and index size to 2 times of standard calculation in informatica manual if allowed.



c.
for SCD type handling of data change: Use Dynamic cached lookup


Aggregator TRANS --> lookup target table with Dynamic cache enable --> Use a filter trans to filter out unchanged rows (setting NewLookupRow != 0) --> Use Router trans to route the updated rows and new rows --> Use 2 Update strategy trans( one to update and one to insert the target table.



Key setting:

lookup: Dynamic cache enable, associated lookup target surrogate Key port with Sequence-ID;



Router_Insert_Update:

InsertGroup: NewLookupRow = 1

UpdateGroup: NewLookupRow = 2

Upd_Insert Expression: DD_INSERT

Upd_update Expression: DD_UPDATE





Session setting:



Session edit -- > Property: set Treat source row as DATA DRIVEN



Session edit -- > mapping --> Target: Check these attibutes: SELECT, Update as Update, No Select Delete.



Set partition point at SQ Trans and Agregater.



--- End of Solution ----



Is there Any one contribute to the design for the real life scenario?
Make each great day, buddies!

Ike



b和C你可以舍弃掉不考虑,因为inforamtica在数据巨大的情况下,性能会出现很大的瓶颈,因为aggregator和lookup都需要在server端load cache文件,相当于在处理之前,10g sieze的文件是会被分批做为文件load服务器,性能可想而知.
最好的办法来处理这些就是直接用SQL来做处理,分批量的处理。
把10g size的文件根据关键字段拆分成若干个文件,然后和已存在的customer表比较,当然必须有特定的关联信息,比如创建时间等等。
需要关注的问题
1.在10g size的文件里面是否有duplicate的数据
2.对于已经存在的Customer表本身数据量情况。
3.不要局限于工具本身,有的时候,能稳定,快速的解决问题就是最好的方法。

本人中科院高级潜水院院士,诺贝尔长期掉线奖,奥斯卡终身隐身奖!!!
[Email] [WWW] [MSN]
zdcku
嘉颖


高级会员
[Avatar]

注册时间: 2008-05-06 22:32:12
文章: 152
来自: 江苏南京
离线

accel99cn wrote:b.
Use a Aggregator TRANS connected to the above SQTrans to remove duplicate record.
Setting: Group by key field, set aggregate function to First() or no setting of aggregate function ; check sorted input; set proper data cache and index size to 2 times of standard calculation in informatica manual if allowed.



c.
for SCD type handling of data change: Use Dynamic cached lookup


Aggregator TRANS --> lookup target table with Dynamic cache enable --> Use a filter trans to filter out unchanged rows (setting NewLookupRow != 0) --> Use Router trans to route the updated rows and new rows --> Use 2 Update strategy trans( one to update and one to insert the target table.




英文不怎么好 不知道有没有理解错误
1、你用AGG把来源数据都已经distinct了,那后面就不需要Dynamic cached lookup了,你的目标表也肯定是很大的 Dynamic cached lookup性能肯定很差了。
你可能是考虑源有重复的吧,使用Dynamic cached lookup,就是先来了一条记录Smith,插入到目标表,然后又来一条还叫Smith,你用Dynamic cached lookup意思是想再lookup后得到已经插入到目标表了,然后使用你后面的INS/UPD操作吧。但据我实际操作过,Dynamic cached lookup根本不能做到这一点,虽然在联机帮助手册中写的好像能做到这一点,为此我也查过好多的资料,发现好多人都说Dynamic cached lookup做到这点不是很现实,一个类似花瓶的功能。
我觉得你在lookup之前已经把重复的使用AGG(TERRITORYID , CUSTOMERID )去掉了 所以就没必要再使用动态cache了,一点都没必要。而且也会大大降低性能。

2、你使用AGG去掉数据重复值,方法是可以,理论上讲没有任何问题,但你也强调了一点数据量很大10g的文件,换成rows,就很多了,所以我的想法不建议用AGG进行DISTINCT,用EXP trans进行distinct,首先AGG他会先把数据cache起来,慢!而EXP的性能是最高的。方法是在其中建立本地变量来检测数据(TERRITORYID , CUSTOMERID )是否变化,变化就打标志位,然后后面加FILTER trans过滤掉即可,因为之前已经排序好了。

3、你先把文件先load到staging表中,然后再从ST表中加载到infa中,这个也会浪费很多时间,直接把文件作为源,然后用UNION trans,然后用SORT组件排序,但这边要考虑的是把排序交给了infa来处理,而你的方法是让DB来处理sort,得权衡下两者性能,我觉得这么大的数据量,使用infa排序性能会好点。


学习
[MSN]
zdcku
嘉颖


高级会员
[Avatar]

注册时间: 2008-05-06 22:32:12
文章: 152
来自: 江苏南京
离线

不知道有没有说清楚 ,发个mapping的图片解释下吧
[Thumb - 绘图1.jpg]
 文件名称 绘图1.jpg [Disk] 下载
 描述 Mapping描述
 文件大小 37 Kbytes
 下载次数:  6 次

这篇文章被编辑了 1 次. 最近一次更新是在 2008-08-26 21:43:17


学习
[MSN]
 
论坛首页 » Informatica
前往:   

网站地图 |  联系我们 |   |  招聘版主 |  免责声明 |  意见建议 |  系统帮助 | 
Copyright © 2008, mydwbi.com, All Rights Reserved | Powered by JForum 2.1.8 © JForum Team