OPS Training
Schema brief introduction
https://kdb.homecredit.cn/display/DEVDST/Naming+Conventions
DATA_LAKE: Maintains the same table structure as the source business system. (Initial load: date_effective=1000-01-01, rows are almost distinct by id. Daily load: date_effective=D-1, may contain duplicate rows.)
SHOW PARTITIONS ---- in the output, the first column listed is the partition key.
This layer stores data for every point in time (all historical snapshots).
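To inspect how a DATA_LAKE table is partitioned, SHOW PARTITIONS can be run in Hue or impala-shell (the table name below is hypothetical):

```sql
-- List the partitions of a data_lake table; the first column in the
-- output is the partition key (typically the load date).
SHOW PARTITIONS data_lake.dlk_contract;
```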
CORE (CORE_SEN): Stores almost all historical data, deduplicated. Tables in this layer must be designed by us, and their source tables are data_lake tables. CORE_SEN: sensitive data.
PCT_XXX_ACTL: actual data
REF_XXX: reference table (code/mapping relationships)
For historical reasons, some old core tables also live in the pre_core schema.
CORE_SEN holds sensitive information; by default the full data is not open.
The CORE layer stores all historical data.
DIM (DIM_SEN): dimension tables, shared between the CORE layer and the DM layer. The dimension layer holds both SCD1 and SCD2 dimensions. Public. DIM_SEN: sensitive data (access must be applied for).
Standard dimension tables ---- commonly agreed, shared dimensions.
SCD2 dimensions carry a validity period (e.g. date_valid_to).
DM_COMMON/DM_OPS…(DM_SEN): basic data marts, usually storing low-granularity (lightly aggregated) data. Public. DM_SEN: sensitive data (access must be applied for).
Stage: temporary tables; intermediate results that other layers need to persist during calculation.
Used for temporary storage in the production environment; intermediate storage, not exposed externally.
Whole process
Model Design(logic mapping)→Model Review→Code Develop(3 scripts)→UT(initial & daily)→Code Review(git)→UAT(initial & daily)→PROD(XXL + CR + git merge)
Model design (framework/structure design, recording the design rationale) → design review (by the modeling team) → development (DDL script / initial-load script / daily-load script)
Model Design Process
Sort out the metrics: hierarchy / granularity / period. Output: a free-form (non-standardized) Excel file.
DM/DIM table structure: determine what information, columns, and granularity each table stores.
Define the columns and granularity, including the required technical columns.
Columns Conventions:
dtime_load: load time; usually the system time when the job runs.
date_effective: roughly, an attribute that tracks the business date (the age) of the data.
partition key (optional): use an existing attribute directly, or use ptk_[attribute name] for an attribute whose type is time (the time must at least be cast to a date). Consider partitioning only when a table exceeds roughly one hundred million rows. Dimensions: only needed for SCD2 dimensions whose data volume is quite large; date_valid_to can be used as the partition key, which makes it much easier to get the current version of the data. (Add the ptk_ prefix; use the validity period as the partition; currently valid rows are defined as date_valid_to = 3000-01-01. Not every table needs partitions; the data volume decides.)
id_process: batch id; can be used to track load status and analyze error causes. (The batch number is passed in automatically as a parameter.)
flag_deleted: identifies that a given record was soft-deleted, especially for detail or transactional data. Default value 'N' = available, 'Y' = soft-deleted. (Whether the source record was deleted; can be derived with a CASE WHEN.)
src_ins_time (optional): record-inserted time from the source table.
src_mod_time (optional): record-modified time from the source table.
src_id_creator (optional): record created-by from the source table.
src_id_modifier (optional): record modified-by from the source table. (If these exist in the DLK table, include them where possible.)
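The column conventions above can be sketched as a Hive/Impala-style DDL; the table and business columns below are hypothetical, only the technical columns follow the conventions in this document:

```sql
-- Hypothetical CORE table illustrating the standard technical columns.
CREATE TABLE core.pct_example_actl (
  hkp_example    STRING    COMMENT 'hash primary key, e.g. md5(...)',
  name_example   STRING    COMMENT 'business attribute (placeholder)',
  flag_deleted   STRING    COMMENT "'Y' = soft-deleted, 'N' = available",
  date_effective DATE      COMMENT 'business date of the data',
  dtime_load     TIMESTAMP COMMENT 'job execution (system) time',
  id_process     BIGINT    COMMENT 'batch id, for load tracking'
)
PARTITIONED BY (ptk_date_effective DATE)  -- only if data volume warrants it
STORED AS PARQUET;
```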
Output: part of wiki (only basic Information and table structure)
eg. https://kdb.homecredit.cn/display/DEVDST/lm_d_client
Define the string types, the primary key, the partition key, and which source table each column comes from; keep the table structure clear.
Rough out the logic and find the source tables: source tables may come from DIM/DM/CORE; you can verify in Hue and the wiki whether a source table has already been designed. https://hue.homecreditcfc.cn/hue/
https://kdb.homecredit.cn/display/DEVDST/Dimension or Keyword + Search data
If not, we need to design the dm/dim/core tables ourselves; read the EA or DWH packages. https://ea.homecredit.net/HomeDW/
Trace the logic down to the int/in tables. e.g. https://ea.homecredit.net/HomeDW/hosel2016.3_da/
DWH: owner_int.vh_/mv_/in_/int_→ Hadoop: data_lake.dlk_
Output: a list of source tables. If a data_lake table does not yet exist in Hadoop, create a Jira for the Batch team. e.g. https://jira.homecredit.net/jira/browse/DPDCN-5684
The main goal is to find the logic and the required columns.
Data logic:
Existing tables ---- check whether DWH has a similar table, or Hadoop has similar logic.
1. Find the logic of existing tables on the wiki — prefer tables such as contract.
2. Search related keywords in the non-data-lake layers and study the logic of related tables.
Not existing ---- new table:
Find the table logic in EA / download the DWH packages, focusing on the int-layer tables (look for the corresponding table structures in Hadoop data_lake; if a table is missing, raise a DPDCN Jira).
Design CORE tables:
Output: Logic Mapping eg.https://kdb.homecredit.cn/display/DEVDST/lm_pct_hsis_province_actl
General situation:
Table structure is basically the same as DATA_LAKE.
Use a hash key as the primary key: md5(concat(code_source_system, system_id)) or md5(business_id).
If the table has foreign keys, use hash keys as foreign keys as much as possible; the raw id can also be used if performance issues are difficult to resolve. Foreign hash keys need a default value; the function repeat('0', 32) can be used, but still pay attention to the validity of the foreign keys.
Necessary technical columns: flag_deleted(case when code_change_type = 'D' then 'Y' else 'N' end), date_effective, dtime_load, id_process.
Common reference tables are also designed in the Core layer; such a table does not need a hash key, but is otherwise much like the pct tables. e.g. https://kdb.homecredit.cn/display/DEVDST/lm_ref_ivr_type
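The hash-key, foreign-key default, and technical-column rules above might look like this in a load script. A minimal sketch: the source table, id columns, and the ${...} scheduler parameters are all hypothetical placeholders:

```sql
-- Hypothetical daily load from a DATA_LAKE table into a CORE table.
INSERT OVERWRITE TABLE core.pct_example_actl
SELECT
  md5(concat(t.code_source_system, cast(t.id as string)))     AS hkp_example,
  -- foreign hash key: fall back to 32 zeros when the parent id is missing
  coalesce(md5(cast(t.id_parent as string)), repeat('0', 32)) AS hkf_parent,
  case when t.code_change_type = 'D' then 'Y' else 'N' end    AS flag_deleted,
  t.date_effective,
  current_timestamp()                                         AS dtime_load,
  ${id_process}                                               AS id_process
FROM data_lake.dlk_example t
WHERE t.date_effective = '${load_date}';
```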
Special situation
When several tables have different sources but the same structure, and are always used together, it is possible to UNION ALL the tables into one core table.
When two DATA_LAKE tables are always used together and one of them is a code-list table or a simple table, we can join them into a single core table.
If no column in DATA_LAKE can be determined to be unique, a composite primary key can be used, e.g. md5(concat(column_1, column_2, column_3, …)). (Primary-key choice: several columns together identify a row uniquely.)
For a common reference table whose source is EXT or a manual import, we need to confirm the owner and follow the Batch team's process. Ask Kunbin.Zhang or read the following wiki for details. (Find the owner for manual maintenance ---- see the wiki for the maintenance process.)
https://kdb.homecredit.cn/display/DEVDST/Data+synchronization+preparation
Naming conventions: pay attention to the data types, especially the double cast to string: cast(cast([column_name] as bigint) as string). (Cast the column to bigint first, then to string.)
https://kdb.homecredit.cn/display/DEVDST/Naming+Conventions#NamingConventions-CoreAttribute
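One common reason for the double cast (an assumption about its intent here, consistent with typical Hive/Impala behavior): casting a floating-point or decimal value straight to string can yield artifacts such as a trailing ".0" or scientific notation, whereas going through bigint normalizes the text. The table and column names below are hypothetical:

```sql
-- Normalize a numeric id before hashing or joining: cast to bigint first,
-- then to string, so e.g. 123.0 and 123 produce the same text value.
SELECT
  cast(cast(t.id_contract as bigint) as string) AS id_contract_str
FROM data_lake.dlk_contract t;
```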
CORE_SEN: sensitive data, e.g. phone numbers, identification numbers… (Sensitive information goes in the core_sen layer.)
CORE-layer data ---- consistent with the data_lake layer.
Add a primary key (hkp ---- md5 function; note that it exists only under Impala). For a system-generated id with no business meaning, hash system name + id;
if the id has business meaning, use md5(business id).
Foreign keys: same logic as primary keys; if a record cannot be matched, use 32 zeros; foreign-key values must not go beyond the primary keys they reference.
flag_deleted is generated with a CASE WHEN; dtime_load can simply take the current system time.
REF tables hold code/mapping relationships.
Design DIM tables:
Output: Complete Logic Mapping eg https://kdb.homecredit.cn/display/DEVDST/lm_d_contract
Main points:
Make sure this is a standard dimension; otherwise a core table is enough. (It must be a dimension that is commonly understood as standard.)
If the table needs to keep historical data, design an SCD2 dimension table. Pay attention to how the initial-load data is obtained. If the data volume is large, it is recommended to use DATE_VALID_TO as the partitioning key. (Historical data requires a slowly changing dimension; watch how the change history is loaded.)
The source tables must not come directly from data_lake. If the logic is very complex, temporary tables can be created in the stage layer. (The data comes out of the core layer.)
Dim tables need default values and a default row. (Add the default values and default row; see the wiki.)
https://kdb.homecredit.cn/display/DEVDST/Naming+Conventions#NamingConventions-DefaultValue
Keep the number of source tables to a minimum, but still pay attention to the validity of the foreign keys. (Foreign keys must correspond to primary keys.)
DIM_SEN: sensitive data (access must be applied for). Create a companion table in DIM that stores the sensitive values as '******'. (Mask sensitive columns with six asterisks; sensitive data may therefore need two tables, one masked and one unmasked.)
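The SCD2 and masking points above can be illustrated with two small queries. A sketch only: the d_client table and its columns are hypothetical, and it assumes the convention from this document that currently valid rows carry date_valid_to = 3000-01-01:

```sql
-- SCD2 dimension: fetch only the currently valid version of each row.
-- When date_valid_to is also the partition key, this prunes to one partition.
SELECT hkp_client, name_client
FROM dim.d_client
WHERE date_valid_to = '3000-01-01';

-- Masked companion table in DIM for sensitive columns:
SELECT hkp_client,
       '******' AS num_phone  -- sensitive value replaced by six asterisks
FROM dim_sen.d_client_sen;
```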
Design DM tables:
Output: Complete Logic Mapping eg https://kdb.homecredit.cn/display/DEVDST/lm_f_call_routepoint_pqh
(If the logic is complex, include the SQL in the mapping: join relationships, how the data is generated, filter conditions.)
Main points:
Try not to design detail tables; instead, base the business on low-granularity aggregations of the metrics. (Do not put detail data here; do light aggregation; the detail stays in the core layer.)
If detailed data is necessary and the logic is simple, design and use core tables only; if the logic is complex, determine from business needs and data volume how long to retain the data. If there is a large amount of data but all the history must occasionally be accessed, store the full history in dm_arch, while dm_common/dm_ops keep only recent data.
(For very large detail tables, the data-mart layer cannot store the full history; clearly note the retention window in the table's basic information. History that is accessed only infrequently can go to arch, which is not open externally.)
When there is a large amount of data, the choice of partitioning key is very important, and it is necessary to use the common business date as the partitioning key (ease of updating also …)
[…]
…iewpage.action?spaceKey=IBD&title=ETL+Script+Deployment
Wiki: https://kdb.homecredit.cn/display/DEVDST/Model+specification+of+Hadoop+Platform
Data dictionary
YARN: http://prod-hadoop-master-01.homecreditcfc.cn:8088/cluster/apps/RUNNING
Can be used to check the progress of a task and to investigate the cause of an error.
GitLab: https://git.homecreditcfc.cn/it-dev/dst-modeling
Upload the script for code review
Jupyterhub: https://jupyterhub.homecreditcfc.cn/ Can be used to debug Python/Scala code.