OPS Training 注释篇

本文由用户“ahwhynxxxx”分享发布 更新时间:2021-05-06 15:16:04 举报文档

以下为《OPS Training 注释篇》的无排版文字预览,完整格式请下载

下载前请仔细阅读文字预览以及下方图片预览。图片预览是什么样的,下载的文档就是什么样的。

OPS Training

Schema brief introduction

https://kdb.homecredit.cn/display/DEVDST/Naming+Conventions

DATA_LAKE: Maintains the same table structure as the source business system. (Initial load: date_effective=1000-01-01, almost distinct data by id. Daily load: date_effective=D-1, have duplicate data)

Show partitions ----查分区第一个字段即为分区字段

存储所有时间节点数据

CORE(CORE_SEN): Almost save all historical and non-duplicate data. We need design and source tables are data_lake tables. CORE_SEN: sensitive data.

PCT_XXX_ACTL: actual data

REF_XXX: reference table 对照关系

For historical reasons, some old core tables also in schema pre_core.

Core_sen 涉及敏感信息 默认不开放全部数据

Core层存储历史所有数据

DIM(DIM_SEN): dimensional table, shared between CORE layer and DM layer. Dimension layer for both of SCD1 and SCD2 type of dimension. Public. DIM_SEN: sensitive data(Need apply).

标准维度表---公认的信息共用维度表

SCD2 设置有效期

DM_COMMON/DM_OPS…(DM_SEN): basic data mart , usually store low granularity. Public. DM_SEN: sensitive data(Need apply).

Stage: temporary table, other layers calculate the temporary tables that need to be stored.

做临时表存储 生产环境 中间存储不对外访问

Whole process

Model Design(logic mapping)→Model Review→Code Develop(3 scripts)→UT(initial & daily)→Code Review(git)→UAT(initial & daily)→PROD(XXL + CR + git merge)

模型设计(框架结构设计,记录设计思路)→review设计思路(模型组)→开发(建表语句/初始语句/daily语句)

Model Design Process

Metrics sort out: hierarchy/ granularity/ period. Output: nonstandard excel

DM/DIM tables structure: Determine what information/columns/ granularity are stored in each table.

定义列及力度 columns/技术要求字段

Columns Conventions:

dtime_load: which means Load time. It is usually system time. 数据执行时间

date_effective: It is somehow like a data age monitor attribute.

partition key(optional): use existing attribute directly, or use ptk_[attribute name] for attribute which type is time(actually we need at least cast time to date). when there is more than one hundred million rows , then we will consider partition. Dimensions: only needed for SCD2 type of dimension ,meanwhile data is quite huge. we can use date_valid_to as partition key. then it will be much easier to get the actual version of data. 加ptk前缀,用有效期做partition 截止目前有效 定义为3000/01/01 不是所有表都需要建分区 数据量决定

id_process: which means batch id and can be used to track load status and analyze error reason. 批次号 自动上传参数

Flag_deleted: which identify that given record be deleted softly . especially for some detail or transactional data. Default value 'N'= available . and 'Y'= softly deleted. 原始数据是否被删除 可以用case when写进去

src_ins_time(optional):Record inserted time from source table,

src_mod_time(optional):Record modified time from source table, optional.

src_id_creator(optional):Record created by from source table, optional.

src_id_modifier(optional):Record modified by from source table, optional. dlk中是否存在 存在尽量加进去

Output: part of wiki (only basic Information and table structure)

eg. https://kdb.homecredit.cn/display/DEVDST/lm_d_client

定义字符串类型 主键 partition key 来源于哪张表 表结构清晰

Simple combing logic and find source tables: Source table may come from DIM/DM/CORE, can verify from Hue and Wiki whether the source table has been designed. https://hue.homecreditcfc.cn/hue/

https://kdb.homecredit.cn/display/DEVDST/Dimension or Keyword + Search data

If not, we need redesign dm/dim/core. read EA or DWH package. https://ea.homecredit.net/HomeDW/

Learn the logic up to the int/in tables.eg. https://ea.homecredit.net/HomeDW/hosel2016.3_da/



DWH: owner_int.vh_/mv_/in_/int_→ Hadoop: data_lake.dlk_

Output: Source tables list, if data_lake tables not in Hadoop now, need create Jira to Batch team.eg. https://jira.homecredit.net/jira/browse/DPDCN-5684

主要是找逻辑及字段需求

数据逻辑:

已有表 ---dwh是否有相似表存在 Hadoop相似逻辑

1.在wiki上找现有表逻辑—倾向contract表之类

2.在非data lake层找相关关键字 看相关表逻辑

非现有----新表

EA上找表逻辑/DWH下载各种包 关注int层表(在hadoop data lake找相关表结构,没有表提jira DPDCN)

Design CORE tables:

Output: Logic Mapping eg.https://kdb.homecredit.cn/display/DEVDST/lm_pct_hsis_province_actl

General situation:

Table structure is basically the same as DATA_LAKE.

Use hash key as primary key: md5(concat(code_source_system, system _id)) or md5(business_id).

If have foreign keys, need to use hashkey as a foreign key as much as possible, id can also be used if performance issues are difficult to resolve. Foreign hash keys need add default value, can use the function repeat('0'.32), but still need to pay attention to the validity of foreign keys.

Necessary technical columns: flag_deleted(case when code_change_type = 'D' then 'Y' else 'N' end), date_effective, dtime_load, id_process.

Common reference table also design in Core layer, this table need not hashkey, other almost like pct tables.eg. https://kdb.homecredit.cn/display/DEVDST/lm_ref_ivr_type

Special situation

When several tables have different sources but same structure, and always used together, it is possible to union all tables into one core table.

When two DATA_LAKE tables are always used together, and one is a code list table or a simple table. We can join them together to do one core table.

If none of column in DATA_LAKE can be determined to be unique, the composite primary key can be used. eg. md5(concat(column_1, column_2,column_3,…)). 主键选择 多个值可以识别行的唯一性

Common reference table but source table from EXT or manual import, we need confirm owner and follow the process of batch team. Ask Kunbin.Zhang or read the following wiki for details. 找到owner手动维护---wiki查维护流程

https://kdb.homecredit.cn/display/DEVDST/Data+synchronization+preparation

Name Convention: Pay attention to the Data type, especially the double transpose string: cast(cast([column_name] as bigint) as string) 字段转bigint再转

https://kdb.homecredit.cn/display/DEVDST/Naming+Conventions#NamingConventions-CoreAttribute

CORE_SEN: sensitive data. eg. Phone number, Identification Number… 敏感信息放core_sen层

Core层数据—和data lake层一致/

加主键(hkp----md5函数 只在impala下存在)系统生成id没有业务意义—系统名+id

有业务意义md5(business id)

外键:逻辑相同/连不上 32个0/ 外键不能比主键多

Flag delete 用case when生成、dtime load获取当前系统时间即可

Reference表 对照关系码

Design DIM tables:

Output: Complete Logic Mapping eg https://kdb.homecredit.cn/display/DEVDST/lm_d_contract

Main points:

Need to make sure that this is a standard dimension, otherwise you just need a core table. 需要明确为大众理解的标准维度

If table needs to save historical data, need to design a SCD2 dimension table. Notice how initial load data is obtained. If the data volume is large, it is recommended to use DATE_VALID_TO as the partitioning key. 历史数据需要缓慢变化维、怎么load数据注意变化过程

Source table can not from data_lake directly. If the logic is very complex, can create temporary tables at the stage layer. core层出数据

Dim table need add default value and default row. 增加默认值和默认行 wiki查

https://kdb.homecredit.cn/display/DEVDST/Naming+Conventions#NamingConventions-DefaultValue

The number of source tables should be kept to a minimum, but still need to pay attention to the validity of foreign keys. 外键和主键对应关系

DIM_SEN: sensitive data(Need apply), create a table on DIM save sensitive data as '******'. 加密字段 对加密数据做6个星的处理 涉及加密数据可能需要两个表(加密处理&没有加密数处理)

Design DM tables:

Output: Complete Logic Mapping eg https://kdb.homecredit.cn/display/DEVDST/lm_f_call_routepoint_pqh

逻辑复杂需要加SQL语句(关联关系 数据生成 筛选关系)

Main points:

Try not to design the detail table, but rather to base the business on the low granularity aggregation of Metris. 不放detail数据 做轻度聚合 明细在core层

If detailed data is necessary but logic is simple, can only design and use core tables, if logic is complex, according to business needs and amount of data to determine how long to retain data. If there is a large amount of data but it is necessary to occasionally access all the historical data, all the historical data should be stored in dm_arch. dm_common/dm_ops still store recent data.

数据量非常大且是明细表 data mart层不能存储全部历史数据 标注好数据存储时间范围----basic information中标注清楚。历史数据需要访问但不频繁可放arch 但不向外开放

When there is a large amount of data, the choice of partitioning key is very important, and it is necessary to use the common business date as the partitioning key (Ease of updating also n 内容过长,仅展示头部和尾部部分文字预览,全文请查看图片预览。 iewpage.action?spaceKey=IBD&title=ETL+Script+Deployment

Wiki: https://kdb.homecredit.cn/display/DEVDST/Model+specification+of+Hadoop+Platform

Data dictionary

YARN: http://prod-hadoop-master-01.homecreditcfc.cn:8088/cluster/apps/RUNNING

Can be used to check the progress of a task and report the cause of an error.

GitLab: https://git.homecreditcfc.cn/it-dev/dst-modeling

Upload the script for code review

Jupyterhub: https://jupyterhub.homecreditcfc.cn/ Can be used to debug Python/Scala code.

[文章尾部最后500字内容到此结束,中间部分内容请查看底下的图片预览]请点击下方选择您需要的文档下载。

  1. The_Necklace_Guy_de_Maupassant
  2. Flotherm瞬态分析
  3. XX中学高二年级线上教学-学生用卷三
  4. 河南专升本英语作文范文八篇
  5. 42号传祺(英文剧本)
  6. 高一英语必修一-unit1-friendship-课件
  7. F6 日期
  8. -学年**_*XX高XX区七年级第二学期期末英语试卷(Word版,无答案)
  9. 牛津译林版高二 选修6 Unit4 知识点 学案 (1)
  10. Unit 6 I'm watching TV.同课异构的教案
  11. 人教版高一英语教材Friendship warming-up课件
  12. XX省2019-2020学年牛津译林版英语七年级下册限时提高训练:7BU5 Amazing thin
  13. 外部消防栓系统
  14. G10 Pre-CAS
  15. SystemTap_Beginners_Guide
  16. Location Based TOA Algorithm for UWB Wireless Body
  17. Compression strength of stainless steel cross-sect
  18. _家庭环境、教养方式和人格对青少年违法的影响及影响路径
  19. 战略成本管理外文翻译

以上为《OPS Training 注释篇》的无排版文字预览,完整格式请下载

下载前请仔细阅读上面文字预览以及下方图片预览。图片预览是什么样的,下载的文档就是什么样的。

图片预览