SparkSQL DataFrames Operations


The emp and dept tables already exist in Hive:

select * from emp;
+--------+---------+------------+-------+-------------+---------+---------+---------+
| empno  |  ename  |    job     |  mgr  |  hiredate   |   sal   |  comm   | deptno  |
+--------+---------+------------+-------+-------------+---------+---------+---------+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17  | 800.0   | NULL    | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-2-20   | 1600.0  | 300.0   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-2-22   | 1250.0  | 500.0   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-4-2    | 2975.0  | NULL    | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-9-28   | 1250.0  | 1400.0  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-5-1    | 2850.0  | NULL    | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-6-9    | 2450.0  | NULL    | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-4-19   | 3000.0  | NULL    | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17  | 5000.0  | NULL    | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-9-8    | 1500.0  | 0.0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-5-23   | 1100.0  | NULL    | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-3   | 950.0   | NULL    | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-3   | 3000.0  | NULL    | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-1-23   | 1300.0  | NULL    | 10      |
+--------+---------+------------+-------+-------------+---------+---------+---------+

select * from dept;
+---------+-------------+-----------+
| deptno  |    dname    |    loc    |
+---------+-------------+-----------+
| 10      | ACCOUNTING  | NEW YORK  |
| 20      | RESEARCH    | DALLAS    |
| 30      | SALES       | CHICAGO   |
| 40      | OPERATIONS  | BOSTON    |
+---------+-------------+-----------+

 

Testing common DataFrame operations

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val emp = hc.table("emp")    // create a DataFrame from an existing Hive table

emp.dtypes.foreach(println)   // list every column name and type
    (empno,IntegerType)
    (ename,StringType)
    (job,StringType)
    (mgr,IntegerType)
    (hiredate,StringType)
    (sal,DoubleType)
    (comm,DoubleType)
    (deptno,IntegerType)
    
emp.columns.foreach(println)  // list every column name
    empno
    ename
    job
    mgr
    hiredate
    sal
    comm
    deptno

emp.printSchema    // print the schema
    root
        |-- empno: integer (nullable = true)
        |-- ename: string (nullable = true)
        |-- job: string (nullable = true)
        |-- mgr: integer (nullable = true)
        |-- hiredate: string (nullable = true)
        |-- sal: double (nullable = true)
        |-- comm: double (nullable = true)
        |-- deptno: integer (nullable = true)

emp.explain  // show the physical execution plan
== Physical Plan ==
HiveTableScan [empno#0,ename#1,job#2,mgr#3,hiredate#4,sal#5,comm#6,deptno#7], (MetastoreRelation default, emp, None), None

emp.show  // shows 20 rows by default
    empno ename  job       mgr  hiredate   sal    comm   deptno
    7369  SMITH  CLERK     7902 1980-12-17 800.0  null   20    
    7499  ALLEN  SALESMAN  7698 1981-2-20  1600.0 300.0  30    
    7521  WARD   SALESMAN  7698 1981-2-22  1250.0 500.0  30    
    7566  JONES  MANAGER   7839 1981-4-2   2975.0 null   20    
    7654  MARTIN SALESMAN  7698 1981-9-28  1250.0 1400.0 30    
    7698  BLAKE  MANAGER   7839 1981-5-1   2850.0 null   30    
    7782  CLARK  MANAGER   7839 1981-6-9   2450.0 null   10    
    7788  SCOTT  ANALYST   7566 1987-4-19  3000.0 null   20    
    7839  KING   PRESIDENT null 1981-11-17 5000.0 null   10    
    7844  TURNER SALESMAN  7698 1981-9-8   1500.0 0.0    30    
    7876  ADAMS  CLERK     7788 1987-5-23  1100.0 null   20    
    7900  JAMES  CLERK     7698 1981-12-3  950.0  null   30    
    7902  FORD   ANALYST   7566 1981-12-3  3000.0 null   20    
    7934  MILLER CLERK     7782 1982-1-23  1300.0 null   10 

emp.show(10)  // show a specific number of rows

emp.limit(5).show
emp.head(3)
emp.head   // equivalent to head(1)
emp.first  // equivalent to head(1)
val emp_as = emp.as("emp_as")   // give the DataFrame an alias
emp_as.select("empno","ename","deptno").show

// select specific columns:
emp.select("empno","ename","deptno").show
emp.select($"empno",$"ename",$"deptno").show
emp.selectExpr("empno", "ename as name", "substr(ename,0,4)").show     // SQL expressions / UDFs can be used here
emp.select($"empno",$"sal"+100).show  // add 100 to sal


// filter by condition:
emp.filter("empno>7698").show
emp.filter($"empno" > 7698).show
emp.where($"empno" > 7698).show

// sorting:
emp.sort("empno").show  // ascending by default
emp.sort($"empno").show
emp.sort($"empno".desc).show
emp.sort($"deptno", $"empno".desc).show  // sort by multiple columns

emp.orderBy($"empno").show 
emp.orderBy($"empno".desc).show 
emp.orderBy($"deptno", $"empno".desc).show
    
// grouping:
emp.groupBy("deptno").count.show
emp.groupBy($"deptno").avg().show        // average of every numeric column
emp.groupBy($"deptno").avg("sal").show   // average of sal
emp.groupBy($"deptno").agg("sal"->"max").show   // max of sal
emp.groupBy($"deptno").agg("sal"->"min").show   // min of sal
emp.groupBy($"deptno").agg("sal"->"sum").show   // sum of sal
emp.groupBy($"deptno").agg("sal"->"avg").show   // average of sal
// aggregation functions supported inside agg: avg / max / min / sum / count
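
Several aggregations can also be computed in one pass with the column functions; a sketch, assuming the standard functions import:

import org.apache.spark.sql.functions.{max, avg, count}
emp.groupBy($"deptno").agg(max($"sal"), avg($"sal"), count($"empno")).show   // max/avg salary and headcount per dept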


// join:
val dept = hc.table("dept")
dept.show
emp.join(dept,emp.col("deptno") === dept.col("deptno"),"left_outer").show
emp.join(dept,emp.col("deptno") === dept.col("deptno"),"right_outer").show
emp.join(dept,emp.col("deptno") === dept.col("deptno"),"inner").show
emp.join(dept, $"emp.deptno" === $"dept.deptno", "inner").select("empno","ename","dname").show
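
When both sides share a column name (deptno here), explicit aliases keep the qualified references unambiguous; a sketch using illustrative alias names e and d:

val e = emp.as("e")
val d = dept.as("d")
e.join(d, $"e.deptno" === $"d.deptno", "inner").select($"e.empno", $"e.ename", $"d.dname").show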

 

Using DataFrames together with SQL

val emp_dept = emp.join(dept,emp.col("deptno") === dept.col("deptno"),"left_outer")
emp_dept.registerTempTable("emp_dept_temp")
hc.sql("select count(*) from emp_dept_temp").collect

 

Using DataFrames with Hive and the MySQL JDBC external data source:

Prepare the data in MySQL:

DROP TABLE IF EXISTS `dept`;
CREATE TABLE `dept` (
  `deptno` int(11) NOT NULL DEFAULT '0',
  `dname` varchar(30) DEFAULT NULL,
  `loc` varchar(30) DEFAULT NULL,
  PRIMARY KEY (`deptno`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `dept` VALUES ('10', 'ACCOUNTING', 'NEW YORK');
INSERT INTO `dept` VALUES ('20', 'RESEARCH', 'DALLAS');
INSERT INTO `dept` VALUES ('30', 'SALES', 'CHICAGO');
INSERT INTO `dept` VALUES ('40', 'OPERATIONS', 'BOSTON');

 

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val emp = hc.table("emp")    
val dept_jdbc = hc.jdbc("jdbc:mysql://hadoop000:3306/hive?user=root&password=root", "dept")
emp.join(dept_jdbc, emp.col("deptno") === dept_jdbc.col("deptno"), "left_outer").show    
    
    empno ename  job       mgr  hiredate   sal    comm   deptno deptno dname      loc     
    7782  CLARK  MANAGER   7839 1981-6-9   2450.0 null   10     10     ACCOUNTING NEW YORK
    7839  KING   PRESIDENT null 1981-11-17 5000.0 null   10     10     ACCOUNTING NEW YORK
    7934  MILLER CLERK     7782 1982-1-23  1300.0 null   10     10     ACCOUNTING NEW YORK
    7369  SMITH  CLERK     7902 1980-12-17 800.0  null   20     20     RESEARCH   DALLAS  
    7566  JONES  MANAGER   7839 1981-4-2   2975.0 null   20     20     RESEARCH   DALLAS  
    7788  SCOTT  ANALYST   7566 1987-4-19  3000.0 null   20     20     RESEARCH   DALLAS  
    7876  ADAMS  CLERK     7788 1987-5-23  1100.0 null   20     20     RESEARCH   DALLAS  
    7902  FORD   ANALYST   7566 1981-12-3  3000.0 null   20     20     RESEARCH   DALLAS  
    7499  ALLEN  SALESMAN  7698 1981-2-20  1600.0 300.0  30     30     SALES      CHICAGO 
    7521  WARD   SALESMAN  7698 1981-2-22  1250.0 500.0  30     30     SALES      CHICAGO 
    7654  MARTIN SALESMAN  7698 1981-9-28  1250.0 1400.0 30     30     SALES      CHICAGO 
    7698  BLAKE  MANAGER   7839 1981-5-1   2850.0 null   30     30     SALES      CHICAGO 
    7844  TURNER SALESMAN  7698 1981-9-8   1500.0 0.0    30     30     SALES      CHICAGO 
    7900  JAMES  CLERK     7698 1981-12-3  950.0  null   30     30     SALES      CHICAGO 
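
hc.jdbc is the Spark 1.3-era API; a sketch of the same join with the later DataFrameReader JDBC API (assumes a SparkSession named spark with Hive support):

val props = new java.util.Properties()
props.setProperty("user", "root")
props.setProperty("password", "root")
val emp2  = spark.table("emp")
val dept2 = spark.read.jdbc("jdbc:mysql://hadoop000:3306/hive", "dept", props)
emp2.join(dept2, emp2("deptno") === dept2("deptno"), "left_outer").show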

 

Using DataFrames with Parquet and the MySQL JDBC external data source:
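
The original post leaves this section empty; a minimal sketch of the likely flow in the same 1.3-era API, with /tmp/emp_parquet as an assumed placeholder path:

emp.saveAsParquetFile("/tmp/emp_parquet")              // Spark 1.4+: emp.write.parquet(...)
val emp_parquet = hc.parquetFile("/tmp/emp_parquet")   // Spark 1.4+: hc.read.parquet(...)
emp_parquet.join(dept_jdbc, emp_parquet("deptno") === dept_jdbc("deptno"), "inner").show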

 

