Join
is very common in SQL, but usually, the types used are inner join
, left join
, right join
, and outer join
. This article introduces what semi join
and anti join
are, using PySpark
as an example.
Scenario
You work at an e-commerce company. Typically, customer data and transaction data are not stored in the same table. To associate customer names with transaction records, you need to use join
. There is a customers
table storing customer names and cities, and a transactions
table storing customer IDs, cities, and transaction amounts as follows:
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.getOrCreate()
df_customers = spark.createDataFrame([
Row(id=1, name='John', city='CityA'),
Row(id=2, name='Irene', city='CityB'),
Row(id=3, name='Allen', city='CityA'),
])
df_customers.createOrReplaceTempView("customers")
df_transections = spark.createDataFrame([
Row(id=1, transection_value=100, city='CityA'),
Row(id=3, transection_value=20, city='CityA'),
])
df_transections.createOrReplaceTempView("transections")
Today, you want to find the names of customers living in cities with purchase records. You can do this:
spark.sql("""
SELECT
name
FROM customers
WHERE city IN (
SELECT DISTINCT(city)
FROM transections
)
""").show()
Ooutput:
+-----+
| name|
+-----+
| John|
|Allen|
+-----+
Conversely, if you want to find the names of customers living in cities without purchase records, you can do this:
spark.sql("""
SELECT
name
FROM customers
WHERE city NOT IN (
SELECT DISTINCT(city)
FROM transections
)
""").show()
Output:
+-----+
| name|
+-----+
|Irene|
+-----+
Semi Join
You only need the data from table A, but it must also appear in table B. This is where you can use semi join
, which matches our requirement of needing information from customers
but requiring city to appear in both customers
and transactions
:
spark.sql("""
SELECT
c.name
FROM customers c
LEFT SEMI JOIN transections t ON c.city = t.city
""").show()
Ooutput:
+-----+
| name|
+-----+
| John|
|Allen|
+-----+
Anti Join
anti join
is the opposite of semi join
. You only need data from table A, but it should not appear in table B. To find the names of customers living in cities without purchase records, you can do this:
spark.sql("""
SELECT
c.name
FROM customers c
LEFT ANTI JOIN transections t ON c.city = t.city
""").show()
Output:
+-----+
| name|
+-----+
|Irene|
+-----+
Note: Not every SQL variant supports semi join
and anti join
syntax.