2-Minute Read

365 words

Join is very common in SQL, but usually, the types used are inner join, left join, right join, and outer join. This article introduces what semi join and anti join are, using PySpark as an example.

Scenario

You work at an e-commerce company. Typically, customer data and transaction data are not stored in the same table. To associate customer names with transaction records, you need to use join. There is a customers table storing customer names and cities, and a transactions table storing customer IDs, cities, and transaction amounts as follows:

from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.getOrCreate()
df_customers = spark.createDataFrame([
    Row(id=1, name='John', city='CityA'),
    Row(id=2, name='Irene', city='CityB'),
    Row(id=3, name='Allen', city='CityA'),
])
df_customers.createOrReplaceTempView("customers")
df_transections = spark.createDataFrame([
    Row(id=1, transection_value=100, city='CityA'),
    Row(id=3, transection_value=20, city='CityA'),
])
df_transections.createOrReplaceTempView("transections")

Today, you want to find the names of customers living in cities with purchase records. You can do this:

spark.sql("""
SELECT
  name
FROM customers
WHERE city IN (
  SELECT DISTINCT(city)
  FROM transections
)
""").show()

Ooutput:

+-----+
| name|
+-----+
| John|
|Allen|
+-----+

Conversely, if you want to find the names of customers living in cities without purchase records, you can do this:

spark.sql("""
SELECT
  name
FROM customers
WHERE city NOT IN (
  SELECT DISTINCT(city)
  FROM transections
)
""").show()

Output:

+-----+
| name|
+-----+
|Irene|
+-----+

Semi Join

You only need the data from table A, but it must also appear in table B. This is where you can use semi join, which matches our requirement of needing information from customers but requiring city to appear in both customers and transactions:

spark.sql("""
SELECT
  c.name
FROM customers c
LEFT SEMI JOIN transections t ON c.city = t.city
""").show()

Ooutput:

+-----+
| name|
+-----+
| John|
|Allen|
+-----+

Anti Join

anti join is the opposite of semi join. You only need data from table A, but it should not appear in table B. To find the names of customers living in cities without purchase records, you can do this:

spark.sql("""
SELECT
  c.name
FROM customers c
LEFT ANTI JOIN transections t ON c.city = t.city
""").show()

Output:

+-----+
| name|
+-----+
|Irene|
+-----+

Note: Not every SQL variant supports semi join and anti join syntax.

References

  1. PySpark Join Types
  2. Stackoverflow Discussion
comments powered by Disqus

Recent Posts

Categories

Tags