Comparing two tables for equality with Spark SQL

One of the simplest ways to check whether two tables contain exactly the same data is to compute a hash of every row, sum those hashes for each table, and then compare the two sums. The benefit of this technique is that it works no matter how many fields there are and no matter what data types the fields have. You can use the following queries to do the comparison:

SELECT SUM(HASH(*)) FROM t1;
SELECT SUM(HASH(*)) FROM t2;

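Of course, if the schemas of the two tables differ (for example in the number, types, or order of their columns), the row hashes will in general differ, and so will the sums. Keep in mind that equal sums are strong evidence that the tables match rather than a proof, because different rows can hash to the same value.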
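
If you would rather get a single true/false answer than compare two numbers by eye, the two queries can be combined into one. The following is only a sketch, assuming the tables are still called t1 and t2; the aliases a and b and the column names row_count, hash_sum and tables_match are made up for illustration. It also compares row counts as a cheap extra check, and it uses the null-safe operator <=> because SUM returns NULL for an empty table.

```sql
-- Sketch: compare row counts and hash sums of t1 and t2 in one query.
-- a, b, row_count, hash_sum and tables_match are illustrative names.
SELECT
  a.row_count = b.row_count
    AND a.hash_sum <=> b.hash_sum AS tables_match  -- <=> treats NULL = NULL as true
FROM
  (SELECT COUNT(*) AS row_count, SUM(HASH(*)) AS hash_sum FROM t1) a
CROSS JOIN
  (SELECT COUNT(*) AS row_count, SUM(HASH(*)) AS hash_sum FROM t2) b;
```

Because the hashes are summed, the result does not depend on the order in which the rows are stored, so the check works even when the two tables were written or partitioned differently.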
