The best way to compare two tables and determine whether they are exactly the same is to calculate a hash sum for each table and then compare the sums. The benefit of the technique below is that no matter how many fields there are, and no matter what data types those fields may be, you can use the following query to do the comparison:
SELECT SUM(HASH(*)) FROM t1;
SELECT SUM(HASH(*)) FROM t2;
Of course, if the schemas of the two tables differ, this will by default produce different hash values.
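If you are working in a notebook, the two sums can also be compared programmatically. Below is a minimal PySpark sketch, assuming a running SparkSession named spark and two tables named t1 and t2:

# Minimal sketch: compare the hash sums of two tables (t1 and t2 are assumed table names)
hash_t1 = spark.sql("SELECT SUM(HASH(*)) AS hash_sum FROM t1").collect()[0]["hash_sum"]
hash_t2 = spark.sql("SELECT SUM(HASH(*)) AS hash_sum FROM t2").collect()[0]["hash_sum"]

print("Tables match" if hash_t1 == hash_t2 else "Tables differ")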
INSERT INTO a table with the VALUES option, as offered by other SQL variants, is not supported in Spark SQL as of now. For single record inserts, the example below provides two options:
--CREATE test table
CREATE TABLE TestSchema.InsertTest USING DELTA AS (SELECT 1 AS row_id, 'value1' AS field_1, 'value2' AS field_2);

--INSERT INTO test table
INSERT INTO TestSchema.InsertTest SELECT t.* FROM (SELECT 2, 'value3', 'value4') t;

--INSERT INTO test table while aliasing field names
INSERT INTO TestSchema.InsertTest SELECT t.* FROM (SELECT 3 AS row_id, 'value5' AS field_1, 'value6' AS field_2) t;

--Confirm insert
SELECT * FROM TestSchema.InsertTest;
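If you would rather do the single-record insert from Python than SQL, a one-row DataFrame can be appended to the table instead. Below is a minimal sketch, assuming a running SparkSession named spark, the TestSchema.InsertTest table created above, and illustrative values for the new row:

# Minimal sketch: append one row to the existing Delta table from PySpark
new_row = spark.createDataFrame(
    [(4, "value7", "value8")],
    ["row_id", "field_1", "field_2"],
)
new_row.write.format("delta").mode("append").saveAsTable("TestSchema.InsertTest")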
Use case: sometimes files are processed where nulls are represented with text like "NULL", meaning the field is not actually empty or null.
Below are some Python functions and a test demonstrating the functionality.
import pandas as pd

def getListOfMissingValues():
    """ desc:   List of common words used to represent null that are often found in files as text """
    lst = ['NaN', 'NAN', 'nan', 'null', 'NULL', 'nul', 'NUL', 'none', 'NONE', '', ' ', ' ']
    return lst

def getListOfFieldNames(df):
    """ desc:   Helper assumed by advanceMissingValues; returns the DataFrame's column names as a list """
    return list(df.columns)

def advanceMissingValues(df):
    """ desc:     Count nulls and hardcoded text that represents nulls
        param p1: DataFrame name
        return:   DataFrame of field names and count values """
    lstMissingVals = getListOfMissingValues()
    col_list = getListOfFieldNames(df)
    output = pd.DataFrame(col_list)
    output.rename(columns={0: 'FieldName'}, inplace=True)
    output['Count'] = ''

    # For each field name count nulls and other null type values
    for col in col_list:
        nullCnt = df[col].isnull().sum(axis=0)

        # For each missing value perform count on column
        missValCnt = 0
        for missVal in lstMissingVals:
            missValCnt = missValCnt + len(df[df[col] == missVal])

        cntTotal = nullCnt + missValCnt
        output.loc[output['FieldName'] == col, 'Count'] = cntTotal

    return output
# Test setup
lst = ['NaN', 'NAN', 'nan', 'null', 'NULL', 'nul', 'NUL', 'none', 'NONE', '', ' ', ' ', None]
mdf = pd.DataFrame(lst)
mdf.rename(columns={0: 'NullTypes'}, inplace=True)
print(mdf)

# Run test
chk = advanceMissingValues(mdf)
chk
# Python code to demonstrate
# Make headers snake case
import pandas as pd

# Initialise data of lists
data = {'First Name': ['Tom', 'nick', 'krish', 'jack'], 'Age of Person': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output
print(df)

# Make headers snake case
df.columns = [x.lower() for x in df.columns]
df.columns = df.columns.str.replace("[ ]", "_", regex=True)

# Print the output
print(df)
# Step 1
# Find and replace schemaName
# Find and replace tableName

# Step 2
# Find the table
# Via Databricks run the Spark SQL query below
# default is schema, change as needed
DESC FORMATTED schemaName.tableName;

# Step 3
# From the table returned scroll down to "location" and copy the field value
# Find and replace locationFieldValue

# Step 4
# Via Databricks using Spark SQL drop the table
DROP TABLE tableName;

# Step 5
# Find and replace locationFieldValue
# By the means you use to interact with Databricks File System (dbfs), e.g. cmd python virtual environment
# Run the command below
dbfs rm -r "locationFieldValue"
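The same clean-up can also be scripted inside a Databricks notebook so the location does not have to be copied by hand. Below is a minimal Python sketch, assuming a running SparkSession named spark, the dbutils helper available on Databricks clusters, and the schemaName.tableName placeholders used above:

# Minimal sketch: look up the table location, drop the table, then delete the underlying files
location = (spark.sql("DESC FORMATTED schemaName.tableName")
                 .filter("col_name = 'Location'")
                 .collect()[0]["data_type"])

spark.sql("DROP TABLE schemaName.tableName")

# True enables recursive delete of the table's directory
dbutils.fs.rm(location, True)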