The issue at hand is: I have
a table codeName,
where code is constant, but name could be changed. So there are occasions where
a code can have multiple names in this table, thus the need to delete duplicated
names. It is easy to get the code having multiple names:
select code, count(*)
from codeName
group by code
having count(*)>1
The difficult part is to leave the newest name in the table
while removing all other names. There is a column “date” which
specifies when a name is entered into the table, so the requirement is to leave
the name with the newest “date” in the table. To find out the
code with the latest naming date:
select code, date, count(*)
from(
select * from codeName
order by code, date desc
)a
group by code
having count(*)>1
(It is interesting that MYSQL
allows to have fields in the select clause that are not in the grouping functions, in
the above sql, date is not in the group by clause, and yet it is allowed. I recall other
relational database doesn’t support this?)
The nature of the
problem distil to: how to number the records in each groups. If this can be
done, than the solution is as easy as to delete all the records except the first
one.
Other relational
databases (Oracle e.g.) have analytical functions that can do this, MYSQL doesn’t
have it. The trick is to use session variables:
delete b.* from codeName b
inner join (
select * from (
SELECT code, name, date,
@rank := IF(@code = code, @rank + 1, 1) AS rank,
@code := code
FROM codeName
ORDER BY code, date DESC
)a
where a.rank!=1
)c
where b.code=c.code and b.date=c.date
@rank, @code are session
variables, you do not need to declare them beforehand. @rank := IF(@code = code, @rank + 1, 1) AS rank functions
just as a simple if statement, the
first time it was invoked, @code is NULL, so @rank is assigned 1; Since the query is order by code,
date desc, if there are multiple names for the same code, @rank
will be incremented.
Another interesting
MYSQL syntax is on delete b.* from, it is sytax-wrong to say delete from.
In the above example,
session variables function just as a simple if procedure, but apparently, it
is more complicated than that, e.g.
select @rank:=IF(@rank is not NULL, @rank+1, 1 ) as rank, name from codeName
order by name
select @rank
This shows @rank=1, which is incorrect.
Open another session:
set @rank=null
select @rank:=IF(@rank is not NULL, @rank+1, 1 ) as rank, name from codeName
order by name
select @rank
This shows @rank=3172,which is correct.
The only difference is
that the second one declare @rank first, but it declares it to be null, so it shouldn’t
matter?
Another tricky thing is
that when using sqlalchemy, apparently, you have to declare session
variables first, otherwise, you will get the result as in the first example (e.g.
@rank
will always be 1). So in the original problem, there are two session variables
which should be declared separately:
connection = engine.connect()
sql0="set @rank=0" sql1="set @code=''"
connection.execute(sql0)
connection.execute(sql1)
…
connection.close()
No comments:
Post a Comment