[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
DES, MMX, and FPGAs
In DES key search, the speed of a given machine is controlled by three
main factors: clock speed, the algorithm used, and the availability of
multiple issue (the last refers to perfoming 2 or more instructions in
parallel).
Clock speed can be easily factored out for comparison purposes.
Algorithm is harder; a bitslice algorithm is a clear win on 64 bit
machines with lots of registers, but is roughly equivalent to
non-bitslice algorithms on 32 bit machines with a low register
count (eg, Intel processors, even using MMX).
Multiple issue can greatly speed up a specificly tuned assembly
language version. I can't speak for the non-intel processors below,
but the Pentium code below is all tuned for the vanilla Pentium, and
does not get a boost from the PPro or P2.
Let's do the numbers:
d.net peak keys per second: 34,430,460,000
cpk (clocks per key) = (equiv * clock speed)/34B
Processor clock (MHz) equiv cpk
DEC Alpha 321064 142 (4)
DEC Alpha 21066 220 (4)
DEC Alpha 21064 533 11,264 176 (1)
DEC Alpha 21164 (Deschall) 94 (3)
Sun Ultra I 167 15,316 75 (1)
SGI r10k 69 (4)
Macintosh PowerPC 604e 200 68,859 403 (1)(2)
Macintosh PowerPC 604e 200 137 (4)
Intel Pentium II 333 22,393 219 (1)(6)
Intel Pentium Pro 232 (4)(6)
Intel Pentium 166 41,712 203 (1)
Intal Pentium (Bryddes) 195 (5)
Intel 486DX2 66 399,374 775 (1)
Intel 386SX 20 7,446,033 4380 (1)
(1) D.net press release.
(2) This figure comes from the press release. I don't believe it
for a moment. The next line gives a speed determined from a
posting to the d.net mailing list, and is much more believable -
if anything, it appears a little slow.
(3) Deschall client
(4) based on d.net mailing list posting.
(5) Sven Mikkelsen's page.
(6) I'm not aware of any fielded bitslice software for Intel MMX
platforms. I suspect that it would not speed things up too much -
there are simply too few registers, and you'd spend a lot of time
moving stuff in and out of cache.
I wouldn't be suprised if some of the figures above are incorrect,
but it's clear that the best processors are getting down to 70-100
clock cycles per key.
Wiener's keysearch engine unrolls the 16 rounds of DES, and is thus
able to run in a similar way to an assembly line; once it's 'filled
up' it tests 1 key per clock cycle. Any other version would run a
lot slower.
Some highly uninformed blather about FPGAs follows:
Wiener's paper uses custom VLSI chips, and claims a need for
26k "gates" or 78257 "sites". I'm not sure how this translates to
silicon requirements in FPGAs - it seems that "gate" and "site" are
nominal terms which are not directly equivalent to actual, physical
gates. I've found one reference to a general DES FPGA implementation
which requires 25k gates, though it's not clear if this reuses the
round circuitry, or 'unrolls' it (I suspect the latter, which would
be good).
I'm also not certain if a FPGA running at, for example, 50 MHz, can
actually run logic operations at that speed. Some things I've read
suggest that FPGAs run ops at several clocks per operation - which
is still a lot better than a general purpose microprocessor.
The fastest and largest FPGAs I'm aware of are the Xilinx 4000
series: The xc40125 has about 80,000 gates, and runs at around
100 MHz (and costs $1500!!).
Suppose we use this chip, and everything works out optimally. We
can fit 3 des engines per chip, and get 300 Mkeys/sec checked. To
duplicate d.net's speed, we would need about 115 chips, or about
$170k for the chips alone.
A better choice would be the Xilinx Spartan XL series. This might be
in the $20 - $40 a chip, and would run at 80MHz, and would fit only
one des engine. This would be around $9k - $17k, which is still
pretty high for an individual.
At this speed (34Gkey/sec), it would still take 12 days to search
half the keyspace, and the prices quoted are for the chip alone, no
boards, chassies, or SW, and the speeds are assumed to be
1 key/engine/clock, which is optomistic.
On the plus side, implementing a DES cracker in an FPGA actually
affords us some efficiency savings over a non-reconfigurable chip. The
cipher and plaintexts can be compiled into the design, and if you're
willing to reload the chip frequently, the slower-changing portions of
the key can be built in as well. This cuts the gate count
considerably.
I'd really like to see the next DES challenge taken on by some
organization with a few hundred under utilized FPGA boards.
Peter Trei
[email protected]