Introduction

This post is a continuation of the topic presented in an earlier post of mine, SSH Bots Attack! I took a data mining course in the spring semester, which gave me an opportunity to further my research on SSH attack trends.

This post is essentially a paraphrasing of the original slides as presented in the final seminar for the course; these slides and attached materials are linked at the end of this post.

Data domain, general considerations

Data

As in the previous post, I collected baseline information from each login attempt: time, username, password and IP address. All attempts were destined to fail authentication; while low- and high-interaction honeypots certainly exist, I judged them too complex to maintain for a single course.

Note that I will not publish any IP addresses in this post: attackers may no longer be using them, and publishing them could be a GDPR issue (although that was not determined conclusively - any comments?).

Technical considerations

Unlike previously, this time I used a Raspberry Pi as the bait for the attackers. This allowed the honeypot to stay on 24/7, instead of being bound to my sleep rhythm.

For security purposes, the Pi was isolated in its own VLAN, so that in case of a breach it could not serve as a foothold into the rest of the household gadgets. Regular auditing also took place, and nothing untoward was revealed. More importantly, no SUID-capable SSH daemons were put in harm’s way! The attacks were directed to a specialized Python script, an improved version of the one used for the previous post.

Attempts were recorded by the Python script to a text file:

H|1523742300
C|**.***.***.**|VN|YWRtaW4=|YWRtaW4=|1523742318
H|1523742320
C|***.**.***.*|DO|YWRtaW4=|cGFzc3dvcmQ=|1523742328
C|***.***.***.**|BR|YWRtaW4=|ZGVmYXVsdA==|1523742336
H|1523742340

... which was then (pre)processed by a Ruby script. This script calculated various statistical primitives and details, and prepared the data for further processing, e.g. in Octave or GraphViz.
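For illustration, here is a minimal sketch of how such a log could be parsed. It is not the original Ruby preprocessor. The field meanings are assumptions read off the sample above: H lines are taken to be periodic "still running" markers carrying a Unix timestamp, and C lines are taken to be login attempts with a masked IP, a country code, a Base64-encoded username and password, and a Unix timestamp. The names Attempt and parse_log are hypothetical.

```python
import base64
from dataclasses import dataclass


@dataclass
class Attempt:
    ip: str
    country: str
    username: str
    password: str
    timestamp: int


def parse_log(lines):
    """Split raw log lines into heartbeat timestamps and login attempts."""
    heartbeats, attempts = [], []
    for line in lines:
        fields = line.strip().split("|")
        if fields[0] == "H" and len(fields) == 2:
            # Assumed meaning: periodic marker with a Unix timestamp.
            heartbeats.append(int(fields[1]))
        elif fields[0] == "C" and len(fields) == 6:
            # Assumed layout: C|ip|country|b64(username)|b64(password)|timestamp
            _, ip, country, user_b64, pass_b64, ts = fields
            attempts.append(Attempt(
                ip=ip,
                country=country,
                username=base64.b64decode(user_b64).decode("utf-8", "replace"),
                password=base64.b64decode(pass_b64).decode("utf-8", "replace"),
                timestamp=int(ts),
            ))
    return heartbeats, attempts


sample = [
    "H|1523742300",
    "C|**.***.***.**|VN|YWRtaW4=|YWRtaW4=|1523742318",
    "H|1523742320",
    "C|***.**.***.*|DO|YWRtaW4=|cGFzc3dvcmQ=|1523742328",
    "C|***.***.***.**|BR|YWRtaW4=|ZGVmYXVsdA==|1523742336",
    "H|1523742340",
]
heartbeats, attempts = parse_log(sample)
for a in attempts:
    print(a.country, a.username, a.password, a.timestamp)
```

Encoding the credentials presumably keeps a stray | or newline in an attempted password from breaking the line format; decoding with "replace" keeps the parser from failing on bytes that are not valid UTF-8.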

General observations

A total of 97746 login attempts were recorded. These were split into 2487 sequences, where a sequence is a set of continuous (close in time) attacks from the same origin; on average, roughly 39.303 attacks were recorded per sequence.
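One way to picture the sequence split is the sketch below. The post does not say how "close in time" was defined, so the 60-second gap in MAX_GAP_SECONDS is a purely hypothetical threshold, and split_into_sequences is an illustrative name rather than the function actually used. The average is then just attempts divided by sequences (97746 / 2487 ≈ 39.303 for the real data).

```python
from collections import defaultdict

MAX_GAP_SECONDS = 60  # hypothetical threshold; the post does not state one


def split_into_sequences(attempts, max_gap=MAX_GAP_SECONDS):
    """Group (origin, timestamp) attempts into per-origin sequences.

    A new sequence starts whenever the gap to the previous attempt from
    the same origin exceeds max_gap seconds.
    """
    by_origin = defaultdict(list)
    for origin, ts in attempts:
        by_origin[origin].append(ts)

    sequences = []
    for origin, stamps in by_origin.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > max_gap:
                sequences.append((origin, current))
                current = [ts]
            else:
                current.append(ts)
        sequences.append((origin, current))
    return sequences


# Toy data, not taken from the real log.
demo = [("a", 0), ("a", 10), ("b", 12), ("a", 500), ("a", 520)]
seqs = split_into_sequences(demo)
print(len(seqs), "sequences,", len(demo) / len(seqs), "attempts per sequence")
```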

There were roughly 171.648 attacks per hour during the times when the SSH honeypot was running.

The most popular username was root (80109 hits / ~81.96%), and the most popular password was 123456 (1388 hits / ~1.42%). Most attacks arrived from China / CN (75774 hits / ~77.52%), with a total of 93 different countries found (counting unknowns as one country). When countries are sorted by count, the median is 12 attacks.
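Figures like these come from simple tallies. As a hedged sketch (toy data, hypothetical helper name share_table), the hit counts, percentage shares and the median of the per-country counts could be computed like this:

```python
from collections import Counter
from statistics import median


def share_table(values):
    """Return (value, hits, percent of all values) rows, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, 100.0 * n / total) for v, n in counts.most_common()]


# Toy data, not the real dataset; "??" stands in for an unknown country.
countries = ["CN", "CN", "CN", "CN", "VN", "VN", "BR", "??"]
rows = share_table(countries)
print(rows[0])

# Median of the per-country hit counts.
median_hits = median(n for _, n, _ in rows)
print(median_hits)
```

A median far below the top count, as in the real data (12 versus 75774), indicates a long tail of countries that each contribute only a handful of attempts.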

Username-password pairs

There were a total of 27783 unique combinations recorded, of which 41.558% were seen only once, and 69.172% were seen at most 3 times.
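The "seen only once" and "seen at most 3 times" shares are fractions of the unique combinations, not of all attempts. A small sketch of that calculation, using toy data and a hypothetical helper named rarity_shares:

```python
from collections import Counter


def rarity_shares(pairs, limits=(1, 3)):
    """For each limit, percent of unique pairs seen at most `limit` times."""
    counts = Counter(pairs)
    unique = len(counts)
    return {
        limit: 100.0 * sum(1 for n in counts.values() if n <= limit) / unique
        for limit in limits
    }


# Toy data: four unique pairs seen 5, 2, 1 and 1 times.
toy_pairs = (
    [("root", "root")] * 5
    + [("admin", "admin")] * 2
    + [("pi", "raspberry")]
    + [("ubnt", "ubnt")]
)
shares = rarity_shares(toy_pairs)
print(shares)
```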

The most common ones tend to be weak, default passwords; no doubt an attractive target, given how many badly configured, publicly accessible systems are available.

The rarer ones are either unique combinations not commonly seen, or variants of the more common combinations.

Samples of most common single username-password pairs

# Username, Password, Count
admin,default,139
ubnt,ubnt,140
pi,raspberry,145
admin,1234,149
root,password,157
admin,password,167
root,admin,175
support,support,220
admin,admin,243
root,root,245

Association analysis

I attempted to determine whether there are patterns between various attack sequences: are there certain combinations that tend to lead to other combinations being attempted, or perhaps combinations that tend to appear together?

The answer appears to be that there aren’t clear patterns, at least in the subset of common combinations. The strongest association covered only some 8% of the sequences; however, there were confidences in excess of 50%, meaning that these rules tended to apply relatively well to the small subsets they were focused on.

In light of the following examples, perhaps it could be concluded that weak combinations attract other weak combinations as a fishing expedition of sorts; an attacker may as well try all weak combinations, as any of them is plausible for a poorly configured system, and a valid pair would be a lucrative find.

(admin:password) -> (admin:admin), (support 94 transactions / 8.0411%, confidence 60.6452%)
(admin:1234) -> (admin:admin), (support 92 transactions / 7.87%, confidence 71.875%)
(admin:1234) -> (admin:password), (support 83 transactions / 7.1001%, confidence 64.8438%)
(admin:password) -> (admin:1234), (support 83 transactions / 7.1001%, confidence 53.5484%)
(root:admin) -> (admin:admin), (support 81 transactions / 6.929%, confidence 51.9231%)
(admin:1234) -> (root:root), (support 78 transactions / 6.6724%, confidence 60.9375%)
(admin:1234) -> (user:user), (support 72 transactions / 6.1591%, confidence 56.25%)
(user:user) -> (admin:1234), (support 72 transactions / 6.1591%, confidence 67.2897%)
(admin:1234) -> (admin:admin123), (support 69 transactions / 5.9025%, confidence 53.9062%)
(admin:admin123) -> (admin:1234), (support 69 transactions / 5.9025%, confidence 71.875%)
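To make the two measures concrete: the support of a rule A -> B is the share of all transactions (here, attack sequences) that contain both A and B, and its confidence is the share of transactions containing A that also contain B. Back-calculating from the first rule, 94 transactions at 8.0411% support suggests roughly 1169 transactions in the analysed subset, and 60.6452% confidence suggests (admin:password) appeared in about 155 of them; these are inferences from the printed figures, not numbers stated in the original material. A minimal sketch with toy data and a hypothetical rule_stats helper:

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (hits, support, confidence) for antecedent -> consequent.

    hits       = number of transactions containing both items
    support    = hits / number of all transactions
    confidence = hits / number of transactions containing the antecedent
    """
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return len(with_both), support, confidence


# Toy transactions: each set holds the pairs tried within one sequence.
toy = [
    {"admin:password", "admin:admin"},
    {"admin:password", "admin:admin", "root:root"},
    {"admin:password"},
    {"root:root"},
    {"admin:admin"},
]
hits, support, confidence = rule_stats(toy, "admin:password", "admin:admin")
print(hits, support, confidence)
```

Note that confidence is not symmetric: (admin:1234) -> (admin:password) and the reverse rule share the same support of 83 transactions in the list above, yet their confidences differ because the antecedents occur in different numbers of sequences.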

Clustering

In addition to association analysis, I also clustered the username-password pairs found. Sadly, due to performance concerns, I was unable to process the entire dataset in a reasonable time, so the clustering was limited to the most common subset of pairs.

I found that yes, a tendency to form groups does indeed exist; for instance, various weak combinations come in many (relatively slight) variations, which then group together. See the links at the bottom of the post for the more precise data.
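As a rough illustration of how slight variations end up grouped together, the sketch below clusters a few pairs by string similarity. The post does not state which distance measure or algorithm was used, so both choices here are assumptions: Levenshtein edit distance between "username:password" strings, and a naive single-linkage agglomerative merge that stops at a hypothetical distance threshold. The pairs are toy examples. The nested loops also hint at why the full dataset was slow to process: the work grows quickly with the number of pairs.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def single_linkage(items, max_distance):
    """Merge the two closest clusters until none are within max_distance."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(levenshtein(x, y)
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


# Toy pairs, not the real dataset; threshold of 2 edits is hypothetical.
pairs = ["admin:admin", "admin:admin1", "admin:admin123",
         "root:root", "root:toor", "pi:raspberry"]
clusters = single_linkage(pairs, max_distance=2)
for cluster in clusters:
    print(cluster)
```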

Conclusions

Cheers, Arttu

Links to materials

Original Beamer PDF

Hierarchy clustering

Partitioning clustering