-
-
Notifications
You must be signed in to change notification settings - Fork 968
Update GemTypo to use the -/_ variation detection #2341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update GemTypo to use the -/_ variation detection #2341
Conversation
This update changes the method used by GemTypo from the Levenshtein distance comparison with popular Gems with 10 million or more downloads to a simpler check that compares the proposed Gem with all of its - and _ characters removed to all existing Gems with their - and _ characters removed. This comparison is performed completely within PostgreSQL and an index is added to ensure that a table scan is not neccesary. This means that this typo protection can be enforced againstn any other already published Gem rather than just popular ones with 10 million or more downloads. My April 2020 analysis of the typo squatting attack in February of 2020 found that none of the target Gems were protected at the time and the vast majority of the malicious Gem names would have been prevented by this check if it had been in place at the time. Please see https://speakerdeck.com/rietta/securing-the-open-source-software-supply-chain slides 22 and 23 for the publication of my findings at this time. Fix same name error This update fixes the error where a gem that is being upgraded was considered a typo squat. Additionally, I simplified the protected_typo? method a bit and ensured that only one query was neccesary rather than two without overfetching data. Refactor GemTypo tests Updates to index based on internal code review and stats gathering Fix broken test Allow update gem pre-existing Gem that is a typo variant of another For existing data, where there may be pre-existing typo variants such that awesome_gem and awesome-gem both already exist. Then skip typo checking if the current gem name matches either. This does not reduce security in that other validations already prevent a user from publishing a Gem that has the exact same name as a Gem already published by another user. Consider yanked Gems when determining This update considers the yanked_at status for potential exact name matches such that a Gem that has at least one not-yanked version would be considered an exact match. However, a Gem that has all versions yanked and is also a typo squat of another Gem would be determined to be a typo squat. Update GemTypo to use existing case insensitive name index This update removes the unnecsary downcasing of the target Gem name and uses the existing UPPER(name) index on the rubygems table to check for exact name matches.
Any feedback? Concerns? |
Is there anything I can do to help address silent concerns about this approach? While I am not a committer, I am willing to help code review some of the other pull requests as well if that would be helpful. |
We don't have any personal vendetta against you or this PR 😅 Reviews are hard to come by on rubygems.org in general. |
Running rake db:migrate had added these limits to the schema automatically. This reverts them.
Thanks @sonalkr132! For your comparison purposes, I just tested the creation of the index on a fresh copy of the database dump from 2020-05-15: SQL
Results
|
ℹ️ Adding index should not be problem with concurrently strategy since it doesn't lock the table and is safe to do in production without any downtime. Should be enough to change index definition to |
One open question is is it desirable to consider the user ID such that a user may typo squat on his or own own gem. So if there was an awesome_gem then the same author could publish awesome-gem. This has happened before. I do not believe that it is necessarily desirable and leads to confusion. Of course existing gems that are not malicious should be allowed to continue to be updated. |
To get a sense of some of the Gems names that would be blocked, in the PostgreSQL database dump, you can run: SELECT
A.id,
A.name,
A.created_at,
A.updated_at,
B.id AS match_rubygem_id,
B.name AS match_name,
B.created_at AS match_created_at,
B.updated_at AS match_updated_at
FROM rubygems A INNER JOIN rubygems B ON
A.name <> B.name
AND regexp_replace(upper(A.name), '[_-]', '', 'g') = regexp_replace(upper(B.name), '[_-]', '', 'g')
AND A.created_at > B.created_at
ORDER BY updated_at DESC |
Thanks @simi. I just committed this suggested change to the database migration. |
Actually @simi, CREATE INDEX CONCURRENTLY seems to have broken the CI build.
|
87c6454
to
e45e77d
Compare
db/migrate/20200429005140_add_dash_underscore_typo_detection_index_to_rubygems.rb
Show resolved
Hide resolved
8000
This update at the suggestion of @simi, asks PostgreSQL to create the new dashunderscore_typos_idx index in the background without blocking. This will allow the migration to run quickly and cause no downtime.
We use https://github.com/adomokos/light-service pretty heavily in production and I know there has been a PR from @adomokos (adomokos/light-service#62) to change it from |
@ni3t, I have not done the rename itself, it's still named |
@simi What are your thoughts on the gems by the same user aspect? Worth a tweak or should be left out? |
I wouldn't recommend it. Following was pointed to us, I hope we agree that allowing this is just confusing: Perhaps we can update this to match against indexed rubgyems only. |
If there is a legitimate need to rename Gems, I would prefer to see an official method that preserves the old name as a symlink/slug rather than adhoc duplication. |
test/unit/gem_typo_test.rb
Outdated
assert_equal true, gem_typo.protected_typo? | ||
end | ||
gem_typo = GemTypo.new("delayed_job_active_record") | ||
assert_equal false, gem_typo.protected_typo? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain the difference between return false for exact match
and return false for any exact match
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test asserts that the owner of the existing delayed_job_active_record Gem can push an update even though there is an existing typo squat delayed-job-active-record that would otherwise block the update.
test/unit/gem_typo_test.rb
Outdated
end | ||
should "return true for three -/_ character change" do | ||
gem_typo = GemTypo.new("delayed-job-active-record") | ||
assert_equal true, gem_typo.protected_typo? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please justify your reasoning of adding two -/_ change
and three -/_ character change
? I am not sure I understand what additional value are these two adding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two reasons. 1) The regex has a 'g' attribute for replace all, but if it didn't it would only replace the first one. Without at least one of these tests that would not fail if the regex was wrong and only replaced the first match. 2) It keeps these tests parallel with the previous tests.
test/unit/gem_typo_test.rb
Outdated
end | ||
should "return true for three -/_ character change" do | ||
gem_typo = GemTypo.new("delayed-job-active-record") | ||
assert_equal true, gem_typo.protected_typo? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add tests for logic in published_exact_name_matches
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume this also means changing published_exact_name_matches
from being a private method to a public one. That's why I had not written explicit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this doesn't have to mean published_exact_name_matches
to be public. Testing protected_typo?
for behavior when yanked gem exists would work as well. Does such an implicit test already exist and I am just not noticing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is the equivalent to
RubyGem.with_versions.where('upper(rubygems.name) = upper(?)', @rubygem_name)
It is looking for published Gems with the name of our Gem. This is the method that allows grandfathered Gem to actually be published. Otherwise, if it is not indexed and typo squats on another variant it will be blocked.
return false if @rubygem_name.size < GemTypo::SIZE_THRESHOLD | ||
|
||
gem_typo_exceptions = GemTypoException.all.pluck(:name) | ||
return false if gem_typo_exceptions.include?(@rubygem_name) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
app/models/gem_typo.rb
Outdated
class GemTypo | ||
attr_reader :protected_gem | ||
|
||
include Gem::Text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and require on line 1 can be removed.
if: :needs_name_validation? | ||
validate :blacklist_names_exclusion | ||
# validate :protected_gem_typo, on: :create, unless: -> { Array(validation_context).include?(:typo_exception) } | ||
vali 751D date :protected_gem_typo, on: :create, unless: -> { Array(validation_context).include?(:typo_exception) } |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
app/models/gem_typo.rb
Outdated
.pluck(:name) | ||
# This SQL query uses an index and thus does not induce a full table | ||
# scan in PostgreSQL. See | ||
# 20200429005140_add_dash_underscore_typo_detection_index_to_rubygems.rb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we want to keep this comment. It doesn't seem to be explaining anything out of ordinary. May add it to commit message?
db/migrate/20200429005140_add_dash_underscore_typo_detection_index_to_rubygems.rb
Outdated
Show resolved
Hide resolved
Thanks @DavisC0801 for pushing the fix! |
@sonalkr132 Thanks. I think this is ready for another look now. |
app/models/rubygem.rb
Outdated
|
||
return unless gem_typo.protected_typo? | ||
errors.add :name, "'#{name}' is too close to typo-protected gem: #{gem_typo.protected_gem}" | ||
errors.add :name, "Your gem '#{name}' is too similar to typo-protected gem named '#{gem_typo.protected_gem}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change type-protected gem
to an existing gem
. There is no non-typo-protected gem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrote error message on branch, thank you for the suggestion!
# assert_equal "There was a problem saving your gem: Name 'test' is too close to typo-protected gem: best", @response.body | ||
# end | ||
# end | ||
context "On POST to create with a protected gem name" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2B87 td> | context "On POST to create with a protected gem name" do |
context "On POST to create with an underscore or dash variant of an existing gem" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added suggestion on-branch, thank you!
Sorry, something went wrong.
refute gem_typo.protected_typo? | ||
end | ||
|
||
should "return false for an exact match of a yanked gem so a gem with an identical name can be published in the future" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the same as test on line 12.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored tests to test deleted_active_record_gem in order to avoid confusion, test gem was previously named deleted_job_active_record.
end | ||
should "return true for three -/_ character change" do | ||
gem_typo = GemTypo.new("delayed-job-active-record") | ||
assert gem_typo.protected_typo? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test case for delayed-job-activerecord
being protected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added on branch, including additional tests for missing -/_.
app/models/gem_typo.rb
Outdated
return true | ||
end | ||
end | ||
match = matched_protected_gem_names.select(:id, :name).first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we selecting id
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we selecting
id
here?
For backwards compatibility with any existing code that was using the match to render an error message. It selected the object, so this selects the ID and the name as a minimum set needed to instantiate the model for that purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backwards compatibility with any existing code
I don't think there is any existing code that uses id. If there is such code can you please point it out to me? My issue with using ID here is that it will create confusion in the future about why was this added if not used anywhere.
db/schema.rb
Outdated
# It's strongly recommended that you check this file into your version control system. | ||
|
||
ActiveRecord::Schema.define(version: 2019_08_31_172741) do | ||
ActiveRecord::Schema.define(version: 2020_04_29_005140) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update this number. we deployed some migrations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated on branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge conflict has been resolved.
please fix the merge conflict. |
@sonalkr132 The merge conflict appears to have been resolved. |
should respond_with :forbidden | ||
should "not register new gem" do | ||
assert_equal 1, Rubygem.count | ||
assert_equal "There was a problem saving your gem: Name Your gem 'test' is too similar to an existing gem named 't_es-t'", @response.body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please drop Your gem
. Following reads better:
Name 'test' is too similar to an existing gem named 't_es-t'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed on branch, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one minor comment. Otherwise LGTM 👍 We can merge this in a few days if no one else has any further comments.
Anything left to do? |
Not really. I wanted to test this locally before merging. Probably tomorrow or the day after it. |
Thank you for your diligence. It speaks well to your leadership of this project :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, this is not hitting the index added here. please fix that. Thank you.
return true | ||
end | ||
end | ||
match = matched_protected_gem_names.select(:name).first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of using first
here, rails is adding order by
and it is not using the intended index:
rubygems_development=# explain analyze SELECT "rubygems"."name" FROM "rubygems" WHERE "rubygems"."indexed" = true AND (regexp_replace(upper(name), '[_-]', '', 'g') = r
egexp_replace(upper('r__d-oc-js--on---'), '[_-]', '', 'g')) ORDER BY "rubygems"."id" ASC limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..15.43 rows=1 width=18) (actual time=310.622..310.623 rows=1 loops=1)
-> Index Scan using rubygems_pkey on rubygems (cost=0.42..12038.09 rows=802 width=18) (actual time=310.620..310.620 rows=1 loops=1)
Filter: (indexed AND (regexp_replace(upper((name)::text), '[_-]'::text, ''::text, 'g'::text) = 'RDOCJSON'::text))
Rows Removed by Filter: 172345
Planning time: 0.340 ms
Execution time: 310.659 ms
(6 rows)
rubygems_development=# explain analyze SELECT "rubygems"."name" FROM "rubygems" WHERE "rubygems"."indexed" = true AND (upper(name) != upper('r__d-oc-json---') and
regexp_replace(upper(name), '[_-]', '', 'g') = regexp_replace(upper('r__d-oc-js--on---'), '[_-]', '', 'g'));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on rubygems (cost=23.09..1414.05 rows=802 width=14) (actual time=0.048..0.050 rows=1 loops=1)
Recheck Cond: (regexp_replace(upper((name)::text), '[_-]'::text, ''::text, 'g'::text) = 'RDOCJSON'::text)
Filter: (indexed AND (upper((name)::text) <> 'R__D-OC-JSON---'::text))
Heap Blocks: exact=1
-> Bitmap Index Scan on dashunderscore_typos_idx (cost=0.00..22.89 rows=862 width=0) (actual time=0.033..0.033 rows=1 loops=1)
Index Cond: (regexp_replace(upper((name)::text), '[_-]'::text, ''::text, 'g'::text) = 'RDOCJSON'::text)
Planning time: 0.299 ms
Execution time: 0.096 ms
(8 rows)
.pluck(:name) | ||
def matched_protected_gem_names | ||
Rubygem.with_versions.where( | ||
"upper(name) != upper(?) AND regexp_replace(upper(name), '[_-]', '', 'g') = regexp_replace(upper(?), '[_-]', '', 'g')", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upper(name) != upper(?)
seems unnecessary. It is needed please add a test case which would fail if I remove this.
Using `first` was adding order by and it was not hitting index added in #2341. Removed upper(name) != upper(@rubygem_name) as it was redundant when using `published_exact_name_matches`. `rubygems/text` was not used since we removed `levenshtein distance` func.
merged in 86ab6ab Thank you for working on this. |
Thank you. It was a pleasure to contribute. |
This update changes the method used by GemTypo from the Levenshtein distance
comparison with popular Gems with 10 million or more downloads to a simpler
check that compares the proposed Gem with all of its - and _ characters removed
to all existing Gems with their - and _ characters removed.
This comparison is performed completely within PostgreSQL and an index is added
to ensure that a table scan is not necessary. This means that this typo protection
can be enforced against any other already published Gem rather than just popular
ones with 10 million or more downloads.
To support the suitability of this new approach, we conducted an analysis of the malicious gems published in the Reversing Labs data set from February 2020. A rake task was used to use the GemTypo class to check each known malicious gem to for matches. Timing data was collected and written out to a CSV file, which was then loaded into spreadsheet to compute the relevant statistics.
The code for collecting the data was:
With a sample size of 725 database checks, the timing in seconds for checks resulting in:
Please see this attached XLSX for details.
Malicious Gem Detection Timing Statistics Based on New Method and Reversing Labs List of Baddies.xlsx
My April 2020 analysis of the typo squatting attack in February of 2020 found that
none of the target Gems were protected at the time and the vast majority of the
malicious Gem names would have been prevented by this check if it had been in place
at the time. Please see https://speakerdeck.com/rietta/securing-the-open-source-software-supply-chain
slides 22 and 23 for the publication of my findings at this time.