Limits the size of scraped HTML documents to prevent out-of-memory errors. The scraper stops reading from the response when it encounters the closing head tag, or when the size of the content read so far exceeds a maximum limit.
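A minimal sketch of the stop conditions described above, assuming the response is consumed as a stream of byte chunks. The names `read_until_head_end`, `MAX_CONTENT_LIMIT`, the chunk size, and the timeout value are illustrative assumptions, not taken from the actual change:

```python
from typing import Iterable

MAX_CONTENT_LIMIT = 5_000_000  # assumed limit in bytes, for illustration only
END_OF_HEAD = b"</head>"


def read_until_head_end(chunks: Iterable[bytes],
                        max_bytes: int = MAX_CONTENT_LIMIT) -> bytes:
    """Collect chunks until the closing head tag is seen or the limit is exceeded."""
    content = bytearray()
    for chunk in chunks:
        # Re-scan the last few bytes so a tag split across two chunks is found.
        tail_start = max(0, len(content) - (len(END_OF_HEAD) - 1))
        content += chunk
        pos = content[tail_start:].lower().find(END_OF_HEAD)
        if pos != -1:
            end = tail_start + pos + len(END_OF_HEAD)
            return bytes(content[:end])
        if len(content) > max_bytes:
            # Stop reading; the result may exceed max_bytes by up to one chunk.
            break
    return bytes(content)


def load_page(url: str) -> bytes:
    # requests is a third-party package; imported here so the helper above
    # stays usable without it.
    import requests

    # stream=True defers downloading the body until iter_content is consumed.
    with requests.get(url, stream=True, timeout=10) as response:
        return read_until_head_end(response.iter_content(chunk_size=4096))
```

Because the body is streamed, leaving the `with` block closes the connection without downloading the rest of the document.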
Fixes #345
The Wayback Machine Save API only allows a limited number of requests within a given timespan. This introduces several changes to avoid rate limit errors:
- At most one attempt is made to create a new snapshot
- If a new snapshot cannot be created, the latest existing snapshot is used instead
- Bulk snapshot updates (bookmark import, load missing snapshots after login) will only attempt to load the latest snapshot instead of creating new ones
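The fallback behaviour listed above could be sketched roughly as follows. The two callables stand in for whatever client code talks to the Wayback Machine; `SnapshotError`, `resolve_snapshot_url`, and the `allow_create` flag are hypothetical names used only for this illustration:

```python
from typing import Callable, Optional


class SnapshotError(Exception):
    """Hypothetical error raised when a snapshot cannot be created or found."""


def resolve_snapshot_url(
    url: str,
    create_snapshot: Callable[[str], str],
    load_latest_snapshot: Callable[[str], str],
    allow_create: bool = True,
) -> Optional[str]:
    """Return a snapshot URL for `url`, or None if none is available."""
    if allow_create:
        try:
            # Exactly one attempt, with no retries, to stay under the rate limit.
            return create_snapshot(url)
        except SnapshotError:
            pass  # fall through to the latest existing snapshot
    try:
        return load_latest_snapshot(url)
    except SnapshotError:
        return None
```

In this sketch, bulk operations such as a bookmark import would call the function with `allow_create=False`, so they only ever look up existing snapshots.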
* Add links to remove tags from current query
* Display selected tags in tag cloud
* Add tag cloud tests
* Fix tag cloud in archive
* Add tests for bookmark views
* Expose parse query string
* Improve tag cloud tests
* Cleanup
* Fix rebase issues
* Ignore casing when removing tags from query
Co-authored-by: Jon Hauris <jonp@hauris.org>
* Allow marking bookmarks as shared
* Add basic share view
* Ensure tag names in tag cloud are unique
* Filter shared bookmarks by user
* Add link for filtering by user
* Prevent n+1 queries when rendering bookmark list
* Prevent empty query params in return URL
* Fix user select template tag name
* Create shared bookmarks through API
* List shared bookmarks through API
* Show bookmark suggestions for shared view
* Show unique tags in search suggestions
* Sort user options
* Add bookmark sharing feature flag
* Add test for share setting default
* Simplify settings view
* Add apple-touch-icon reference in header
Adding this reference is recommended so that an icon is shown when the web app is added to an iOS home screen.
* Add dedicated apple touch icon
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
* Add POST archived API endpoint
* Update API docs
* Expose is_archived in existing POST endpoint
* Add test to verify bookmark not archived by default
* Fix JSON payload in API docs
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
* Avoid stall on web scraping
This patch fixes a stall during web scraping.
I encountered a stall (scraping never finished) when adding
a bookmark for a certain site.
To avoid this, pass a timeout parameter to the requests.get()
function.
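A small sketch of passing a timeout, assuming a simplified load_page; the 10 second default and the injectable `get` parameter are illustrative assumptions and not part of the patch:

```python
def load_page(url, timeout=10.0, get=None):
    """Fetch a page, giving up if the server stops responding."""
    if get is None:
        # requests is a third-party package; imported lazily so a stub
        # can be injected instead (for example in tests).
        import requests
        get = requests.get
    # Without a timeout, requests.get() can wait indefinitely for a server
    # that accepts the connection but never sends a response.
    response = get(url, timeout=timeout)
    return response.content
```

Note that the requests timeout limits the wait for a connection and for each read from the socket, not the total download time.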
Signed-off-by: Taku Izumi <admin@orz-style.com>
* Avoid character corruption when scraping some Japanese sites
This patch fixes character corruption when scraping some Japanese
sites. To avoid the corruption, the load_page function now uses
r.content instead of r.text.
I think the cause is an encoding problem. r.text decodes the
response using the encoding that requests guesses, so if the
site's actual charset differs from that guess, character
corruption occurs. r.content returns the raw bytes, which avoids
the encoding problem.
Signed-off-by: Taku Izumi <admin@orz-style.com>
* Use charset_normalizer to determine response encoding
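To illustrate the encoding issue, here is a sketch that first reproduces the corruption with the standard library and then shows one way the raw bytes could be decoded with charset_normalizer. The helper name `decode_with_detection`, the fallback encoding, and the sample text are assumptions for this example, not the project's code:

```python
def decode_with_detection(content: bytes, fallback: str = "utf-8") -> str:
    """Decode raw response bytes using the encoding detected from the bytes."""
    # charset_normalizer is a third-party package (also a requests dependency).
    from charset_normalizer import from_bytes

    match = from_bytes(content).best()  # None if no plausible encoding is found
    encoding = match.encoding if match else fallback
    return content.decode(encoding, errors="replace")


# Reproducing the corruption: a Shift_JIS page decoded with the wrong charset.
original = "日本語のページ"
raw = original.encode("shift_jis")   # comparable to r.content
garbled = raw.decode("iso-8859-1")   # comparable to r.text after a wrong guess
restored = raw.decode("shift_jis")   # decoding with the correct charset
```

Detection on very short inputs can be unreliable, so a real implementation would likely run it on the full downloaded content.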
Co-authored-by: Taku Izumi <admin@orz-style.com>
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>