I'm working on uploading files, and I ran into a case where two PPT files had the same text content but differed in other ways when opened in PowerPoint. Because the text was the same, the second file was marked as a duplicate.
The immediate problem was that the track id for the "duplicate" didn't resolve to anything, so there was no way to find out what had happened through the API.
Talking this over with a colleague, we came up with two ideas. The first is to base the doc id on the md5 of the whole file rather than just the text content. You would end up with duplicate chunks and so on, but it is an option. The second is to add a status such as "duplicate" that behaves like a failure: nothing is inserted or chunked, but the track id still resolves and the status is available for review.
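To make the two ideas concrete, here is a rough sketch in plain Python. It is not the project's actual code: `doc_id_from_text`, `doc_id_from_bytes`, `TrackStore`, the `"doc-"` prefix, and the status names are all hypothetical, and the in-memory store only stands in for whatever the real pipeline uses.

```python
import hashlib
from dataclasses import dataclass, field


def doc_id_from_text(text: str) -> str:
    """Hypothetical: ID derived from the extracted text only."""
    return "doc-" + hashlib.md5(text.encode("utf-8")).hexdigest()


def doc_id_from_bytes(raw: bytes) -> str:
    """Hypothetical: ID derived from the bytes of the whole file."""
    return "doc-" + hashlib.md5(raw).hexdigest()


@dataclass
class TrackStore:
    """Hypothetical in-memory stand-in for the upload status store."""
    docs: dict = field(default_factory=dict)      # doc_id -> track id of first upload
    statuses: dict = field(default_factory=dict)  # track id -> status record

    def submit(self, track_id: str, doc_id: str) -> dict:
        if doc_id in self.docs:
            # Skip insert/chunking, but still record something the
            # track id can resolve to.
            record = {
                "status": "duplicate",
                "doc_id": doc_id,
                "duplicate_of": self.docs[doc_id],
            }
        else:
            self.docs[doc_id] = track_id
            record = {"status": "processed", "doc_id": doc_id}
        self.statuses[track_id] = record
        return record
```

With the first idea, two files that share text but differ in their bytes get different ids from `doc_id_from_bytes`, so neither is flagged. With the second, the ids can stay text-based, and looking up the second upload's track id in `statuses` returns a `"duplicate"` record pointing at the first upload instead of nothing.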
Has anyone else run into issues like this? I'd welcome feedback on this and on how it might be handled.